Optimising Your GraphQL Request Waterfalls

Published on

As a technology, GraphQL is starting to gain traction. There are server implementations in a good selection of languages, and client support isn’t far behind. If you want to build your own GraphQL server, there’s plenty of information out there, but if you want to build an efficient one, there’s less guidance. I’ve been using GraphQL since it was a technical candidate, and have made plenty of mistakes along the way, but have been successful in making my GraphQL servers perform roughly as well as bespoke static endpoints.

I use Relay (a React library) on the client-side, and it requires a few conventions in your schema to function correctly. But whether you’re using Relay or not, you’re likely to find its conventions useful to follow. The most obvious one is to think in terms of Nodes, Edges and Connections:

A fictitious example

To demonstrate my approach to optimizing GraphQL servers, I’ll use a simple example of a screen in an app/website that shouldn’t be too far removed from a real-world scenario. Then, I’ll transition step-by-step from a naive implementation to an efficient one. To clarify what we’re aiming for, an optimal GraphQL server is one that returns the data for any arbitrary query as efficiently as possible.

Some other terminology that you should bare in mind before we proceed:

Events app mockup.

This mockup is for an events app — we’re looking at a screen which shows the list of upcoming events for some user (not necessarily the viewer). Each event card includes details about its venue, tags and a list of the viewer’s friends that are attending. We also show whether the viewing user has RSVP’ed for each event. This example has been constructed to represent a believable, real-world use case (it’s similar to real screens I’ve built), and also one that we’d expect to have poor performance if implemented naively.

Example query, and fields that correspond with data requests.

Version 1: Simple but slow

Let’s start with the simplest implementation. I’m assuming you have some knowledge of resolver functions, but essentially when resolving any field you have access to its parent object and its arguments (if any). The default resolver for any field is to look up the field’s own name on the parent — the best resolver is the default resolver. But when you want to get related data, you’ll typically make an asynchronous (non-blocking) request to your backend. We end up with a series of calls that looks like this:

The simplest way to fetch all this data without having to do anything clever in the GraphQL server. The relative durations aren’t important here.

This isn’t great, and in reality it’s likely to be much worse. We’re looking at 20 potentially concurrent requests to getUser alone. If our UI were to go deeper we’d be dealing with hundreds of requests for the overall query. This level of overhead for a single client query is extremely undesirable.

Note: You may be wondering why we don’t have getUpcomingEventIdsForUser return the events themselves rather than just the IDs, and similarly for getViewerFriendIdsAttendingEvent. Surely this would result in far fewer requests? You’d be correct, and you may choose to approach it either way. I split up the concerns because, in my projects, the data for edges often lives in a different service from the node data itself. Additionally, if your data contains a lot of repetition (i.e you see the same event or user multiple times in a single query), splitting things up may perform better. When I’ve had scenarios where I’m certain there’ll be no repetition due to uniqueness (e.g comments for a blog post), I’ve handled it as a single request.

Version 2: Batching up the nodes

We need to start batching up some of these requests: instead of asking for one thing at a time, we can ask for multiple. Most backend stores should allow for some kind of batch lookups; for SQL it might be an ‘IN’ query, whilst for Redis it would be an MGET. We want to be using these capabilities, rather than forcing them to handle numerous concurrent calls for one item at a time.

What might not be clear from the previous timeline, is that each iteration of a level of nested calls can complete at different times. We could be fetching the venue for event 2 before we’ve even finished fetching event 1. We want to avoid this as much as possible, because the more things we keep in sync, the more data requirements we can batch up.

For our nodes (users, events, venues and tags), as long as we have some way to collect their IDs, we can easily switch them over to using batch calls:

Second attempt, notice how many of the calls have been batched together.

Technical interlude

The “how” of batching is going to vary depending on which language and GraphQL implementation you’re using. Some implementations come with native support, whilst others require you to come up with your own solutions or use a library. Here are some of the approaches I’m aware of

Version 3: Batching up the edges

We still have a call to getViewerFriendIdsAttendingEvent for each event, so there’s no reason why we can’t batch up this call as well. The reason people often don’t is that it’s possibly less obvious how you’d batch up the pagination aspect of connections; but, there’s no reason why you can’t. It may not always be possible for your backend to handle it efficiently, but it’s still better to push it down to being an implementation detail. By doing this, we can get rid of the last group of unnecessary calls.

Note: Depending on batch size, we may choose to split them up into more calls, as it’s likely there are upper limits to the efficiencies of batching.

We finish up the batching, now we have no fan-out of requests based on page sizes.

So, are we done now? We’ve minimized the number of calls we’re making to our backend, but there’s one more thing we can do.

Version 4: Looking ahead

The observant amongst you might have noticed that it’s still possible to make this faster. In fact, if you were writing a bespoke API to return this data, it’s likely you’d have already done it.

The final version, we’ve divorced loading logic from field-level resolution, allowing optimal fetching.

Wait, what just happened? Previously, we were making a lot of calls after others had completed, because the GraphQL execution model traverses the tree represented by the query, resolving children only after the parent is ready. But in many situations, we don’t actually need the data from the parent in order to successfully make some of the calls for its descendants. In this particular case, we don’t need the user before requesting their events — the only required data is the user ID, which we already have. Likewise, we don’t need to wait for the events to load before requesting their attendees — e have the required list of event IDs. The end result is a response time that is a significant improvement on what we had previously.

The lesson here, is that whilst GraphQL lends itself to loading data at the point a field is being resolved, this isn’t suitable for an optimized chain of requests. But in all GraphQL implementations I’ve looked into (graphql-js, Absinthe and Sangria) it’s possible to look down the query AST for descendant fields and know what data is going to be needed, and therefore request it early.

One caveat that should be considered is that if some of the fields don’t have value (maybe an event or user doesn’t exist), you may end up never using some of the data you’ve requested. If your miss rate is high (maybe your website has an unusually high 404 rate), this might be inefficient. But I expect that for most scenarios this won’t be a major issue.

This has been a high-level look into optimizing GraphQL from the perspective of data requests. I like to work from this perspectivea as it gives good insight into where the big performance wins are — if you can easily visualize the critical chains of requests, it’s easier to focus your efforts. For example, we can look at our final solution and see that we’d get the most value from improving the response time of getUpcomingEventsForUser and getViewerFriendsIdsAttendingEvent. In our original solution, we might have prioritized improving getUser instead.