April 20, 2021
We have been running GraphQL at scale for several years. Highlighting and resolving slow queries is important to the health of the overall system. Early in the life of the application we implemented a slow-query logger plugin that logs any query taking longer than a configurable threshold. It also lets developers override the default threshold per operation, which is very useful for mutations that call third parties and are expected to be slow.
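As a rough sketch of the idea, assuming an Apollo Server style plugin API, a slow-query logger can time each request and compare it against a default or per-operation threshold. The plugin name, threshold values and override map below are illustrative, not our exact implementation:

```typescript
import { ApolloServerPlugin } from 'apollo-server-plugin-base';

const DEFAULT_THRESHOLD_MS = 500; // illustrative default threshold

// overrides maps an operation name to a higher threshold,
// e.g. for mutations that call slow third parties.
export const slowQueryLogger = (overrides: Record<string, number> = {}): ApolloServerPlugin => ({
  requestDidStart() {
    const startedAt = Date.now();
    return {
      willSendResponse({ request }) {
        const elapsedMs = Date.now() - startedAt;
        const name = request.operationName ?? 'anonymous';
        const thresholdMs = overrides[name] ?? DEFAULT_THRESHOLD_MS;
        if (elapsedMs > thresholdMs) {
          console.warn(`Slow GraphQL operation ${name}: ${elapsedMs}ms (threshold ${thresholdMs}ms)`);
        }
      },
    };
  },
});
```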
We utilise Jaeger for tracing calls between applications for a specific request. This gives great visibility into where the time was spent on a slow request. One issue we had with GraphQL was that it was hard to see in Jaeger which query was being invoked, since every operation hits the same endpoint. To deal with this we updated the httpLink in the clients to pass the operation name as a query parameter on the GraphQL request.
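A minimal sketch of that client change, assuming Apollo Client's HttpLink, where the URI can be a function of the operation (the /graphql path and the op parameter name are just examples):

```typescript
import { HttpLink } from '@apollo/client';

// Appending the operation name to the URL means Jaeger spans and access
// logs show which GraphQL query or mutation a request belongs to.
const httpLink = new HttpLink({
  uri: (operation) => `/graphql?op=${encodeURIComponent(operation.operationName ?? 'unknown')}`,
});
```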
We also enabled GraphQL tracing in non-production environments. This returns tracing information in the response showing which resolver was slow within a query, which then allows us to look into those slow resolvers and fix the cause.
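For example, with Apollo Server 2 this is a single flag on the server; this sketch assumes NODE_ENV distinguishes environments and that typeDefs and resolvers are defined elsewhere:

```typescript
import { ApolloServer } from 'apollo-server';
import { typeDefs, resolvers } from './schema'; // assumed to exist elsewhere

const server = new ApolloServer({
  typeDefs,
  resolvers,
  // Adds per-resolver timing data to the response extensions; kept off in
  // production to avoid the extra overhead and larger payloads.
  tracing: process.env.NODE_ENV !== 'production',
});
```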
One resolution for slow calls was to implement caching for external connectors. Generally, we use a centralised cache, such as Redis, to help speed up repeatable calls. We added metrics to the REST client we use to call other systems, which helped us find calls that could be cached: repeatable, high-volume calls that are slow, or that have periods of slow responses. To ensure these caches are performing, whenever we add a cache we also add metrics to track its hit rate. If the hit rate is too low we remove the cache and look for other ways to improve the response times.
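A sketch of that caching pattern, assuming a Redis client (ioredis) and Prometheus-style counters from prom-client; the metric names, TTL and the cachedGet helper are illustrative rather than our actual code:

```typescript
import Redis from 'ioredis';
import { Counter } from 'prom-client';

const redis = new Redis();

// Hit rate = hits / (hits + misses); if it stays too low, the cache gets removed.
const cacheHits = new Counter({ name: 'rest_cache_hits_total', help: 'Cache hits', labelNames: ['resource'] });
const cacheMisses = new Counter({ name: 'rest_cache_misses_total', help: 'Cache misses', labelNames: ['resource'] });

async function cachedGet<T>(resource: string, key: string, ttlSeconds: number, fetcher: () => Promise<T>): Promise<T> {
  const cacheKey = `${resource}:${key}`;

  const cached = await redis.get(cacheKey);
  if (cached !== null) {
    cacheHits.inc({ resource });
    return JSON.parse(cached) as T;
  }

  cacheMisses.inc({ resource });
  const fresh = await fetcher();
  await redis.setex(cacheKey, ttlSeconds, JSON.stringify(fresh));
  return fresh;
}
```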
Identifying slow queries is very important to ensure the system can scale. Utilising tracing can help focus on the cause of a slow query. It is also important, when caches are applied to a system, that the hit rate for these caches is monitored and they are tuned for optimal performance.
Follow me on Twitter @andyianriley
or find andyianriley on LinkedIn.