Improving performance with HTTP caching

Overview

The cantabular-server service is optimised to generate and serve responses to queries as efficiently as possible. To further improve overall performance and availability, an HTTP caching server may be deployed in front of it. This approach is often particularly beneficial for the traffic patterns typically generated by users who incrementally build and modify complex queries.

For more information on HTTP caching in general, see https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching.

This document demonstrates how HTTP caching may be configured for cantabular-server using Nginx, a popular open source reverse proxy and caching server. For more information on Nginx, see https://www.nginx.com/.

Alternative reverse proxies with varying levels of caching support are available, including Varnish, HAProxy and Traefik.

Configuring the service to support caching

By default, cantabular-server will send HTTP headers allowing eligible responses to be cached by downstream servers such as Nginx.

For example, the following query executed against the “Example” dataset:

GET /v8/query/Example?v=city

should result in a response including the following HTTP headers:

Cache-Control: must-revalidate, stale-while-revalidate=3600, s-maxage=5
ETag: "LCa0a2j/xo/5m0U8HTBBNBNCLXBkg7+g+YpeiGJm5644LjM3"

The Cache-Control header value may be read by downstream caching servers as a signal to enable caching.
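
To illustrate how a downstream cache might interpret this header, the following is a minimal Python sketch that splits a Cache-Control value into its directives and checks whether a shared cache is allowed to store the response. The function names are hypothetical and the parsing is deliberately simplified; it is not how Nginx or cantabular-server implement this. The example value uses the standard `s-maxage` directive name.

```python
from __future__ import annotations


def parse_cache_control(value: str) -> dict[str, int | None]:
    """Split a Cache-Control value into a directive -> numeric argument mapping.

    Directives without a numeric argument map to None. This is a simplified
    parser for illustration; it does not handle quoted strings with commas.
    """
    directives: dict[str, int | None] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        arg = arg.strip().strip('"')
        directives[name.strip().lower()] = int(arg) if arg.isdigit() else None
    return directives


def shared_cache_may_store(value: str) -> bool:
    """Return False if the header forbids storage by a shared cache."""
    directives = parse_cache_control(value)
    return "no-store" not in directives and "private" not in directives


example = "must-revalidate, stale-while-revalidate=3600, s-maxage=5"
parsed = parse_cache_control(example)
print(parsed["s-maxage"])                  # freshness lifetime for shared caches, in seconds
print(shared_cache_may_store(example))     # storage is permitted
print(shared_cache_may_store("no-store"))  # storage is forbidden
```

Here `s-maxage` gives the number of seconds a shared cache may serve the response without contacting the back-end server, while `no-store` forbids storing the response at all.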

The ETag header value is also used by downstream caching servers. If the dataset is updated or the software version changes then the ETag value will change, and downstream caches will be invalidated.

The actual values of these headers in real-world deployments are likely to differ from the example shown above.

The default configuration allows responses to be served from the cache for up to five seconds before being revalidated against the back-end server. Generally, revalidation is relatively cheap compared to performing the full query again. This configuration is recommended to reduce the risk of stale data being served to clients, and to ensure maximum transparency and maintainability of the system.
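
The reason revalidation is cheap can be sketched with a small Python model of a conditional GET. It assumes standard HTTP semantics (an `If-None-Match` request header compared against the current ETag) and uses hypothetical names throughout; it is an illustration only, not the cantabular-server implementation.

```python
from __future__ import annotations

from typing import Callable


def handle_conditional_get(
    if_none_match: str | None,
    current_etag: str,
    run_query: Callable[[], bytes],
) -> tuple[int, bytes | None]:
    """Return (status, body) for a GET that may carry an If-None-Match value.

    Simplification: a real If-None-Match header may list several ETags or
    use weak validators; here a single exact string comparison is used.
    """
    if if_none_match == current_etag:
        return 304, None       # cached copy is still valid; the query is not run
    return 200, run_query()    # no match: run the query and send a full response


calls = []


def expensive_query() -> bytes:
    """Stand-in for a full query; records each time it is executed."""
    calls.append(1)
    return b'{"rows": []}'


etag = '"abc123"'  # hypothetical ETag value
first = handle_conditional_get(None, etag, expensive_query)        # cold cache
second = handle_conditional_get(etag, etag, expensive_query)       # revalidation
third = handle_conditional_get(etag, '"def456"', expensive_query)  # ETag has changed
print(first[0], second[0], third[0], len(calls))
```

In this model the second request is answered with a 304 status and no body, so the query runs only twice across the three requests.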

The environment variable CANTABULAR_API_HTTP_CACHING_MAX_ROWS can be used to set an upper limit on the size of a query output that can be cached. If a query output has more rows than the provided limit, then an ETag header will not be set.

Note that there is currently a hard-coded upper limit of 50,000 rows that takes precedence over any limit defined in the CANTABULAR_API_HTTP_CACHING_MAX_ROWS environment variable.
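
One way to read the interaction between the two limits is that the effective limit is the smaller of the configured value and the hard-coded ceiling. The Python sketch below encodes that interpretation. The function names are hypothetical, and the behaviour when the variable is unset (falling back to the hard-coded limit) is an assumption rather than something stated in this document.

```python
from __future__ import annotations

import os
from typing import Mapping

HARD_LIMIT_ROWS = 50_000  # hard-coded ceiling described in this document
ENV_VAR = "CANTABULAR_API_HTTP_CACHING_MAX_ROWS"


def effective_max_rows(env: Mapping[str, str] = os.environ) -> int:
    """Return the row limit that applies after the hard-coded ceiling."""
    raw = env.get(ENV_VAR)
    if raw is None:
        # Assumption: with no configured limit, only the ceiling applies.
        return HARD_LIMIT_ROWS
    return min(int(raw), HARD_LIMIT_ROWS)


def etag_expected(row_count: int, env: Mapping[str, str] = os.environ) -> bool:
    """Return True if an output of this size should carry an ETag header."""
    return row_count <= effective_max_rows(env)


print(effective_max_rows({ENV_VAR: "1000"}))    # configured limit applies
print(effective_max_rows({ENV_VAR: "200000"}))  # ceiling takes precedence
print(etag_expected(1001, {ENV_VAR: "1000"}))   # output exceeds the limit
```

Under this reading, setting the variable above 50,000 has no additional effect.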

Disabling caching

It is possible to disable cache-related response headers by setting the following environment variable when launching cantabular-server:

CANTABULAR_API_HTTP_CACHING_OFF=1

The same request will then result in the following HTTP response header being sent:

Cache-Control: no-store

This should result in caching being disabled in any downstream servers.

Configuring Nginx

Below is a simple example of an Nginx configuration that has been tested successfully with cantabular-server.

Note that it is not intended as a complete, production-ready configuration. Please refer to the Nginx documentation for all of its configuration options.

user nobody;

error_log /dev/stdout info;

events {}

http {
  access_log /dev/stdout;

  # Define proxy cache and set storage location.
  # Cached data will be stored in `conf/proxy_cache`.
  proxy_cache_path conf/proxy_cache levels=1:2 keys_zone=core:10M;

  server {
    listen 8493;

    # Increase Nginx's maximum client header buffer.
    # This ensures that GET request URIs of up to 16kB will be allowed, allowing users
    # to construct complex queries up to this length. Requests containing longer URIs
    # will return an error with 414 status code.
    #
    # This setting should be updated according to your users' requirements:
    large_client_header_buffers 8 16k;

    location / {
      # Enable proxy cache for this location:
      proxy_cache core;

      # Set connection timeout:
      proxy_connect_timeout 3s;

      # Populate standard HTTP headers:
      proxy_set_header        Host            $host;
      proxy_set_header        X-Real-IP       $remote_addr;
      proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;

      # Enable revalidation of expired cache entries using conditional requests
      # (If-Modified-Since and If-None-Match). This lets Nginx revalidate against
      # the ETag sent by cantabular-server instead of fetching the full response again:
      proxy_cache_revalidate on;

      # Proxy requests to this address, where `cantabular-server` should be listening:
      proxy_pass http://localhost:8491;
    }
  }
}
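
The 16 kB limit set by `large_client_header_buffers` in the configuration above can be checked on the client side before a long query is sent. The following Python sketch is an approximate check only: it assumes the HTTP/1.1 request line must fit within a single buffer, and the helper names are hypothetical. The exact boundary should be confirmed against a running Nginx instance.

```python
BUFFER_SIZE = 16 * 1024  # bytes; corresponds to `large_client_header_buffers 8 16k`


def request_line_length(path_and_query: str, method: str = "GET") -> int:
    """Length in bytes of the HTTP/1.1 request line, excluding the final CRLF."""
    return len(f"{method} {path_and_query} HTTP/1.1".encode("utf-8"))


def fits_in_buffer(path_and_query: str, buffer_size: int = BUFFER_SIZE) -> bool:
    """Approximate check that a request line fits within one header buffer."""
    return request_line_length(path_and_query) <= buffer_size


short_query = "/v8/query/Example?v=city"
long_query = "/v8/query/Example?v=" + "x" * 20000  # artificially long query string
print(fits_in_buffer(short_query))
print(fits_in_buffer(long_query))
```

If queries built by users routinely approach the limit, the buffer size in the Nginx configuration can be raised accordingly.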