Rate-Limiting Your Serverless Endpoints Without A Database

By Tomi Chen June 1, 2021

Serverless is all the rage in modern web development, since it’s highly scalable, low cost, and distributed. Frontend developers are also empowered to create fully-featured websites without needing to manage backend infrastructure. One component is serverless functions: short-lived function instances that spin up when a request is made and shut down when they’re not needed. This is great for costs since you don’t need to pay for something that’s not running, but it also means that memory is cleared and each request starts fresh, so any rate-limiting information needs to be persisted to a database. Right?

NOTE: The methods presented below are not intended for any actual use for rate-limiting infrastructure and should be viewed as interesting and educational. For more information, check out the “Conclusion” section at the end of this post.

Background

Over the weekend, I built iscancelled.com, a fun website where you can cancel various things. The URL would end up being *.iscancelled.com. For grammatical accuracy, I also got *.arecancelled.com, just in case plurals are needed. To store the number of cancels each item has, I used Upstash, a hosted Redis service. Since the free tier only has 10,000 requests per day, I thought it might be interesting to try rate-limiting certain endpoints to prevent excessive spam. For more information about this project, see the GitHub repo.

Since the goal was to cut down on database operations, storing rate-limiting information in that same database would defeat the purpose. And because I assumed each endpoint would spin up on every request and clean up afterwards, I thought I was out of luck.

That was when I stumbled upon this API route rate-limiting example in the Next.js repo. This surprised me. There was no external database, but it still worked. What was going on?

Note: I used Vercel for this project, so the following information was only tested on Vercel, though I suspect it may work on other platforms too.

Exploring the Code

The code for rate-limiting is in utils/rate-limit.js. Here, they use a least-recently-used (LRU) cache to store rate-limiting information. An LRU cache is basically a simple key-value store that evicts the least recently used key once it hits a max size. When a key is read or set, its “recency” is updated. This implementation also uses a maxAge: whenever a key is read, its age is checked, and the key is deleted if it is too old.

const LRU = require('lru-cache')

const rateLimit = (options) => {
  const tokenCache = new LRU({
    max: parseInt(options.uniqueTokenPerInterval || 500, 10),
    maxAge: parseInt(options.interval || 60000, 10)
  })

  return {
    check: (res, limit, token) =>
      new Promise((resolve, reject) => {
        const tokenCount = tokenCache.get(token) || [0]
        if (tokenCount[0] === 0) {
          tokenCache.set(token, tokenCount)
        }
        tokenCount[0] += 1

        const currentUsage = tokenCount[0]
        const isRateLimited = currentUsage >= parseInt(limit, 10)
        res.setHeader('X-RateLimit-Limit', limit)
        res.setHeader(
          'X-RateLimit-Remaining',
          isRateLimited ? 0 : limit - currentUsage
        )

        return isRateLimited ? reject() : resolve()
      })
  }
}

export default rateLimit

Here, the check function first checks if the cache contains the token (which identifies a user). If it doesn’t, it defaults to an array containing a single zero. Next, it checks if the value in the array is a zero. This would mean that the token doesn’t exist, since if it did, there should be at least a value of 1 stored in the cache. If the token didn’t exist, it now sets the cache to the array referenced by tokenCount. Finally, it increments the first element in tokenCount by one.
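If you haven’t used lru-cache before, the two behaviours this code relies on (eviction at a max size, and expiry that is checked when a key is read) can be sketched in a few lines. TinyLRU below is a hypothetical stand-in written for this post, not the library’s actual implementation, and the injectable now clock is an assumption I added so the demo is deterministic.

```javascript
// Minimal LRU cache with a per-entry maxAge, for illustration only.
// A Map iterates in insertion order, so its first key is the least recently used.
class TinyLRU {
  constructor({ max, maxAge, now = Date.now }) {
    this.max = max
    this.maxAge = maxAge
    this.now = now // injectable clock (assumption, keeps the demo deterministic)
    this.entries = new Map()
  }

  get(key) {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    // Expiry is checked lazily, at read time.
    if (this.now() - entry.storedAt > this.maxAge) {
      this.entries.delete(key)
      return undefined
    }
    // Re-insert the entry to mark the key as most recently used.
    this.entries.delete(key)
    this.entries.set(key, entry)
    return entry.value
  }

  set(key, value) {
    this.entries.delete(key)
    // Storing a value stamps it with the current time, restarting its age.
    this.entries.set(key, { value, storedAt: this.now() })
    if (this.entries.size > this.max) {
      // Over capacity: evict the least recently used key.
      const oldest = this.entries.keys().next().value
      this.entries.delete(oldest)
    }
  }
}

let clock = 0
const cache = new TinyLRU({ max: 2, maxAge: 1000, now: () => clock })
cache.set('a', 1)
cache.set('b', 2)
cache.get('a') // 'a' is now the most recently used key
cache.set('c', 3) // over capacity, so 'b' is evicted
console.log(cache.get('b')) // undefined
clock = 1500
console.log(cache.get('a')) // undefined (older than maxAge)
```

Note that set() restarts a key’s age while get() does not, which matters for the array trick discussed next.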

You may be wondering why the count is being wrapped in an array, especially when only one element is being used. I was wondering the same, so I ✨ thought about it ✨. Initially, I thought it was a way to avoid updating the “recency” of the cache, but then I realized that the “LRU” part of the LRU cache wasn’t even being used! Ideally, no keys will be evicted from maxing out capacity, since that would mean rate limits being cut short.

After trying it out without the array wrapping, I discovered that my initial reaction wasn’t too far off. However, it was to prevent the maxAge timer from being reset, not the recency of the key. This works because arrays in JavaScript are objects, and variables only hold references to them, so tokenCount and the cache entry point at the same array. When the inner number is updated with tokenCount[0] += 1, the value in the cache is also updated, without triggering the maxAge reset that you’d get with cache.set(). Primitives in JavaScript, such as numbers, are copied by value, so this doesn’t work without the array.

You can explore this yourself!

// x and y are primitives: the function only swaps its own local copies
function swapPrimitives(x, y) {
  const tmp = x
  x = y
  y = tmp
}

let x = 1
let y = 2
swapPrimitives(x, y)
console.log(x, y) // 1 2

// a and b are arrays: the function mutates the same arrays the caller holds
function swapArrays(a, b) {
  const tmp = a[0]
  a[0] = b[0]
  b[0] = tmp
}

const a = [1]
const b = [2]
swapArrays(a, b)
console.log(a, b) // [ 2 ] [ 1 ]
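The maxAge behaviour can be reproduced locally too. The sketch below compares storing a plain number (which needs a set() on every request) with storing an array once and mutating it in place. The cacheSet and cacheGet helpers are a hypothetical stand-in for lru-cache, the manual now clock replaces real time, and the 40-second request spacing is just a value I picked to make the difference visible.

```javascript
// Hypothetical stand-in for lru-cache: set() stamps the entry, get() expires it lazily.
let now = 0
const store = new Map()
const cacheSet = (key, value) => store.set(key, { value, storedAt: now })
const cacheGet = (key, maxAge) => {
  const entry = store.get(key)
  if (!entry || now - entry.storedAt > maxAge) {
    store.delete(key)
    return undefined
  }
  return entry.value
}

const MAX_AGE = 60000 // one-minute window, as in the example's default interval

// Variant 1: store a plain number, so every hit has to call set() again.
function hitWithNumber(token) {
  const count = (cacheGet(token, MAX_AGE) || 0) + 1
  cacheSet(token, count) // restarts the window on every request
  return count
}

// Variant 2: store an array once, then mutate it in place.
function hitWithArray(token) {
  const counter = cacheGet(token, MAX_AGE) || [0]
  if (counter[0] === 0) cacheSet(token, counter) // only the first hit stamps the entry
  counter[0] += 1
  return counter[0]
}

// One request every 40 seconds for each variant.
const numberCounts = []
const arrayCounts = []
for (now = 0; now <= 120000; now += 40000) {
  numberCounts.push(hitWithNumber('number-user'))
  arrayCounts.push(hitWithArray('array-user'))
}
console.log(numberCounts) // [ 1, 2, 3, 4 ]
console.log(arrayCounts) // [ 1, 2, 1, 2 ]
```

With the plain number, the window never closes as long as requests keep arriving less than a minute apart, so the count climbs forever. With the array, the window is anchored to the first request and the count starts over once a minute has passed.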

Exploring the Platform

Now that we understand what’s going on in the code, we can move on to the memory persistence after deployment. I wrote up a quick SvelteKit endpoint to see what was going on.

import type { RequestHandler } from '@sveltejs/kit'

const buffer = []

export const get: RequestHandler = async () => {
  buffer.push({ time: new Date().toString() })
  return {
    body: buffer
  }
}

This initializes an array, appends the current time to it on every request, and returns the array. After deploying, we can reload the page and watch the array grow bigger and bigger. However, after leaving it idle for about 15 minutes or so, the array would clear itself and start over. Interesting!

This hinted to me that this memory persistence might have to do with cold starts and warm endpoints.

When a serverless function endpoint is first requested, there is a short delay of a few hundred milliseconds to boot up the function. This is referred to as the “cold-start” time. Since having this delay impacts performance significantly, platforms do not destroy functions immediately. Instead, they keep them “warm”, so if another request comes within ~5-25 minutes, the cold-start time is not an issue. Module-level memory and state survive between requests that are handled by the same warm instance, so that’s why this persistence occurs. (Cloudflare Workers does things a little differently, booting up the worker during the TLS handshake, which lets them boast a 0ms cold-start time.)
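Here’s a rough local simulation of that warm/cold behaviour. Nothing in it is a platform API: createInstance, invoke, and the 15-minute IDLE_TIMEOUT_MS are all assumptions made up for illustration (the timeout loosely matches what I observed, and real platforms don’t promise a fixed value).

```javascript
// Simulates a platform that keeps one instance warm and recycles it after an idle timeout.
const IDLE_TIMEOUT_MS = 15 * 60 * 1000 // assumed value, for illustration only

// Everything inside createInstance plays the role of module-level state.
function createInstance() {
  const buffer = []
  return (time) => {
    buffer.push(time)
    return buffer.length
  }
}

let instance = null
let lastRequestAt = -Infinity

function invoke(time) {
  const isColdStart = instance === null || time - lastRequestAt > IDLE_TIMEOUT_MS
  if (isColdStart) instance = createInstance() // fresh memory, old buffer is gone
  lastRequestAt = time
  return { isColdStart, bufferLength: instance(time) }
}

const minute = 60 * 1000
console.log(invoke(0)) // { isColdStart: true, bufferLength: 1 }
console.log(invoke(1 * minute)) // { isColdStart: false, bufferLength: 2 }
console.log(invoke(5 * minute)) // { isColdStart: false, bufferLength: 3 }
console.log(invoke(30 * minute)) // { isColdStart: true, bufferLength: 1 }
```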

Conclusion

We’ve seen how it is possible to build a simple rate limit in a serverless function, without the need for an external persisted database. Since this method does rely on the function staying alive, it’s not 100% foolproof, especially for longer durations. Keeping the maxAge below 5 minutes or so would probably be a better idea, since if the function is killed for inactivity, there can’t have been many requests anyway. There might also be multiple function instances active during periods of higher load, each with its own cache, so the limit won’t be enforced consistently there. For a simple, low-stakes, low-traffic site, this works fine, but you definitely should reconsider for critical or higher-traffic use cases.

UPDATE: It has been brought to my attention that if someone is flooding your endpoint with requests, new functions are spun up very quickly (which is exactly what these are designed to do). As a result, any volume of more than a couple of requests per second will defeat this rate-limiting method. Please consider this as something interesting and educational, NOT something you should be doing in production applications.
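To see why extra instances defeat the limit, here’s a small simulation. It assumes requests are spread round-robin across instances, which is a simplification of however the platform really routes them. createLimiter and countAllowed are hypothetical helpers, the limiter is a bare counter with no time window, and the IP is a placeholder from a documentation range.

```javascript
// Each simulated instance has its own in-memory counter, like separate warm functions.
function createLimiter(limit) {
  const counts = new Map()
  return (token) => {
    const used = (counts.get(token) || 0) + 1
    counts.set(token, used)
    return used <= limit // true means the request is allowed
  }
}

// Sends `requests` requests from one client, spread round-robin over the instances.
function countAllowed(instanceCount, limit, requests) {
  const instances = Array.from({ length: instanceCount }, () => createLimiter(limit))
  let allowed = 0
  for (let i = 0; i < requests; i++) {
    if (instances[i % instanceCount]('203.0.113.7')) allowed += 1
  }
  return allowed
}

console.log(countAllowed(1, 10, 100)) // 10
console.log(countAllowed(5, 10, 100)) // 50
```

In this model the effective limit is the configured limit multiplied by the number of live instances, and a flood of requests is exactly what makes the platform add instances.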

I’ve also only tried this on Vercel (which I believe uses AWS Lambda under the hood), but I think it should work on other platforms too (at least the ones that use AWS). Try it and see!

Finally, here is the code for rate-limiting used on iscancelled.com. I’m also identifying users by IP, exposed through Vercel’s x-real-ip header.
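As a rough sketch of the IP-based identification (not the actual iscancelled.com code), a helper like the one below could turn request headers into a rate-limit token. getClientToken is a hypothetical name, and both the x-forwarded-for fallback and the shared 'unknown' bucket are my own assumptions rather than something the project is known to do.

```javascript
// Picks a rate-limit token from request headers.
// Header names are lowercase, as they are in Node's req.headers.
function getClientToken(headers) {
  const realIp = headers['x-real-ip']
  if (typeof realIp === 'string' && realIp.length > 0) return realIp

  // Assumed fallback: the first address listed in x-forwarded-for, if present.
  const forwarded = headers['x-forwarded-for']
  if (typeof forwarded === 'string' && forwarded.length > 0) {
    return forwarded.split(',')[0].trim()
  }

  return 'unknown' // all unidentified clients share a single bucket
}

console.log(getClientToken({ 'x-real-ip': '203.0.113.7' })) // 203.0.113.7
console.log(getClientToken({ 'x-forwarded-for': '198.51.100.4, 10.0.0.1' })) // 198.51.100.4
console.log(getClientToken({})) // unknown
```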
