The Fundamentals of Managing API Rate Limits: Developers Best Practices

Today, we're diving headfirst into the fascinating world of API rate limits. Whether you're battling the elusive 429 error or crafting your own rate limit strategies, remember that the API universe is vast, and each encounter enriches your journey.

Lunar.dev Team

Rate Limits

Getting Started

Today, we're diving headfirst into the fascinating world of API rate limits. You can think of them as traffic cops working the API beat: always on duty and dedicated to maintaining order, enforcing certain logic and rules, and keeping accidents to a minimum.

Fasten your seatbelts, because we're about to embark on a technical journey through the intricacies of third-party API rate limits.

So What Are We Limiting?

Rate limits dictate the maximum number of requests a client (that's you or your app) can make within a specific timeframe. These limits are not arbitrary; they are carefully crafted rules set by the API provider. Providers use rate limiting to control how quickly requests reach their API, and every API sets its limits differently.

We are limiting API requests. Limits are placed on the number of API requests that each user can make during a set period, or concurrently, using one of several algorithmic methods. How these limits are set varies between providers because they offer different services.

For instance, we spoke to a banking app provider who limits the number of transactions each user can request within a given timeframe. With the ChatGPT API, which deals with words and images, rate limits are measured in requests per minute, requests per day, tokens per minute, tokens per day, and images per minute. In this context, a token is a chunk of text, often a word or a word fragment and roughly four characters of English on average; spaces and punctuation count too. Usage against any of these measures counts towards the rate limit.
Another example: with an API like GPT-4, each request can be resource-intensive and costly. At the time of writing, 1,000 prompt tokens cost $0.03 for the 8k context and $0.06 for the 32k context, while 1,000 completion tokens cost $0.06 and $0.12 respectively, hence the need to limit such resources.
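To make the pricing above concrete, here is a small sketch of the arithmetic. The prices are the ones quoted in this article and may have changed since; the function name and the token counts in the example are made up for illustration.

```python
# Prices per 1,000 tokens in USD, as quoted in the article (may be outdated).
PRICES_PER_1K = {
    "8k": {"prompt": 0.03, "completion": 0.06},
    "32k": {"prompt": 0.06, "completion": 0.12},
}


def request_cost(context: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of a single request for the given context size."""
    price = PRICES_PER_1K[context]
    total = prompt_tokens * price["prompt"] + completion_tokens * price["completion"]
    return total / 1000


# Hypothetical request: 1,500 prompt tokens and 500 completion tokens, 8k context.
cost = request_cost("8k", prompt_tokens=1500, completion_tokens=500)
print(f"${cost:.3f}")  # about $0.075
```

At that price, a thousand such requests a day would come to roughly $75, which is why both the provider and the consumer have a reason to cap usage.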
In some cases, rate limits become a means to control the rate of requests and prevent overloading of the API’s servers. Some companies prioritize conserving CPU resources, and they achieve this by implementing lower rate limits.

Requests Per Minute (RPM):

RPM is a common approach to rate limiting.
It establishes a strict limit on the number of requests a user or client can make within a one-minute timeframe.
For example, if an API enforces a rate limit of 1000 RPM, a user can fire off up to 1000 requests in a single minute.
RPM offers finer control over request rates by focusing on precise time intervals.

Concurrent Connections:

The concurrent connections method is a different breed of rate limiting that restricts the number of simultaneous connections a user or client can establish with the API server.
Unlike RPM, which deals with requests over time, the concurrent connections limit governs the active connections allowed at any given moment.
For instance, if an API sets a concurrent connection limit of five, a user can maintain up to five connections concurrently.
This approach has proved invaluable in scenarios where managing server connection loads is critical.
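On the client side, one way to stay under a concurrency limit is a semaphore. The sketch below assumes the limit of five from the example above; `call_api` is a hypothetical stand-in for a real request, and the `stats` dictionary exists only to show that the cap holds.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 5  # the concurrent connection limit from the example above
slots = threading.BoundedSemaphore(MAX_CONCURRENT)

# Bookkeeping used only to observe how many calls were in flight at once.
stats = {"active": 0, "peak": 0}
stats_lock = threading.Lock()


def call_api(request_id: int) -> int:
    """Hypothetical API call; the sleep stands in for network time."""
    with slots:  # blocks while five calls are already in flight
        with stats_lock:
            stats["active"] += 1
            stats["peak"] = max(stats["peak"], stats["active"])
        time.sleep(0.01)
        with stats_lock:
            stats["active"] -= 1
    return request_id


# Twenty worker threads compete, but at most five calls run at the same time.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(call_api, range(40)))

print(stats["peak"])  # never more than MAX_CONCURRENT
```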



How Are We Limiting Request Rates?

As previously mentioned, with algorithms. Here are the most common examples:

Leaky Bucket Algorithm:

The Leaky Bucket algorithm is a simple rate-limiting algorithm that enforces a constant output rate. It is represented by a "bucket" with a fixed capacity. Incoming requests are added to the bucket and "leak" out of it, meaning they are processed, at a constant rate. If the bucket has room when a request arrives, the request is queued and waits its turn. If the bucket is full, the request is dropped. Bursts are therefore smoothed into a steady stream rather than passed through.

Key features:

Constant rate of output.
Requests are queued while the bucket has room and dropped when it is full.
Simple and straightforward.
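Here is a minimal sketch of the leaky bucket in its common queue-based form: requests fill a bounded queue, and the queue drains at a constant rate. The class and method names are made up for this example, and time is passed in explicitly (in seconds) so the behavior is easy to follow; a real client would read a monotonic clock instead.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue that releases at most one request per leak interval."""

    def __init__(self, capacity: int, leak_rate_per_sec: float):
        self.capacity = capacity
        self.interval = 1.0 / leak_rate_per_sec
        self.queue = deque()
        self.next_release = 0.0  # earliest time the next request may leave

    def offer(self, request) -> bool:
        """Queue a request; return False (dropped) if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def poll(self, now: float):
        """Return the next request if one is due at time `now`, else None."""
        if self.queue and now >= self.next_release:
            self.next_release = now + self.interval
            return self.queue.popleft()
        return None


bucket = LeakyBucket(capacity=3, leak_rate_per_sec=2.0)  # one request every 0.5 s
accepted = [bucket.offer(r) for r in ["a", "b", "c", "d"]]  # "d" overflows
released = [bucket.poll(t) for t in (0.0, 0.2, 0.5, 1.0, 1.5)]
print(accepted)  # [True, True, True, False]
print(released)  # ['a', None, 'b', 'c', None]
```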

Token Bucket Algorithm:

The Token Bucket algorithm is a rate-limiting algorithm that allows bursts of traffic while still maintaining an average rate limit. Similar to the Leaky Bucket, it also uses a bucket metaphor, but here the bucket holds tokens, which are added at a fixed rate until the bucket reaches its capacity. Each request consumes one token from the bucket. If there are not enough tokens, the request is delayed or discarded. Unlike the Leaky Bucket, the Token Bucket allows for bursts of traffic as long as the bucket has tokens to spare.

Key features:

Average rate limiting with the ability to handle bursts.
Requests are delayed or discarded if tokens are insufficient.
More flexible than the Leaky Bucket for accommodating variable traffic patterns.
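A token bucket can be sketched in a few lines. This version assumes tokens refill at a fixed rate up to the bucket's capacity and that the bucket starts full; the names are illustrative, and time is passed in explicitly (in seconds) to keep the example deterministic.

```python
class TokenBucket:
    """Allows bursts up to `capacity` while enforcing an average refill rate."""

    def __init__(self, capacity: float, refill_rate_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate_per_sec
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated_at = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=3, refill_rate_per_sec=1.0)
burst = [bucket.allow(now=0.0) for _ in range(4)]  # three tokens, four requests
later = [bucket.allow(now=2.0) for _ in range(3)]  # two tokens refilled after 2 s
print(burst)  # [True, True, True, False]
print(later)  # [True, True, False]
```

The burst of three goes through immediately, which is exactly what a leaky bucket would not allow.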

Fixed Window Rate Limiting:

In this simple rate-limiting approach, you track the number of requests that occur within fixed time windows (e.g., one minute).

If the number of requests exceeds a predefined limit within the window, excess requests are either delayed, discarded, or subject to other actions. This algorithm is easy to implement but handles bursts less gracefully than Token Bucket: because the counter resets at each window boundary, a client can send up to twice the limit in a short span that straddles two windows.
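A fixed window counter might look like the sketch below. The names are made up, and time is passed in as seconds so the window boundaries are easy to see. The second half of the example shows a known quirk of this approach: the counter resets at each boundary, so requests just before and just after a boundary are all accepted.

```python
class FixedWindowLimiter:
    """Counts requests per fixed window and rejects those over the limit."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_index = None
        self.count = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window_seconds)
        if index != self.window_index:  # a new window has started
            self.window_index = index
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


limiter = FixedWindowLimiter(limit=2, window_seconds=60)
decisions = [limiter.allow(t) for t in (0, 10, 59, 60)]
print(decisions)  # [True, True, False, True]

# Boundary effect: four requests within 0.15 s are accepted with a limit of 2.
edge = FixedWindowLimiter(limit=2, window_seconds=60)
edge_decisions = [edge.allow(t) for t in (59.9, 59.95, 60.0, 60.05)]
print(edge_decisions)  # [True, True, True, True]
```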

Use Cases:

Avoiding “Noisy Neighbor” collisions: In situations where a company consumes API providers on behalf of its customers, it becomes essential to establish an active throttling policy. This policy ensures that consumption remains below a certain threshold, preventing any adverse impact on each customer's performance with its API provider.
Implementing an active client-side throttling policy: By proactively defining a consumption restriction policy, the API consumer can safeguard against overloading the API provider's resources, which could otherwise lead to degraded performance for the consumer. This best practice not only prevents excessive API calls but also maintains optimal performance levels for both the API consumer and the API provider.
Quota Management: When multiple services or environments use the same API, it is often necessary to allocate a different quota to each one. This can be achieved by grouping requests by a header value and allocating a different quota to each group. This makes it possible, for example, to allocate a different quota to each environment (e.g. staging, production, etc.) or to each service (e.g. service1, service2, etc.). The header names and values are configurable, allowing for maximum flexibility.

Examples:

In the following example, the plugin will enforce a limit of 100 requests per minute for all requests. If the limit is exceeded, the plugin will return a 429 HTTP status code.



endpoints:
  - url: api.com/resource/{id}
    method: GET
    remedies:
      - name: Strategy Based Throttling
        enabled: true
        config:
          strategy_based_throttling:
            allowed_request_count: 100
            window_size_in_seconds: 60
            response_status_code: 429


In this example, the plugin will enforce a limit of 80 requests per minute for requests containing the X-Group: production header and 20 for X-Group: staging.



endpoints:
  - url: api.com/resource/{id}
    method: GET
    remedies:
      - name: api.com Strategy-Based Throttling
        enabled: true
        config:
          strategy_based_throttling:
            allowed_request_count: 100
            window_size_in_seconds: 60
            response_status_code: 429

          group_quota_allocation:
            group_by:
              header_name: X-Group
            groups:
              - group_header_value: "staging"
                allocation_percentage: 20
              - group_header_value: "production"
                allocation_percentage: 80


NOTE: The percentage is configured independently for each header value, which means you have the flexibility to set one value to 100% while assigning a lower share, such as 20%, to another.
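The arithmetic behind the group allocation can be sketched as follows. This is an illustration of the idea only, not the plugin's actual implementation: the function name is hypothetical, and rounding down is an assumption.

```python
def group_limits(allowed_request_count: int, allocation_percentages: dict) -> dict:
    """Turn per-group percentages into per-group request limits for one window."""
    return {
        group: allowed_request_count * percentage // 100  # assumption: round down
        for group, percentage in allocation_percentages.items()
    }


# Values taken from the configuration above: 100 requests per 60-second window.
limits = group_limits(100, {"staging": 20, "production": 80})
print(limits)  # {'staging': 20, 'production': 80}

# Each percentage stands on its own, so the shares do not have to add up to 100.
independent = group_limits(100, {"service1": 100, "service2": 20})
print(independent)  # {'service1': 100, 'service2': 20}
```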

Common Rate Limit Errors and Management Lessons Learned

Let's shed some light on a few errors that can trip up your rate limit ballet.

1. The Infamous 429 Error:


In the realm of API rate limits, the number 429 holds a special place. It's the status code that screams, "Too many requests!" Issues may arise when providers don’t supply you with a 429. GitHub, for instance, may scold you with a 403 status code when you misbehave instead of showering you with 429s.

But here's a twist – 429s aren't just for hollering at you about excess requests. They also cover cases where you've exceeded your allotted quota. Sometimes these limits are like elastic bands, stretching to accommodate your usage but then snapping when they’re stretched too far.

2. Exceeding the Invisible Quota:

Now picture a world where the API provider forgets to set rate limits for some clients – their elastic band seems to stretch forever…

These clients live in blissful ignorance, unaware of their usage limits until the inevitable day when the provider rings them up and says, "Stop now or we’re blocking you!" It's a scenario that often unfolds when the provider isn't API-first.

In a similar tale of woe, we heard of a client who used up their entire API quota for the production environment during testing. If only they’d had Lunar.dev on the case because our Strategy-Based Throttling would have saved them. It allows clients to set their own limits without invoking the wrath of the provider and suffering the ensuing embarrassment.

3. Headers and Penalties:

Some APIs send headers to tell you when to retry, but if you don't pay attention, you might find yourself getting penalized. Ignoring these retry hints can leave you facing forced back-offs, sometimes exponential ones, and because the hints are often tucked away in headers, many companies overlook them.

Also, keep in mind that not all SDKs and libraries are created equal. Some may solve rate limits for specific use cases, but when you try to apply them to other scenarios, things get tricky. One size doesn't always fit all.
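As an example of reading the retry hints mentioned above, the sketch below parses the standard HTTP `Retry-After` header, which carries either a number of seconds or an HTTP date. Not every provider uses this header (some send their own, so check the documentation), and the function name here is made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(
    headers: Mapping[str, str], now: Optional[datetime] = None
) -> Optional[float]:
    """Return the wait suggested by Retry-After, or None if absent or unparseable."""
    value = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():  # delay expressed in seconds
        return float(value)
    try:  # otherwise it should be an HTTP date
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())


print(retry_after_seconds({"Retry-After": "120"}))  # 120.0
fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(retry_after_seconds({"retry-after": "Wed, 21 Oct 2015 07:28:00 GMT"}, fixed_now))  # 60.0
```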

Tips for Rate Limit Mastery

Inspect Headers: Valuable nuggets of retry information are often hidden in headers. Don't ignore them; they can save you from all sorts of rate limit troubles.

Dropped Requests Queue: Build a queue for the calls that were delayed or rejected and send them later on.

Prioritize API Calls: Deal with those hitting limits first.

Define Thresholds: Set both soft limits and hard limits. Awareness means you can take proactive measures to avoid hitting them.

Documentation is Key: Use it to check the rate limits for each provider in advance. Documentation works like a treasure map: it’s your helpful guide for a smoother API integration journey.

Centralize Your 429 Handling: Wrangle those pesky 429s in one place and you'll be equipped to deal with exponential backoffs. It's like having a toolkit for rate-limit resilience.

Strategy-Based Throttling by Lunar.dev: Consider Lunar.dev's Strategy-Based Throttling. This is a client-side rate limit implementation that empowers you to set your own limits, sparing you from a range of dreaded provider errors.
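Two of the tips above, handling 429s in one place and queueing the calls that didn't make it, can be combined in a small wrapper. This is a sketch under assumptions: `RateLimitGate` and `fake_send` are hypothetical names, and a response is modeled as a plain `(status, body)` tuple rather than a real HTTP client object.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple

Response = Tuple[int, str]  # (status code, body), a stand-in for a real response


class RateLimitGate:
    """Single place where 429 responses are handled and rejected calls are queued."""

    def __init__(self, send: Callable[[str], Response]):
        self.send = send
        self.deferred: Deque[str] = deque()

    def call(self, request: str) -> Optional[Response]:
        status, body = self.send(request)
        if status == 429:
            self.deferred.append(request)  # keep it for a later attempt
            return None
        return status, body

    def flush(self) -> List[Response]:
        """Retry each deferred request once; any that hit 429 again stay queued."""
        results = []
        for _ in range(len(self.deferred)):
            response = self.call(self.deferred.popleft())
            if response is not None:
                results.append(response)
        return results


# Fake provider that accepts only as many calls as the remaining budget allows.
budget = {"remaining": 2}


def fake_send(request: str) -> Response:
    if budget["remaining"] == 0:
        return 429, "Too Many Requests"
    budget["remaining"] -= 1
    return 200, f"ok:{request}"


gate = RateLimitGate(fake_send)
first = [gate.call(r) for r in ["a", "b", "c"]]  # "c" is rejected and queued
budget["remaining"] = 1  # pretend the provider's window has reset
later = gate.flush()
print(first)  # [(200, 'ok:a'), (200, 'ok:b'), None]
print(later)  # [(200, 'ok:c')]
```

In a real client, `flush` would be scheduled for after the provider's suggested wait rather than called immediately.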

Conclusions

In the world of API rate limits, every error is a lesson and every scenario is a challenge. When working with APIs, it's essential to adopt strategies to handle rate limits effectively:

Apply Rate Limiting Mechanisms: Implement rate-limiting mechanisms in your code to ensure compliance with the API's limits. Many API libraries and frameworks offer built-in support for rate limiting.

Backoff and Retry: When you encounter rate-limiting errors, use exponential backoff and retry strategies to wait and then retry the request.

Monitor Usage: Regularly monitor your API usage to understand how close you are to hitting rate limits. This proactive approach can help you manage your application's performance.

Set Thresholds and Alerts: Add alerts to your monitoring so you are informed when usage approaches a limit.
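The backoff-and-retry strategy above can be sketched like this. The base delay, the cap, and the random "full jitter" factor are assumptions chosen for the example, not values from any particular provider, and `send` is a hypothetical callable returning a `(status, body)` tuple.

```python
import random
import time
from typing import Callable, Tuple


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Longest wait before retry number `attempt` (0-based): base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Tuple[int, str]:
    """Call `send` until it stops returning 429 or the attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Wait a random fraction of the delay so clients don't retry in lockstep.
            sleep(backoff_delay(attempt) * rng())
    return status, body


# Demo with a fake provider that rejects the first two calls.
responses = iter([(429, "slow down"), (429, "slow down"), (200, "ok")])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append, rng=lambda: 1.0)
print(result)  # (200, 'ok')
print(waits)   # [0.5, 1.0]
```

Passing `sleep` and `rng` as parameters keeps the demo instant and repeatable; in production the defaults would apply.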


As developers, our mission is to navigate these intricacies with supreme elegance and technical finesse. So, whether you're battling the elusive 429 error or crafting your own rate limit strategies, remember that the API universe is vast, and each encounter enriches your journey.

This holiday season (and in 2024): may your rate limits be balanced, your headers informative, and your API integrations works of art!

Read more about Lunar.dev plugins for Rate Limit issues or test out the OSS.

