How to Reduce 3rd Party API Costs: Part II
While building your own middleware to control soaring costs and maintenance overhead for providers such as OpenAI is tempting, it's no easy feat. Because consumption keeps growing as companies scale, our current recommendation is to employ a dedicated infrastructure service.
In this post we’ll cover: the hidden cloud costs of using OpenAI; consumption-management must-haves such as usage visibility and API consumption controls; and, lastly, optimization techniques such as prompt adaptation, LLM cascade, and caching (a form of LLM approximation).
Hidden Cloud Costs of Using OpenAI
Generative AI is becoming the beating heart of development and R&D this year. Right now the biggest tech companies are collaborating with the biggest Gen AI providers: Microsoft has paired with OpenAI, Amazon announced this September (2023) that it will invest $4 billion in the startup Anthropic, and Google has merged its Brain team with DeepMind. API calls to these Gen AI cloud providers are expensive, so the companies that use them must look for ways to understand their costs in order to control them.
While all these LLMs are similar, they’re also different in terms of the quality of their outputs and the value of their pricing, so if you’re using (or thinking of using) any of them in your production environment, you need to fully understand your API consumption and what it’s costing you.
A Maturing Market
ChatGPT was once regarded as a ‘nice-to-have’: its business model was not all that clear and many people just tinkered with it. Now it has matured from a novelty into a pivotal tool for business. The biggest companies consume ChatGPT (or other LLMs) in large volumes in production, and OpenAI recently announced that an enterprise-grade version of ChatGPT is on its way.
That’s great, but when every business is using the same game-changing technology, the success of one over another will come down to which of them can best control their costs and boost their efficiency.
Let’s take a look at how costs can drastically affect outcomes with a hypothetical example. Say we want Gen AI to summarise all of Wikipedia, preserving the meaning of each of its six million articles while halving their word count from 750 to 375. Each 750-word prompt we feed into a model will cost us roughly 1,000 tokens (the standard rule of thumb is that 750 words equal about 1,000 tokens).
Here’s how the costs would look for three different LLM models:
Curie charges two dollars per million tokens for both prompts and responses, so summarizing Wikipedia will cost us around $18,000. Anthropic will cost roughly $160,000, and the bill for ChatGPT will dwarf them both at $360,000! These kinds of costs are just the tip of the iceberg in terms of data processing in today’s companies. When you consider that large and even medium-sized companies may routinely make even larger API requests, a 20x price difference between two providers makes the need to control spending extremely urgent.
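The arithmetic behind these figures is worth making explicit. Here is a minimal sketch of the cost model, using the article's assumed per-million-token prices (not live quotes) and the 750-words-in / 375-words-out example:

```python
# Back-of-the-envelope cost model for the Wikipedia-summarization example.
# Prices and token counts are the assumptions stated in this article.
ARTICLES = 6_000_000
PROMPT_TOKENS = 1_000    # ~750 words in, per the 750 words ≈ 1,000 tokens rule of thumb
RESPONSE_TOKENS = 500    # ~375 words out

def summarization_cost(price_per_million: float) -> float:
    """Total cost in dollars for prompts plus responses across all articles."""
    total_tokens = ARTICLES * (PROMPT_TOKENS + RESPONSE_TOKENS)
    return total_tokens * price_per_million / 1_000_000

curie_cost = summarization_cost(2.0)  # $2/M tokens -> 18000.0
```

Swapping in a provider's real prompt and response prices (many providers price the two differently) turns this into a quick what-if calculator before you commit to a model.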
To do this you absolutely need a cost control strategy that provides visibility in real-time—because you can only control what you can see and understand. It should offer usage controls to impose boundaries on API calls and costs, and it should include optimization policies to get the best value out of every API call.
OpenAI Usage Visibility
OpenAI’s management panel gives you some basic visibility, including your monthly billing, and Datadog recently created a dedicated dashboard for OpenAI (you can request access to the beta here) that includes real-time tracking of API calls and token requests. While this is an improvement, we believe that for the best visibility you need to track OpenAI’s three rate limits yourself:
- RPD — requests per day
- RPM — requests per minute
- TPM — tokens per minute
You should also track the different LLM models you’re using as well as response sizes since both add costs.
Displaying your own metrics for every API call and response on your own dedicated, real-time monitoring dashboard is the gold standard that will give you the necessary information to impose intelligent boundaries and so exercise greater cost control.
With any API call (and response) in ChatGPT you can extract a ton of useful data from headers in real-time. Request limits, token limits, quotas remaining, completion tokens (for responses), and prompt tokens (for requests) are all available in real-time, giving you the ability to obtain highly detailed data about your consumption patterns.
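As a sketch of what that extraction can look like: the rate-limit figures come from OpenAI's `x-ratelimit-*` response headers, while per-call token counts come from the `usage` object in the JSON body of a completion response. The helper below parses both from stubbed values; in production the inputs would come from the live HTTP response.

```python
# Sketch: pull consumption metrics out of an OpenAI HTTP response.
# Header names follow OpenAI's x-ratelimit-* convention; token counts
# come from the `usage` object in the completion's JSON body.
def extract_metrics(headers: dict, body: dict) -> dict:
    usage = body.get("usage", {})
    return {
        "requests_remaining": int(headers.get("x-ratelimit-remaining-requests", 0)),
        "tokens_remaining": int(headers.get("x-ratelimit-remaining-tokens", 0)),
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
    }

# Example with stubbed values (real values come from the live API):
metrics = extract_metrics(
    {"x-ratelimit-remaining-requests": "59", "x-ratelimit-remaining-tokens": "39500"},
    {"usage": {"prompt_tokens": 900, "completion_tokens": 120}},
)
```

Emitting these four numbers to your monitoring stack on every call is what turns raw responses into the real-time dashboard described above.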
As a proof of concept, we at lunar.dev collaborated with the AutoGPT community to measure the API’s real-time usage by integrating it with our Egress API proxy. The lunar.dev team discovered that, with no (or poorly performing) consumption limits in place on the client side, the ChatGPT autonomous agent made excessive API calls and fell into exponential backoff. This resulted in excessive cost and poor performance, and it brought home to us the need to track calls using API consumption controls. It should come as no surprise that even the official repository warns that ChatGPT is expensive and that you should monitor usage (see below).
Watch how this is done in the full demo by lunar.dev here:
Setting the Right Third-Party API Consumption Controls
Separate Usage Across Environments
To control consumption you need to track API usage, but you also need to separate it out across environments. Using the same API token in multiple environments, such as staging and production, can cause problems and exhaust that single resource. This is why it’s important to use a separate key for each environment.
ChatGPT has a memory: it stores your previous calls, so it’s sometimes necessary to use the same API and functionality across different environments. In that case you will need to share the same API key, which means generating sub-tokens from that API token, something the Lunar Proxy lets you do. Check out our quota allocation sandbox here to see exactly how it’s done.
Implement Rate Limits
Next you need to implement rate limits for TPM, RPM, and RPD (in OpenAI), which means knowing both the rate limits themselves and how close you are to reaching them in real time. Best practice also recommends setting alerts and active controls that fire when you approach them.
Hard and Soft Limits
A hard limit generates a 429 response; a soft limit is more flexible. If you know the actual real-time rate limit, you can set your own soft limits, and OpenAI lets you do this from its control panel.
Manage Retry Attempts
It is important to manage retries because even failed API calls cost you money, and retries can easily cascade.
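A minimal sketch of bounded retries, assuming a generic `call` function that raises on a retryable error (such as a 429): cap the number of attempts and back off exponentially with jitter, instead of retrying forever.

```python
import random
import time

# Sketch: cap retry attempts so failed calls can't cascade into runaway cost.
# `call` is any zero-argument function that raises on a retryable error.
def call_with_retries(call, max_attempts: int = 3, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error
            # exponential backoff with jitter: base, 2x base, 4x base, ...
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

The hard cap is the cost control: after `max_attempts` failures the error propagates, rather than silently burning more paid calls.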
API Consumption Optimization Techniques for OpenAI
Here are some important points to remember:
- Every prompt and every response costs you money.
- The shorter the prompt, the fewer the tokens it uses and the less it will cost you.
- The shorter the API response the less you will pay.
The first prompt adaptation technique, call batching, combines multiple inputs within a single call to reduce the frequency of API calls and make the best use of each token. Combining queries reduces the metadata and JSON-structure overhead, and it uses fewer tokens overall.
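A simple sketch of call batching: fold several short questions into one numbered prompt so a single completion call (with one set of request overhead) answers them all. The `ask_llm` name mentioned in the comment is a hypothetical stand-in for your actual completion call.

```python
# Sketch of call batching: several queries share one API call and one set
# of JSON/metadata overhead, instead of one call per question.
def batch_prompt(questions: list[str]) -> str:
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return (
        "Answer each numbered question in one sentence, "
        "keeping the same numbering:\n" + numbered
    )

prompt = batch_prompt([
    "What is an API rate limit?",
    "What does HTTP 429 mean?",
])
# One call like ask_llm(prompt) now replaces two separate calls.
```

The numbered format also makes the combined response easy to split back into per-question answers.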
A second technique is few-shot prompting, where your prompts include worked examples, usually in a prompt/answer format such as, “Here’s an example of the answer for X and Y, so now give me the answer for Z.” Some LLM models provide tools to do this.
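Concretely, a few-shot prompt is just a template with examples baked in. The task and examples below are invented for illustration; the point is that the examples pin down the output format so you don't pay for clarifying round trips or verbose answers.

```python
# Sketch of few-shot prompting: embedded examples show the model the
# expected answer format. Task and examples are invented for illustration.
FEW_SHOT = """\
Convert the city to its country.
Q: Paris -> A: France
Q: Tokyo -> A: Japan
Q: {city} -> A:"""

prompt = FEW_SHOT.format(city="Oslo")
```

Because the model mirrors the examples, responses stay short and uniform, which keeps completion-token costs predictable.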
Drop Unnecessary Words
Reduce your queries to the bare minimum, stripping out unnecessary words and punctuation. There’s no need to be polite to an LLM, and doing so will cost you more without improving the accuracy of the response. Reducing prompt tokens per API call can cut costs by 30 to 50%.
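As a deliberately naive sketch of the idea, the helper below drops a small invented list of filler and politeness words from a prompt. Real trimming should be more careful than a stop-word list, but it shows where the savings come from:

```python
# Naive sketch: strip filler and politeness words that add tokens without
# improving the answer. The FILLER list is illustrative, not exhaustive.
FILLER = {"please", "kindly", "could", "you", "thanks"}

def trim_prompt(prompt: str) -> str:
    words = [w for w in prompt.split() if w.lower().strip(",.!?") not in FILLER]
    return " ".join(words)

trim_prompt("Please summarize this article")  # -> "summarize this article"
```

Every word removed here is one or more prompt tokens you don't pay for, on every single call.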
Model Chaining Optimizations
As the name suggests, you can combine call batching, prompt optimization, and prompt selection, effectively chaining the logic before the API call is made. Lunar’s API Proxy lets you do this, so you can chain and model all of these techniques before making an actual API call in the production environment.
As well as optimizing the input, you can do the same with the output, which, you’ll recall, also costs you tokens and thus money. You can reduce the cost of responses without reducing their accuracy by limiting the maximum number of tokens generated per completion, and also the maximum number of completions generated for each prompt.
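In OpenAI's completion API these two caps correspond to the `max_tokens` and `n` request parameters. The fragment below sketches a request body with illustrative values; the model name and limits are placeholders, not recommendations:

```python
# Sketch: bound the cost of each response by capping completion length
# (`max_tokens`) and the number of completions per prompt (`n`).
# Parameter names match OpenAI's completion API; values are illustrative.
request = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Summarize: ..."}],
    "max_tokens": 150,  # hard ceiling on tokens generated per completion
    "n": 1,             # generate a single completion per prompt
}
```

With these set, the worst-case completion cost of a call is known before you send it, which makes per-call budgeting possible.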
As we saw at the beginning, LLMs can charge wildly different amounts for doing the same job. LLM cascading means using the cheapest model that can meet an accuracy threshold for each task. If the cheapest one can do the job, that’s the one you use; if it can’t, the next least expensive one takes the reins, and so on.
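The control flow of a cascade is small enough to sketch directly. Here the models, their ordering, and the scoring function are all hypothetical stand-ins; in practice the scorer might be a confidence estimate or a validation check on the answer's format.

```python
# Sketch of an LLM cascade: try models cheapest-first and stop at the first
# answer that clears the quality threshold. Models and scorer are stand-ins.
def cascade(prompt: str, models: list, score, threshold: float = 0.8):
    answer = None
    for model in models:           # models ordered cheapest-first
        answer = model(prompt)
        if score(answer) >= threshold:
            return answer          # cheapest acceptable model wins
    return answer                  # fall back to the last (priciest) attempt

cheap = lambda p: "short answer"
expensive = lambda p: "detailed answer"
result = cascade("q", [cheap, expensive], score=lambda a: len(a) / 15)
```

If most traffic is simple enough for the cheap model, the expensive model is only paid for on the hard tail of requests.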
Caching (LLM Approximation)
The idea here is that because online service users frequently access popular or trending content, cache systems can respond to this behavior by storing commonly accessed data.
This potentially gives you various queries and responses that you shouldn’t need to pay for more than once. If you can cache the response to the right query, you have no need to bother the API for an answer. This reduces outgoing API calls to LLMs and so can significantly reduce your costs.
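A minimal sketch of the exact-match version of this idea: key the cache on a normalized prompt and only call the API on a miss. The `fake_llm` below is a stub standing in for a real completion call, and semantic (similarity-based) caching is deliberately out of scope here.

```python
import hashlib

# Sketch of a response cache keyed on the normalized prompt.
# Exact-match only; similarity-based caching is a harder problem.
class PromptCache:
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

    def get_or_call(self, prompt: str, call):
        key = self._key(prompt)
        if key not in self._store:
            self._store[key] = call(prompt)  # only pay the API once per prompt
        return self._store[key]

calls = []
fake_llm = lambda p: calls.append(p) or f"answer:{p}"  # stub, not a real API
cache = PromptCache()
cache.get_or_call("What is an API?", fake_llm)
cache.get_or_call("what is an api? ", fake_llm)  # normalizes to a cache hit
```

Even this trivial normalization (trimming and lowercasing) catches near-duplicate prompts; the article's caveats about prompt similarity apply once queries get more varied than this.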
One problem with this approach is that judging the similarity between prompts is not always easy, because LLM queries can be so complex and varied. Storage caches can also become very large, and caching is less effective for context-dependent outputs.
Key Takeaways
- API consumption costs can, and most probably will, become a significant part of your cloud billing. It’s better to invest in active controls now, before it’s too late.
- After visibility, active controls are the next most important investment you can make to control your cloud spend.
- In the next wave of GenAI, optimizing consumption is becoming a new frontier that needs to be addressed with cloud architecture adaptations.
While building your own middleware to control soaring costs and maintenance overhead for providers such as OpenAI is tempting, it’s no easy feat. Because consumption keeps growing as companies scale, our current recommendation is to employ a dedicated infrastructure service like the Lunar.dev Egress API Proxy to handle all the optimization policies we’ve covered. Every API call is wired through the proxy and only submitted once it has been optimized, so the API provider receives it in its most cost-effective and efficient form, or in some cases the call is answered from the cache, whichever is the most efficient!
Intrigued? Drop us an email and we’ll be happy to tell you more about how Lunar.dev can help you better manage your OpenAI API costs.