90% Reduction in LLM Traffic with Just a Few Lines of Code

We recently discovered that a seemingly clean and efficient async Python script was quietly sending unnecessary requests to our FastAPI server, driving up both cost and load. The culprit? A misunderstanding of how asyncio schedules tasks.

Our client script ran 100 requests concurrently, but creating the tasks is what scheduled them: the event loop started every request as soon as it got control, long before we began iterating over the completed futures. This meant all 100 requests were sent immediately, regardless of when we decided to stop listening.
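The original article doesn't show the client code, but the pattern it describes looks roughly like this sketch. The httpx client, the hypothetical /generate endpoint, and the early-exit threshold of 10 are all illustrative assumptions, not details from the source:

```python
import asyncio

import httpx

URL = "http://localhost:8000/generate"  # hypothetical FastAPI endpoint


async def fetch(client: httpx.AsyncClient, prompt: str) -> str:
    resp = await client.post(URL, json={"prompt": prompt})
    return resp.text


async def main() -> None:
    prompts = [f"prompt {i}" for i in range(100)]
    async with httpx.AsyncClient() as client:
        # create_task schedules every coroutine right away: all 100
        # requests start firing as soon as the event loop gets control.
        tasks = [asyncio.create_task(fetch(client, p)) for p in prompts]
        results = []
        for future in asyncio.as_completed(tasks):
            results.append(await future)
            if len(results) >= 10:
                # We stop listening here, but the other 90 requests
                # were already sent (and billed).
                break
        # Tidy up the leftover tasks; the damage is already done,
        # since their requests hit the server the moment they started.
        for task in tasks:
            task.cancel()


asyncio.run(main())
```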

To fix this, we introduced a semaphore to limit concurrency, ensuring that only a fixed number of requests could be in flight at any one time. Tasks still queued behind the semaphore had not yet sent anything, so once we had enough results they could be cancelled at zero cost to the server. With this change, we saw an immediate 90% reduction in request volume and LLM cost, with no noticeable degradation in client experience.
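A minimal sketch of the fix under the same assumptions as above; the concurrency limit of 10 is illustrative, since the article doesn't state the exact value used:

```python
import asyncio

import httpx

URL = "http://localhost:8000/generate"  # hypothetical endpoint, as above
MAX_CONCURRENCY = 10  # illustrative limit, not from the source


async def fetch(client: httpx.AsyncClient, sem: asyncio.Semaphore, prompt: str) -> str:
    # A task parked here has not sent its request yet, so cancelling
    # it later costs the server nothing.
    async with sem:
        resp = await client.post(URL, json={"prompt": prompt})
        return resp.text


async def main() -> None:
    prompts = [f"prompt {i}" for i in range(100)]
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with httpx.AsyncClient() as client:
        tasks = [asyncio.create_task(fetch(client, sem, p)) for p in prompts]
        results = []
        for future in asyncio.as_completed(tasks):
            results.append(await future)
            if len(results) >= 10:
                break
        # Tasks still waiting on the semaphore never sent a request;
        # cancelling them here is what keeps the other 90 off the wire.
        for task in tasks:
            task.cancel()


asyncio.run(main())
```

The semaphore alone caps how many requests are in flight; it's the combination with early exit and cancellation that turns the cap into a 90% cut in total volume.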

This small structural adjustment taught us an important lesson about responsible, efficient engineering. By being mindful of how our code runs and making intentional design decisions, we can save time, money, and frustration in the long run. Remember, a few lines of code can make all the difference.

Source: https://towardsdatascience.com/how-we-reduced-llm-cost-by-90-with-5-lines-of-code