Builds ยท mentor.work ยท AI view
How I Cut My LLM Bill to Near Zero Without Changing a Single Line of App Code
There's a particular kind of dread that comes with checking your OpenAI billing dashboard mid-month.
I've built a news automation hub that runs 14 editorial workspaces — summarizing, rewriting, fact-checking, SEO-tagging — around the clock. The AI layer was already doing smart things: rotating between Groq, Gemini Flash, DeepSeek. But the fallback was always the same expensive safety net, and it got hit more than I wanted.
The fix turned out to be embarrassingly simple.
The actual problem
My rotation logic was clever but narrow. Seven providers, yes — but I was always one Groq rate-limit away from burning credits on OpenAI. The free tiers I had configured were already maxed. And adding more providers meant more code, more key management, more conditional branches.
What I actually needed wasn't smarter code. I needed a smarter endpoint.
What I did
I stumbled on freellmapi — a self-hosted proxy that puts 14+ free-tier LLM providers behind one OpenAI-compatible endpoint. Groq, Cerebras, SambaNova, Cloudflare Workers AI, GitHub Models, OpenRouter free tier, and more. Combined: roughly 800M tokens/month free.
The integration on my end? Three things:
- Deploy the service on my existing VPS (Node 20, ~40MB RAM idle)
- Drop my Groq key and Cloudflare credentials into its admin
- Add one new provider entry to my routing layer that calls localhost:3001 without specifying a model, letting freellmapi pick whatever is available
That last part is the key insight: when you don't specify a model, it auto-routes. Groq exhausted? Cloudflare Workers AI picks up. That provider slow? Next in line. All of this happens inside the proxy, invisible to my application code.
My AI hub didn't need to know any of this. It just saw another provider in the rotation.
What I learned
The instinct when hitting rate limits is to add more code — more providers, more logic, more conditionals. But sometimes the right move is to add an abstraction layer that handles the chaos for you.
freellmapi isn't magic. You still need to seed it with keys, and the free tiers have real limits. But the failover is automatic, the key rotation is built-in, and the ~800M monthly tokens is a meaningful buffer before you touch anything paid.
More importantly: it forced me to think about where I actually needed a specific model versus where any decent 70B model would do. Turns out, most tasks fall in the second category.
The honest caveat
Free tiers have latency variance. Some providers are fast, some have cold-start delays. For real-time user-facing responses, test your fallback chain carefully. For background pipelines and batch jobs — summarization, translation, content drafts — it's been solid.
Also: run it on your own infrastructure. This proxy handles your upstream API keys — don't hand that to a third party.
The build took about an hour. The bill impact was immediate.
Sometimes the best optimization is the one you don't have to maintain.
This article was AI-assisted and edited by Mervin. All facts were verified against primary sources before publishing.