Meta has recently opened the floodgates for self-hosted LLMs by releasing its Llama 3.1 and 3.2 models. To somewhat less public fanfare, Alibaba also recently released a powerful set of models called Qwen 2.5.
I am building a tool that passes long-context prompts to an LLM and requests various summaries of the context. LLM providers with APIs charge on a per-token basis, so the longer my prompt, the more I pay. Self-hosting should fix this because the cost is fixed: as I grow the user base and send more and more tokens to the LLM, my costs stay the same.
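To make that argument concrete, here's a rough sketch of the crossover point. The per-million-token price and the flat GPU cost below are illustrative assumptions, not real quotes:

```python
# Rough break-even sketch: per-token API pricing vs. a flat-rate GPU rental.
# Both prices below are illustrative assumptions, not actual quotes.
API_PRICE_PER_MILLION_TOKENS = 0.15  # assumed hosted-model price, $ per 1M tokens
GPU_MONTHLY_COST = 860.0             # assumed flat monthly cost of a rented GPU

def api_cost(tokens_per_month: float) -> float:
    """Cost grows linearly with the tokens I send."""
    return tokens_per_month / 1_000_000 * API_PRICE_PER_MILLION_TOKENS

def self_host_cost(tokens_per_month: float) -> float:
    """Cost is flat no matter how many tokens I send."""
    return GPU_MONTHLY_COST

break_even = GPU_MONTHLY_COST / API_PRICE_PER_MILLION_TOKENS * 1_000_000
print(f"Self-hosting only wins past ~{break_even:,.0f} tokens per month")
```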
That's what I think will happen, at least. Let's try it out.
I set up an account on Runpod. Prices range from $0.50 to $3.00 per hour. Hourly pricing? OK, apparently per-hour is the standard for GPU pricing, so I have to do the monthly calculation myself: at roughly 720 hours in a month of 24/7 uptime, that's $360 to $2,160 per month. Yowza. OK, let's see if this is tenable. At a $30/month subscription price I need between 12 and 72 customers to pay this off. 12 feels good, 72 not so much. Let's figure out how much scale I need.
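In code, the napkin math looks like this (assuming the instance runs 24/7, i.e. roughly 720 hours a month):

```python
# Turn Runpod's hourly GPU prices into a monthly cost and the number of
# $30/month subscribers needed to cover it. Assumes the box runs 24/7.
import math

HOURS_PER_MONTH = 720       # ~30 days * 24 hours
SUBSCRIPTION_PRICE = 30.0   # what I charge per customer per month

def customers_to_break_even(hourly_rate: float) -> int:
    monthly_cost = hourly_rate * HOURS_PER_MONTH
    return math.ceil(monthly_cost / SUBSCRIPTION_PRICE)

for rate in (0.50, 3.00):
    print(f"${rate:.2f}/hr -> ${rate * HOURS_PER_MONTH:,.0f}/mo "
          f"-> {customers_to_break_even(rate)} customers")
# $0.50/hr -> $360/mo -> 12 customers
# $3.00/hr -> $2,160/mo -> 72 customers
```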
I research the different kinds of GPUs offered and decide on the Nvidia A100 at $1.19 per hour. This can apparently run Llama 3.1 70b at 4-bit quantization with no problem (oh yeah, I had to learn what quantization is; my basic understanding now is that it's like lossy compression for LLMs). So $1.19 per hour is about $856 per month, which requires 29 customers to cover. I set up the instance with a handily supplied Dockerfile for running Ollama, which is apparently the easiest way to get LLMs up and running.
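A quick sanity check on why 4-bit quantization is the thing that makes a 70b model fit on a single A100 (weights only; the KV cache and runtime overhead add a real chunk on top):

```python
# Napkin math: VRAM needed just to hold the model weights.
def weights_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(f"70b @ 16-bit: ~{weights_gb(70, 16):.0f} GB")  # ~140 GB, won't fit on one GPU
print(f"70b @  4-bit: ~{weights_gb(70, 4):.0f} GB")   # ~35 GB, fits on an 80 GB A100
```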
I ssh into the instance, type `ollama run llama3.1:70b`, and wow, it works! I type some fun chat queries like "give me a plan to take over the world" because I thought using open-source models meant they would do that for me, but unfortunately it gave me the standard "that is unethical" BS. Still, the responses are snappy, and this is seeming very promising.
Cool, now to hook it up to my API. Index the API docs, give them to Cursor, tell Cursor to implement my AI wrapper interface, set the request URL to my Runpod instance, bada bing bada boom.
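For what it's worth, the Ollama-backed version of the wrapper ends up being very little code, since Ollama exposes an OpenAI-compatible endpoint at /v1. A minimal sketch, with a placeholder Runpod URL and a made-up summarize helper standing in for my actual interface:

```python
# Minimal sketch of an Ollama-backed client via Ollama's OpenAI-compatible
# /v1 endpoint. The Runpod URL is a placeholder, not a real instance.
from openai import OpenAI

client = OpenAI(
    base_url="https://my-runpod-instance.example.com:11434/v1",  # placeholder
    api_key="ollama",  # Ollama ignores the key, but the client library requires one
)

def summarize(context: str) -> str:
    response = client.chat.completions.create(
        model="llama3.1:70b",
        messages=[
            {"role": "system", "content": "Summarize the provided context."},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content
```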
I run it... oh boy, this is slow. OK, no problem, just put a rate limit on there: 10 requests per second. OK, this is less slow. But after a minute or so the request queue piles up on my Runpod instance and all of the requests start timing out. Alright, I add fine-grained control over the total number of concurrent requests we can have open at once (let's say 10). This is doing well now. But it's too slow! In testing I am running a single job that sends a lot of concurrent requests to the LLM; in prod I will have multiple jobs running at the same time. 10 concurrent requests is not enough to get good performance for just me, so it is definitely not enough for me AND at least 29 other users.
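The concurrency cap is basically just a semaphore around the request. A sketch of the async version, with the same placeholder URL as above:

```python
# Cap the number of in-flight requests so the Ollama server doesn't build
# up a queue and start timing out. Sketch only; the URL is a placeholder.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://my-runpod-instance.example.com:11434/v1",  # placeholder
    api_key="ollama",
)

MAX_CONCURRENT_REQUESTS = 10
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def summarize(context: str) -> str:
    async with semaphore:  # block here if 10 requests are already open
        response = await client.chat.completions.create(
            model="llama3.1:70b",
            messages=[{"role": "user", "content": f"Summarize:\n{context}"}],
        )
    return response.choices[0].message.content

async def run_job(chunks: list[str]) -> list[str]:
    # A "job" fans out many summaries at once; the semaphore throttles them.
    return await asyncio.gather(*(summarize(c) for c in chunks))
```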
I do some more research and find that Ollama isn't really built to perform well under concurrent requests. For that use case, vLLM is recommended. OK, kill the Runpod instance, get a new one with the latest vLLM template. After lots of learning and messing around with the config, voila: I have one instance running Llama 8b and another running Qwen 32b. I try out the rate limiting with Llama 8b and requests are still coming in faster than the server can handle, so I cap it at 25 concurrent requests, and now things are humming along really well. 25 concurrent requests at a pretty snappy speed is looking good. The problem is that Llama 8b is just not giving me good output.
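Finding the right cap was trial and error; a crude sweep like the one below (pointed at the vLLM instance's OpenAI-compatible endpoint, with a placeholder URL and model name) is enough to see where throughput stops improving:

```python
# Crude concurrency sweep: time a fixed batch of requests at several
# concurrency limits to find the sweet spot. URL and model are placeholders.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://my-vllm-instance.example.com:8000/v1",
                     api_key="not-needed")

async def sweep(limits=(5, 10, 15, 25, 50), batch_size=50):
    prompt = "Summarize this: " + "lorem ipsum " * 200  # dummy long-ish prompt
    for limit in limits:
        sem = asyncio.Semaphore(limit)

        async def one_request():
            async with sem:
                await client.chat.completions.create(
                    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
                    messages=[{"role": "user", "content": prompt}],
                )

        start = time.perf_counter()
        await asyncio.gather(*(one_request() for _ in range(batch_size)))
        elapsed = time.perf_counter() - start
        print(f"concurrency {limit:>3}: {batch_size / elapsed:.1f} req/s")

asyncio.run(sweep())
```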
I look around for info on Qwen and the benchmarks compare it to GPT-4o-mini, which I am currently using in prod. OK, I try it out: it works, but I have to limit concurrent requests to 15 and it's slower than I would like. The quality of the output is pretty similar to GPT-4o-mini, though. I switch back to GPT-4o-mini and the speed is blazing fast. The cost/performance tradeoff is stark. It's not fun to watch the cost on my OpenAI usage dashboard shoot up, but it is nice to watch the job spit out information blazingly fast.
The real question is how far I can scale this. If I got an H100, what kind of concurrency could it handle? It's not worth thinking about right now, but in a hypothetical future where I have something like 500 users, maybe it makes sense to get an H100 and self-host. For now it is tempting to keep trying to make things work at zero scale, but the most prudent thing to do is to keep using APIs. Once there is more traffic in the system, maybe the economics will work out and I can realize the dream of zero-marginal-cost inference.