Following our previous post on how to self-host Llama-2, several people have asked us about the best hosting options for Llama-2. It seems like a simple question, but as we've dug deeper we've found a maze of options for companies to navigate.
There are several providers to choose from. Among the traditional cloud service providers (CSPs) such as AWS, GCP and Azure, there are different instance types available with different GPU options. Then there are newer, more GPU-focused CSPs such as Lambda Labs and CoreWeave. Oracle and HuggingFace offer GPU options that charge by hourly usage. Lastly, there are Function-based providers such as Replicate, Modal and Banana that charge per second of usage. Each of these providers has further variables around availability of GPU capacity and different pricing tiers (e.g., on-demand vs. spot vs. long-term commitments; hourly vs. per-second vs. per-token pricing).
Understanding your expected usage is a key factor.
How much throughput (i.e., requests per second) do you expect?
How much utilization (i.e., how bursty will the usage be)?
What will be the average LLM call time (i.e., how long will the prompts take to complete on average)?
Which model size do you need? The guidance from Replicate is as follows:
70b-chat: good for chatbots, dialogue, logic, factual questions, coding
13b-chat: good for writing stories and poems
7b-chat: good for summaries and categorization
How many GPUs to use? Even for the Llama-2-70b-chat model, we have seen quite a bit of variability in how many A100 GPUs are used (below). Some of these likely use quantization (another option!) to fit the model on fewer GPUs, but it isn't always clearly stated whether and how it's applied. Some numbers we've seen are below, with a back-of-the-envelope memory estimate after the list:
8xA100 - Meta repo
4xA100 - HuggingFace
1xA100 - Replicate
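One way to make sense of that spread is a quick memory estimate: the weights alone take roughly the parameter count times the bytes per parameter, before any allowance for the KV cache and activations. Here's a minimal Python sketch; the 20% overhead factor is our own assumption, not a measured figure:

```python
import math

# Back-of-the-envelope GPU count for serving a 70B-parameter model.
# The 20% overhead allowance for KV cache and activations is an assumption, not a benchmark.
PARAMS_B = 70          # Llama-2-70b-chat parameter count, in billions
A100_MEM_GB = 80       # memory per A100 80G
OVERHEAD = 1.2         # rough allowance for KV cache and activations

for precision, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    weights_gb = PARAMS_B * bytes_per_param   # 70e9 params * bytes, expressed in GB
    total_gb = weights_gb * OVERHEAD
    gpus = math.ceil(total_gb / A100_MEM_GB)
    print(f"{precision}: ~{weights_gb:.0f} GB weights, ~{total_gb:.0f} GB total, >= {gpus}x A100 80G")
```

At fp16 the weights alone don't fit on a single A100, while 4-bit quantization brings the whole model within one 80G card, which would explain the 1xA100 configuration.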
One question that comes up is: how much utilization do I need to make it worthwhile to switch from a Function-based provider like Replicate to hosting on my own instance via a CSP? For our initial analysis, we looked at the on-demand pricing for a 4xA100 (80G, except 40G for Modal and Banana) system running Llama-2-70b-chat inference across the 9 providers mentioned above.
Further assumptions for the purpose of this analysis:
Average call time of 10s, throughput of 0.1 rps (i.e., 1 simultaneous request at a time; multiply the costs below proportionally by the number of desired simultaneous requests)
Doesn't take into account any idle or cold-start time for the Function-based providers (so a lower bound on their costs), or networking egress and storage costs for the CSPs
Assumes CSP instances are on for the full month (although in principle one could turn them on or off with additional work).
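For those who want to reproduce or adapt the comparison, here's a minimal sketch of the break-even arithmetic under those assumptions. The per-second and per-hour prices are placeholders rather than any provider's actual rates; substitute the figures from the pricing pages referenced at the end of this post.

```python
HOURS_PER_MONTH = 730                  # average hours in a month
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600

# Placeholder prices -- swap in real on-demand rates from the providers' pricing pages.
function_price_per_second = 0.012      # assumed $/second for a Function-based provider
csp_price_per_hour = 6.00              # assumed $/hour for an always-on 4xA100 CSP instance

avg_call_time_s = 10.0                 # per the assumptions above
throughput_rps = 0.1                   # i.e., 1 simultaneous request at a time
concurrency = throughput_rps * avg_call_time_s   # simultaneous requests; costs scale linearly with this

# Always-on CSP instance: you pay for the whole month regardless of utilization.
csp_monthly = csp_price_per_hour * HOURS_PER_MONTH * concurrency

# Function-based provider: you pay only for the seconds spent actively serving.
def function_monthly(utilization: float) -> float:
    return utilization * SECONDS_PER_MONTH * function_price_per_second * concurrency

# Utilization at which the two options cost the same.
break_even = csp_monthly / function_monthly(1.0)
print(f"CSP monthly cost:       ${csp_monthly:,.0f}")
print(f"Function cost at 10%:   ${function_monthly(0.10):,.0f}")
print(f"Break-even utilization: {break_even:.0%}")
```

With the placeholder prices above, the break-even lands in the low teens; plugging in real rates is what produces the 4%-18% range discussed below.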
Here are the findings:
To the right of the red line intercepts is the utilization zone where the CSPs are more cost-efficient than the Function providers. To the left of the blue line intercepts is the utilization zone where the Function providers are more cost-efficient than the CSPs.
At utilizations as low as 4% for the cheapest CSPs (and a still fairly low 18% for the most expensive CSP in our analysis), it's cheaper to run the CSP instances continuously than to use a Function provider. For context, 4% utilization is only about 29 hours of active serving in a 730-hour month.
Lambda Labs comes out cheapest among the CSPs (though anecdotally it's been hard to get capacity there).
Modal appears cheapest among the Function providers but a caveat is that we didn’t factor in the additional costs related to memory and CPU cores there.
Anecdotally, reduced latency is another factor in favor of the CSPs over the Function providers.
Based on this analysis, Lambda Labs and CoreWeave would be the most cost-effective options for hosting and deploying Llama-2, assuming you can get capacity with them. Oracle is the next best bet in the larger CSP category. Services like Replicate and Modal could be great for getting started quickly with prototyping while the spend is <$4,000/month, but beyond that and at scale you'll likely want to use your own cloud-hosted solution, not just for cost but for ownership and data privacy reasons.
How much are you spending on Llama-2 today? Tell us here. And do reach out if you need help figuring out where to efficiently host your Llama-2 model. We’re happy to speak with you, understand your requirements, and provide an estimate.
References used for pricing (as of August 17, 2023):
https://cloud.google.com/compute/vm-instance-pricing#accelerator-optimized
Didn’t include Azure in the above analysis as they were the only CSP without the 80G A100 option. 4xA100 on the 40G option would’ve worked out to $9,790.20 if you’re curious!