Banana bills for the amount of time that your inference servers are running on GPUs. This includes:
Often called the cold boot, this is the time it takes to load the model from disk into GPU RAM. Our optimization step greatly reduces this time.
This is the time during which the server is live and handling calls from the queue. Inference time is also reduced by our optimization step.
As a general rule, you can make Banana faster by paying more, or more economical by tolerating longer wait times for your calls. Because the autoscaler scales your servers for you, you control billing by tuning the autoscaler, including the idle timeout a replica waits for new calls before scaling down.
Imagine you have an image generation model you want to host:
- Your model takes 10 seconds to generate an image
- We can cold boot your model in 5 seconds
Scenario 1: You call your model 1 time
- Cost = 5 (cold-boot) + 10 (inference) + 10 (timeout) = 25 seconds of GPU time
Scenario 2: You call your model 100 times back to back
- Cost = 5 (cold-boot) + 10 (inference)*100 + 10 (timeout) = 1015 seconds of GPU time
Scenario 3: You call your model multiple times concurrently
- Cost = 5 seconds of cold boot per replica, plus 10 seconds of idle timeout per replica, plus all inference time. The number of replicas depends on your autoscaling settings.
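The arithmetic in the scenarios above can be sketched as a small helper. The function name and parameters are illustrative only (not part of any Banana SDK), and it assumes the simple model from the scenarios: each replica pays one cold boot and one idle-timeout window, and every call pays its inference time.

```python
def billed_gpu_seconds(calls, inference_s, cold_boot_s=5, timeout_s=10, replicas=1):
    """Estimate billed GPU-seconds.

    Assumes each replica pays one cold boot and one idle-timeout
    window, and every call pays its full inference time.
    """
    return replicas * (cold_boot_s + timeout_s) + calls * inference_s

# Scenario 1: one call on one replica
print(billed_gpu_seconds(calls=1, inference_s=10))    # 25
# Scenario 2: 100 back-to-back calls on one replica
print(billed_gpu_seconds(calls=100, inference_s=10))  # 1015
```

For Scenario 3, pass `replicas=N` to see how concurrency multiplies the fixed cold-boot and timeout overhead while total inference time stays the same.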