
Billing

Banana bills for the amount of time your inference servers are running on GPUs. This includes:
Model Load
Often called the cold boot, this is the time it takes to load the model from disk into GPU RAM so that it is ready to run. Our optimization step greatly reduces this time.
Inference Time
The time for which the server is live and handling calls from the queue. Our optimization step also reduces inference time.
Idle Timeout
The time a server remains idle before shutting down. Read more in tuning your autoscaler.
As a general rule, you can make Banana faster by paying more, or more economical by tolerating longer wait times for calls. Because the autoscaler scales your servers for you, you control billing by tuning the autoscaler.

Examples

Imagine you have an image generation model you want to host.
Assume:
  • Your model takes 10 seconds to generate an image
  • We can cold boot your model in 5 seconds
  • Your idle timeout is set to 10 seconds

Scenario 1: You call your model 1 time
  • Cost = 5 (cold-boot) + 10 (inference) + 10 (timeout) = 25 seconds of GPU time

Scenario 2: You call your model 100 times back to back
  • Cost = 5 (cold-boot) + 10 (inference)*100 + 10 (timeout) = 1015 seconds of GPU time

Scenario 3: You call your model multiple times concurrently
  • Cost = 5 seconds of cold boot per replica + 10 seconds of timeout per replica + 10 seconds for every inference. The number of replicas depends on your autoscaling settings.
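
The arithmetic in the scenarios above can be sketched as a small Python helper. This is a simplified estimate, not Banana's actual billing logic: the function name gpu_seconds and its parameters are hypothetical, and it assumes each replica cold boots exactly once, runs its calls without gaps, and then idles for one full timeout before shutting down. The default values are the ones used in the scenarios (5 second cold boot, 10 second inference, 10 second timeout).

```python
def gpu_seconds(calls, replicas=1, cold_boot=5, inference=10, idle_timeout=10):
    """Estimate billed GPU seconds under the simplified model described above.

    Assumption (not from Banana's API): every replica pays one cold boot
    and one idle timeout, and every call pays one inference.
    """
    if calls < 0:
        raise ValueError("calls must be non-negative")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    per_replica_overhead = cold_boot + idle_timeout
    return replicas * per_replica_overhead + calls * inference


print(gpu_seconds(calls=1))                # Scenario 1: 25
print(gpu_seconds(calls=100))              # Scenario 2: 1015
print(gpu_seconds(calls=100, replicas=4))  # Hypothetical Scenario 3 with 4 replicas: 1060
```

The last line is only an illustration: the replica count of 4 is made up, since the real number of replicas depends on your autoscaling settings and traffic pattern.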
​
You can tune scaling to handle various cost scenarios depending on your traffic: see Autoscaler.