Banana has a service that watches for traffic and automatically scales your replicas (inference servers) from zero to as many as you need.
The autoscaler controls how many servers you run and how long they run for. This directly determines your GPU time, and therefore your bill. Understanding the autoscaler is key to balancing latency and cost.
- 1. Call is initiated by the SDK.
- 2. Call is placed in the queue.
- 3. Autoscaler sees the waiting call in the queue and starts a server.
- 6. Server sits idle for the idle timeout period to handle any new calls without incurring a cold boot.
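The flow above can be sketched as a toy model. This is only an illustration, not Banana's implementation: the class `ToyAutoscaler` and its methods are hypothetical names, and the cold boot and call handling between starting the server and going idle are assumptions of this sketch, since the list does not spell them out.

```python
from collections import deque


class ToyAutoscaler:
    """Hypothetical single-server model of the call flow (illustration only)."""

    def __init__(self):
        self.queue = deque()
        self.server_running = False
        self.log = []

    def call(self, payload):
        # A call is initiated and placed in the queue.
        self.queue.append(payload)
        self.log.append(f"queued {payload}")

    def tick(self):
        # The autoscaler sees a waiting call and starts a server if none is up.
        if self.queue and not self.server_running:
            self.server_running = True
            self.log.append("server started (cold boot)")
        # Assumption: a running server handles every waiting call.
        while self.queue and self.server_running:
            self.log.append(f"handled {self.queue.popleft()}")

    def idle_timeout_elapsed(self):
        # The idle timeout passed with no new calls, so the server shuts down.
        if self.server_running and not self.queue:
            self.server_running = False
            self.log.append("server stopped")


scaler = ToyAutoscaler()
scaler.call("a")
scaler.tick()                  # cold boot, then handles "a"
scaler.call("b")
scaler.tick()                  # server still warm: no cold boot for "b"
scaler.idle_timeout_elapsed()  # server shuts down
print("\n".join(scaler.log))
```

In this sketch the second call is handled without a second "server started" entry because it arrives before the idle timeout elapses.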
The purpose of the idle timeout is to keep a model warm and ready to answer calls without a cold boot.
Longer timeout = the server will remain ready for longer, reducing latency due to cold boots but increasing GPU time.
Shorter timeout = the server will shut down sooner, increasing latency due to frequent cold boots but reducing GPU time.
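To make the trade-off concrete, here is a minimal simulation of a single replica. It is a sketch under stated assumptions, not a description of how Banana bills: the function `simulate`, the 10-second cold boot, the 2-second handling time, and the rule that GPU time runs from server start until shutdown are all hypothetical values chosen for illustration.

```python
def simulate(arrivals, idle_timeout, handle_time=2.0, cold_boot=10.0):
    """Return (cold_boots, gpu_seconds) for one replica.

    arrivals: sorted call arrival times in seconds.
    Assumption: GPU time is counted from server start to shutdown,
    including the cold boot and the idle period.
    """
    cold_boots = 0
    gpu_seconds = 0.0
    on_since = None  # when the current server session started
    off_at = None    # when the server shuts down if no new call arrives

    for t in arrivals:
        if off_at is not None and t <= off_at:
            # Warm: the server is still up. If it is busy, the call waits.
            free_at = off_at - idle_timeout
            start = max(t, free_at)
            off_at = start + handle_time + idle_timeout
        else:
            # Cold: close out the previous session and boot a new server.
            if off_at is not None:
                gpu_seconds += off_at - on_since
            cold_boots += 1
            on_since = t
            off_at = t + cold_boot + handle_time + idle_timeout

    if off_at is not None:
        gpu_seconds += off_at - on_since
    return cold_boots, gpu_seconds


calls = [0, 30, 200]  # hypothetical arrival times, in seconds
print(simulate(calls, idle_timeout=10))  # (3, 66.0)
print(simulate(calls, idle_timeout=60))  # (2, 164.0)
```

With these made-up numbers, raising the timeout from 10 to 60 seconds removes one cold boot but more than doubles the GPU time.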
The purpose of minimum replicas is to eliminate cold boots up to a threshold by keeping a quantity of GPU servers always on.
Higher Min. Replicas = the model can handle n = min_replica concurrent calls without a cold boot. This increases GPU time significantly, since those servers run 24/7.
Lower Min. Replicas = setting min_replica = 0 means you never pay for always-on servers, but every server has to cold boot when it starts.
Hint: If you want to reduce cold boots, it is generally more economical to increase idle timeout than it is to increase minimum replicas.
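A rough back-of-the-envelope comparison shows why. The helper names and the traffic numbers below are hypothetical, and the sketch assumes GPU time is simply the number of seconds each server is running.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def always_on_gpu_seconds(min_replicas):
    # Servers kept on by minimum replicas run around the clock.
    return min_replicas * SECONDS_PER_DAY


def idle_timeout_gpu_seconds(sessions_per_day, busy_seconds, idle_timeout):
    # Each burst of traffic keeps a server up while it is busy,
    # plus one idle timeout period after the last call.
    return sessions_per_day * (busy_seconds + idle_timeout)


# Hypothetical workload: 20 bursts per day, each keeping a server busy for 300 s.
print(always_on_gpu_seconds(1))                 # 86400
print(idle_timeout_gpu_seconds(20, 300, 600))   # 18000
```

Under these assumed numbers, even a generous 10-minute idle timeout uses about a fifth of the GPU time of a single always-on replica. The gap narrows as traffic becomes more continuous.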