Autoscaler
Banana has a service that watches for traffic and automatically scales your replicas (inference servers) from zero to as many as you need.
The autoscaler determines how many servers you run and how long they run for. Both translate directly into GPU time, and therefore into your bill. Understanding the autoscaler is key to balancing latency and cost.
The Life of a Call
Call is initiated by the SDK
Call is placed in queue
Autoscaler sees the call waiting in the queue and starts a server.
Server cold boots (warms up).
Server handles inferences until queue is drained.
Server sits idle for the idle timeout so it can handle new calls without incurring another cold boot.
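The steps above can be sketched as a small simulation. This is a toy model, not Banana's actual autoscaler: it assumes a single replica, a first-in-first-out queue, and that GPU time accrues from the start of the cold boot until the idle timeout expires. The function name `simulate` and all timing values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimResult:
    cold_boots: int          # how many times the server had to start from zero
    gpu_seconds: float       # total time the server was running (assumed billable)
    latencies: List[float]   # per-call time from arrival to finished inference


def simulate(arrivals, cold_boot=10.0, inference=1.0, idle_timeout=30.0):
    """Toy single-replica model of the call lifecycle (all numbers illustrative)."""
    cold_boots = 0
    gpu_seconds = 0.0
    latencies = []
    server_up = False
    started_at = 0.0   # when the current server began its cold boot
    free_at = 0.0      # when the server finishes the work it already has

    for t in sorted(arrivals):
        # The server shuts down once it has been idle longer than the timeout.
        if server_up and t > free_at + idle_timeout:
            gpu_seconds += (free_at + idle_timeout) - started_at
            server_up = False

        # No server running: the autoscaler starts one, which must cold boot.
        if not server_up:
            cold_boots += 1
            started_at = t
            free_at = t + cold_boot
            server_up = True

        # The call waits in the queue until the server is free, then runs.
        start = max(t, free_at)
        free_at = start + inference
        latencies.append(free_at - t)

    # Account for the final idle period before the last shutdown.
    if server_up:
        gpu_seconds += (free_at + idle_timeout) - started_at

    return SimResult(cold_boots, gpu_seconds, latencies)


if __name__ == "__main__":
    calls = [0, 5, 100]  # arrival times in seconds
    print(simulate(calls, idle_timeout=30.0))
    print(simulate(calls, idle_timeout=100.0))
```

In this model, a longer idle timeout lets the third call skip the cold boot, at the price of more GPU seconds spent idle.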
Tuning the Autoscaler
Hint: If you want to reduce cold boots, it is generally more economical to increase idle timeout than it is to increase minimum replicas.
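A rough back-of-the-envelope comparison shows why the hint usually holds. The sketch below assumes traffic arrives in separate bursts that are far enough apart for the server to shut down between them, and that a minimum replica runs around the clock. The function names and all figures are hypothetical, not taken from Banana's pricing or configuration.

```python
SECONDS_PER_DAY = 86_400


def gpu_seconds_min_replicas(min_replicas):
    """GPU time per day when replicas are kept running permanently."""
    return min_replicas * SECONDS_PER_DAY


def gpu_seconds_idle_timeout(bursts_per_day, busy_seconds, cold_boot, idle_timeout):
    """GPU time per day for one replica that scales to zero between bursts.

    Each burst is assumed to cost one cold boot, the busy time, and one full
    idle timeout. A single replica cannot run for more than a day per day.
    """
    per_burst = cold_boot + busy_seconds + idle_timeout
    return min(bursts_per_day * per_burst, SECONDS_PER_DAY)


if __name__ == "__main__":
    # Illustrative workload: 20 bursts a day, 60 s of inference each.
    always_on = gpu_seconds_min_replicas(1)
    scaled = gpu_seconds_idle_timeout(20, busy_seconds=60, cold_boot=10, idle_timeout=300)
    print(f"min replicas = 1:    {always_on} GPU seconds/day")
    print(f"idle timeout = 300s: {scaled} GPU seconds/day")
```

Under these assumptions the generous idle timeout uses a fraction of the GPU time of an always-on replica. The gap narrows as bursts become more frequent, which is when a minimum replica starts to make sense.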