Banana has a service that watches for traffic and automatically scales your replicas (inference servers) from zero to as many as you need.

The autoscaler determines how many servers you run and how long they run. Both drive GPU time, and therefore your bill, so understanding the autoscaler is key to balancing latency and cost.

The Life of a Call

  1. The SDK initiates a call.

  2. The call is placed in a queue.

  3. The autoscaler sees the waiting call in the queue and starts a server.

  4. The server handles inferences until the queue is drained.

  5. The server sits idle for the idle timeout period so it can handle new calls without incurring a cold boot.
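The steps above can be sketched as a small simulation of a single replica. This is only an illustration, not Banana's implementation: the function name, the `CallResult` type, and the timing parameters (`cold_boot_s`, `inference_s`, `idle_timeout_s`) are all hypothetical, and it assumes one server that handles one call at a time.

```python
from dataclasses import dataclass


@dataclass
class CallResult:
    arrival: float    # when the call entered the queue
    started: float    # when inference began
    finished: float   # when inference completed
    cold_boot: bool   # True if this call had to wait for a server start


def simulate_single_replica(arrivals, cold_boot_s, inference_s, idle_timeout_s):
    """Walk calls through: queue -> start server -> drain queue -> idle."""
    results = []
    free_at = None  # time the server finishes its queued work; None means it never started
    for arrival in sorted(arrivals):
        if free_at is None or arrival > free_at + idle_timeout_s:
            # Server is off (never started, or the idle timeout expired): cold boot.
            started = arrival + cold_boot_s
            cold = True
        else:
            # Server is busy or idling: wait in the queue if needed, then run warm.
            started = max(arrival, free_at)
            cold = False
        finished = started + inference_s
        free_at = finished
        results.append(CallResult(arrival, started, finished, cold))
    return results


if __name__ == "__main__":
    for r in simulate_single_replica([0, 1, 100], cold_boot_s=10,
                                     inference_s=2, idle_timeout_s=30):
        print(r)
```

In this sketch the second call queues behind the first and runs warm, while the third arrives after the idle timeout has expired and pays for a cold boot.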

Tuning the Autoscaler

Idle Timeout

The purpose of the idle timeout is to keep a model warm and ready to answer calls without a cold boot.

Longer timeout = the server will remain ready for longer, reducing latency due to cold boots but increasing GPU time.

Shorter timeout = the server will shut down sooner, increasing latency due to frequent cold boots but reducing GPU time.
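To make the tradeoff concrete, here is a rough model for a single server. It assumes, purely for illustration, that GPU time is counted from the moment a server starts booting until its idle timeout expires. The function and parameter names are hypothetical and the real billing rules may differ.

```python
def gpu_time_and_cold_boots(arrivals, cold_boot_s, inference_s, idle_timeout_s):
    """Return (gpu_seconds, cold_boots) for one server under a given idle timeout.

    Assumption: the server is billed from boot start until the idle timeout expires.
    """
    gpu_seconds = 0.0
    cold_boots = 0
    on_since = None  # when the current server session started booting
    free_at = None   # when the server finishes its queued work
    for arrival in sorted(arrivals):
        if free_at is None or arrival > free_at + idle_timeout_s:
            if free_at is not None:
                # Close the previous session: it ran until its idle timeout expired.
                gpu_seconds += (free_at + idle_timeout_s) - on_since
            on_since = arrival
            cold_boots += 1
            started = arrival + cold_boot_s
        else:
            # Server is still warm: the call runs as soon as the server is free.
            started = max(arrival, free_at)
        free_at = started + inference_s
    if free_at is not None:
        # Close the final session.
        gpu_seconds += (free_at + idle_timeout_s) - on_since
    return gpu_seconds, cold_boots


if __name__ == "__main__":
    calls = [0, 60, 120, 180]  # one call per minute
    for timeout in (10, 120):
        print(timeout, gpu_time_and_cold_boots(calls, 10, 2, timeout))
```

With one call per minute, a 10-second timeout in this model gives 4 cold boots and 88 GPU-seconds, while a 120-second timeout gives 1 cold boot and 302 GPU-seconds.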

Minimum Replicas

The purpose of minimum replicas is to eliminate cold boots up to a threshold by keeping a quantity of GPU servers always on.

Higher Min. Replicas = the model can handle up to min_replica concurrent calls without a cold boot. This increases GPU time significantly, since those servers run 24/7.

Lower Min. Replicas = fewer always-on servers and less GPU time. With min_replica = 0 you never pay for always-on servers, but every server starts with a cold boot.
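The cost side of minimum replicas is simple arithmetic, sketched below. The helper names are hypothetical, and the second function assumes each always-on server handles one call at a time.

```python
def always_on_gpu_hours(min_replicas, days=30):
    """GPU hours consumed by always-on servers over a period, regardless of traffic."""
    return min_replicas * 24 * days


def calls_beyond_warm_pool(concurrent_calls, min_replicas):
    """Concurrent calls that the always-on servers cannot pick up immediately.

    Assumption: one call per server at a time.
    """
    return max(0, concurrent_calls - min_replicas)


if __name__ == "__main__":
    print(always_on_gpu_hours(2))        # 2 servers * 24 h * 30 days
    print(calls_beyond_warm_pool(5, 2))  # calls left waiting for another server
```

Two always-on replicas cost 1,440 GPU hours over 30 days in this model, and a burst of 5 concurrent calls still leaves 3 calls waiting on servers that are not yet warm.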

Maximum Replicas

The purpose of maximum replicas is to set a limit on how many servers can run at any given time.
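One way to picture how the two limits bound the server count is a simple clamp. This is a hypothetical scaling rule, not Banana's actual algorithm: it assumes the autoscaler wants one server per waiting call and then applies the configured minimum and maximum.

```python
def target_replicas(queued_calls, min_replicas, max_replicas):
    """Clamp a hypothetical 'one server per waiting call' target to the configured limits."""
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    return max(min_replicas, min(queued_calls, max_replicas))


if __name__ == "__main__":
    print(target_replicas(10, min_replicas=0, max_replicas=3))  # capped by the maximum
    print(target_replicas(0, min_replicas=1, max_replicas=3))   # held up by the minimum
```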

Hint: If you want to reduce cold boots, it is generally more economical to increase idle timeout than it is to increase minimum replicas.
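A back-of-the-envelope comparison shows why. An always-on replica costs 24 GPU hours per day no matter what, while the extra idle time caused by the timeout is at most the timeout multiplied by the number of idle periods, and can never exceed 24 hours for one server. The function and parameter names below are hypothetical.

```python
def max_idle_hours_per_day(idle_timeout_s, idle_periods_per_day):
    """Upper bound on one server's daily idle GPU hours caused by the idle timeout."""
    return min(24.0, idle_timeout_s * idle_periods_per_day / 3600)


ALWAYS_ON_HOURS_PER_DAY = 24.0  # one minimum replica, running around the clock

if __name__ == "__main__":
    # Example: a 10-minute timeout and 50 separate idle periods per day.
    bound = max_idle_hours_per_day(600, 50)
    print(f"timeout idle bound: {bound:.2f} h vs always-on: {ALWAYS_ON_HOURS_PER_DAY} h")
```

The gap narrows as traffic becomes more frequent, which is why the hint says "generally" rather than "always".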
