A guide to understanding the different parts of an inference call
After tinkering with Banana for a while, you might hit the point where you want a better understanding of how inference calls work. Maybe you want to optimize your calls to make them faster, or you're just curious. Either way, this is the place for you.
The anatomy of an inference call
There are two types of inference calls: calls with cold starts and calls with warm starts.
Calls with cold starts
A cold start occurs when there are no instances running for a particular project. This means an instance needs to be started and the model weights need to be loaded into GPU memory. The time this requires is called the cold start time.
After the result from your inference task has been returned, your project enters what's called an idle timeout: after completing a task, it waits X seconds for a new task to arrive. If a new task arrives within this window, you get a call with a warm start. If not, the instance is shut down.
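The idle-timeout behaviour can be sketched as a simple loop. This is a minimal simulation, not Banana's actual implementation; the queue and timeout value are placeholders:

```python
import queue

def run_instance(tasks: "queue.Queue", idle_timeout: float) -> int:
    """Process tasks until none arrives within idle_timeout seconds,
    then shut down. Returns the number of tasks handled while warm."""
    handled = 0
    while True:
        try:
            # Wait up to idle_timeout seconds for the next task.
            task = tasks.get(timeout=idle_timeout)
        except queue.Empty:
            # No task arrived in time: the instance shuts down,
            # so the next call will pay the cold start.
            return handled
        handled += 1  # a task arrived while warm: no cold start

# Example: two tasks queued, then the instance idles out.
q = queue.Queue()
q.put("task-1")
q.put("task-2")
print(run_instance(q, idle_timeout=0.1))
```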
Calls with warm starts
A warm start occurs when there is a running instance and the project is loaded into GPU memory. This happens in two cases:
You have a project that just finished an inference and is now in the idle timeout
You have an always-on replica that isn't running an inference job
The main difference is that you save the time of starting an instance and loading the model into memory. Note, however, that keeping an instance warm increases your billable GPU time.
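The tradeoff can be made concrete with a back-of-the-envelope calculation. All numbers below are made-up placeholders, not Banana's actual pricing:

```python
# Hypothetical placeholder numbers, not Banana's actual rates.
gpu_price_per_hour = 1.80                      # $ per GPU-hour (placeholder)
price_per_second = gpu_price_per_hour / 3600

idle_timeout_seconds = 30                      # your project's idle timeout
calls_per_hour = 60                            # expected traffic (placeholder)

# Worst case: the instance idles out after every single call,
# so each call pays for the full idle window on top of inference time.
max_extra_cost_per_hour = calls_per_hour * idle_timeout_seconds * price_per_second
print(f"${max_extra_cost_per_hour:.2f}/hour")
```

Running the same numbers against your real traffic pattern tells you whether a longer idle timeout (or an always-on replica) is worth the extra billable time.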
How to make the call faster?
There are several things you can tweak to make your inferences faster. Let's go through some of them.
The green sections tend to become slower when networking is a bottleneck. This is especially the case for models like Whisper, where you usually send large audio files to the model. To mitigate this, we recommend storing the actual data in a storage bucket such as S3: send the URL in the payload, then download the audio file inside the inference function.
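A minimal sketch of this pattern, assuming the payload carries a presigned URL in an `audio_url` field (the field name and handler signature are illustrative, not a Banana API):

```python
import urllib.request

def handler(payload: dict) -> dict:
    """Download the audio from a URL in the payload instead of
    receiving the raw bytes in the request body itself."""
    audio_url = payload["audio_url"]  # e.g. a presigned S3 URL
    with urllib.request.urlopen(audio_url) as resp:
        audio_bytes = resp.read()
    # ... run the model on audio_bytes here ...
    return {"bytes_received": len(audio_bytes)}

# Example, with a data: URL standing in for a presigned S3 URL.
result = handler({"audio_url": "data:,hello"})
print(result)  # → {'bytes_received': 5}
```

The client now only sends a short URL over the wire, and the (usually much faster) bucket-to-GPU link carries the heavy payload.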
Banana has a few in-house technologies to reduce cold start times, most notably model optimisations and Turboboot. Also note that model size matters: a smaller model simply loads into memory faster than a bigger one.
The inference time itself is something you have full control over. There are many open-source tools you can leverage to make your model as fast as possible, for example TensorRT and SafeTensors.
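Before optimizing, it helps to measure where the time actually goes. A simple sketch using only the standard library; `run_model` is a stand-in for your real inference code:

```python
import time

def run_model(inputs: dict) -> dict:
    # Stand-in for the actual model call (e.g. a TensorRT engine).
    time.sleep(0.01)
    return {"ok": True}

# Time just the inference section, excluding networking and loading.
start = time.perf_counter()
output = run_model({"prompt": "hello"})
inference_seconds = time.perf_counter() - start
print(f"inference took {inference_seconds * 1000:.1f} ms")
```

Timing the inference section separately from download and model-loading time tells you which of the levers above is worth pulling first.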
You can gain a little additional control over when inference starts by using Potassium's built-in "warmup" functionality, if that fits your use case and user experience.