A guide to understanding the different parts of an inference call
After tinkering with Banana for a while, you might reach the point where you want a better understanding of how inference calls work. Maybe you want to do some optimisations to make your calls faster, or you're just curious. Either way, this is the place for you.
There are two types of inference calls: calls with cold starts and calls with warm starts.
A cold start occurs when there are no instances running for a particular model. This means an instance needs to be started and the model weights need to be loaded into GPU memory. The time this requires is called the cold start time.
After the result from your inference task has been returned, your model enters what's called an idle timeout. Essentially, after completing a task, your model waits X seconds for a new task to arrive. If a new task arrives within this window, you get a call with a warm start. If not, the instance is shut down.
A warm start occurs when there is a running instance and the model is loaded into GPU memory. This happens in two cases:
1. You have a model that just finished an inference and is now in its idle timeout
2. You have an always-on replica that isn't doing an inference job
The main difference is that you save time by not having to start an instance and load the model into memory. But note that keeping an instance warm increases your billable GPU time.
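To make the trade-off concrete, here is a minimal Python sketch of the rules described above. The timing constants are hypothetical placeholders, not Banana's real numbers; only the logic (warm calls skip the cold start, warmth depends on the idle timeout or an always-on replica) mirrors the text.

```python
# Hypothetical timings in seconds -- placeholders, not Banana's real numbers.
COLD_START = 10.0    # boot an instance and load the weights into GPU memory
INFERENCE = 1.5      # the model's forward pass itself
IDLE_TIMEOUT = 30.0  # how long a finished instance waits for the next task

def is_warm(seconds_since_last_call: float, always_on: bool = False) -> bool:
    """A call gets a warm start if an always-on replica exists, or the
    previous call finished within the idle timeout window."""
    return always_on or seconds_since_last_call <= IDLE_TIMEOUT

def call_latency(warm: bool) -> float:
    """Total latency seen by the caller: warm calls skip the cold start."""
    return INFERENCE if warm else COLD_START + INFERENCE

print(call_latency(is_warm(10.0)))  # within the idle window -> 1.5
print(call_latency(is_warm(60.0)))  # idle timeout elapsed -> 11.5
```

This is also why warm instances cost more: the idle window itself is GPU time you pay for, whether or not a new task shows up.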
There are several things you can tweak to make your inferences faster. Let's go through some of them.
1. The green sections tend to become slower when networking is a bottleneck. This is especially the case for models like Whisper, where you usually send large audio files to the model. To mitigate this, we recommend storing the actual data in a storage bucket like S3, sending the URL in the payload, and downloading the audio file in the inference function.
2. Banana has a few in-house technologies to reduce cold start times, most notably model optimisations and Turboboot. Also note that model size matters: a smaller model is simply faster to load into memory than a bigger one.
3. The inference time itself is something you have full control over. There are many open-source tools you can leverage to make your model as fast as possible; a few examples are TensorRT and SafeTensors.
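The bucket-plus-URL pattern from point 1 can be sketched as a handler that receives a URL and fetches the file itself. The handler name and payload keys here are hypothetical, not Banana's actual interface, and the model call is stubbed out; the point is only the shape of the pattern.

```python
import os
import urllib.request

def inference(model_inputs: dict) -> dict:
    # Hypothetical handler: the payload carries a URL to the audio file
    # (e.g. an S3 presigned URL) instead of the raw audio bytes.
    audio_url = model_inputs["audio_url"]

    # Download the file inside the inference function, on the GPU instance.
    local_path, _ = urllib.request.urlretrieve(audio_url)

    # ... run the actual model on local_path here; as a stand-in we just
    # report how many bytes arrived.
    return {"bytes_received": os.path.getsize(local_path)}
```

The request payload stays tiny (a URL string) regardless of the audio's size; the heavy transfer happens between the bucket and the instance rather than through the call itself.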