
Marius F Killinger
Articles
-
Mar 6, 2024 | baseten.co | Pankaj Gupta | Philip Kiely | Phil Cardy | Scarlet Howes | Marius F Killinger
Large language models (LLMs) have billions of parameters, each of which is a number that has to be stored, read, and processed in a consistent format while running the model. There are multiple data formats, or precisions, that you can use for model parameters. By default, most LLM inference uses a data format called FP16, meaning that the model’s components (weights, activations, KV cache) are all expressed as 16-bit floating point numbers.
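As a rough illustration of why precision matters, here is a minimal sketch of the weight-memory arithmetic implied above. The helper name weight_memory_gb and the 7-billion-parameter model size are hypothetical choices for the example; the bytes-per-parameter figures follow directly from the bit widths of the named formats.

```python
# Approximate VRAM needed just for model weights at different precisions.
# Bytes per parameter: FP32 = 4, FP16 = 2, FP8/INT8 = 1.
PRECISION_BYTES = {"fp32": 4, "fp16": 2, "fp8": 1}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weight memory in gigabytes (weights only; activations and KV cache add more)."""
    return num_params * PRECISION_BYTES[precision] / 1e9

# A 7B-parameter model as an illustrative size (hypothetical, not from the article):
for p in ("fp32", "fp16", "fp8"):
    print(f"{p}: {weight_memory_gb(7e9, p):.0f} GB")
# fp32: 28 GB, fp16: 14 GB, fp8: 7 GB
```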
-
Mar 1, 2024 | baseten.co | Phil Cardy | Scarlet Howes | Philip Kiely | Marius F Killinger | Abu Qader
Directly optimizing an ML model is essential for high-performance inference, but the infrastructure used to serve that optimized model can have just as much — or even more — impact on how the model performs end-to-end in production.
-
Feb 20, 2024 | baseten.co | Marius F Killinger | Philip Kiely | Abu Qader
GPU utilization is the measure of how much of a GPU’s resources are in use at any given time during a workload. When running ML models, we want to maximize GPU utilization to decrease the cost of serving high-traffic model endpoints. If you get more performance from each GPU, you’re able to serve the same traffic with fewer GPUs, saving on model hosting costs. Imagine you’re at the office with your entire team — let’s say 12 people.
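One common way to observe the utilization metric described here is to poll NVIDIA's NVML library. The sketch below uses the pynvml bindings (installed as nvidia-ml-py) and a simple one-second polling loop; both the library choice and the loop are assumptions for illustration, not something drawn from the article itself.

```python
# A minimal sketch of polling GPU utilization via NVML (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on the machine

for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent over the last sample window
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"compute: {util.gpu}%  memory used: {mem.used / mem.total:.0%}")
    time.sleep(1)

pynvml.nvmlShutdown()
```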