
Marius F Killinger
Articles
-
Mar 6, 2024 | baseten.co | Pankaj Gupta | Philip Kiely | Phil Cardy | Scarlet Howes | Marius F Killinger
Large language models (LLMs) have billions of parameters, each of which is a number that has to be stored, read, and processed in a consistent format while running the model. There are multiple data formats, or precisions, that you can use for model parameters. By default, most LLM inference uses a data format called FP16, meaning that the model’s components (weights, activations, KV cache) are all expressed as 16-bit floating point numbers.
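As a rough illustration of why precision matters, here is a minimal sketch of the weight-memory arithmetic implied above. The helper name weight_memory_gb and the 7-billion-parameter model size are hypothetical choices for the example; the bytes-per-parameter figures follow directly from the bit widths of the named formats.

```python
# Approximate VRAM needed just for model weights at different precisions.
# Bytes per parameter: FP32 = 4, FP16 = 2, FP8/INT8 = 1.
PRECISION_BYTES = {"fp32": 4, "fp16": 2, "fp8": 1}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weight memory in gigabytes (weights only; activations and KV cache add more)."""
    return num_params * PRECISION_BYTES[precision] / 1e9

# A 7B-parameter model as an illustrative size (hypothetical, not from the article):
for p in ("fp32", "fp16", "fp8"):
    print(f"{p}: {weight_memory_gb(7e9, p):.0f} GB")
# fp32: 28 GB, fp16: 14 GB, fp8: 7 GB
```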
-
Mar 1, 2024 | baseten.co | Phil Cardy | Scarlet Howes | Philip Kiely | Marius F Killinger | Abu Qader
Directly optimizing an ML model is essential for high-performance inference, but the infrastructure used to serve that optimized model can have just as much — or even more — impact on how the model performs end-to-end in production.
-
Feb 20, 2024 | baseten.co | Marius F Killinger | Philip Kiely | Abu Qader
GPU utilization is the measure of how much of a GPU’s resources are in use at any given time during a workload. When running ML models, we want to maximize GPU utilization to decrease the cost of serving high-traffic model endpoints. If you get more performance from each GPU, you’re able to serve the same traffic with fewer GPUs, saving on model hosting costs. Imagine you’re at the office with your entire team — let’s say 12 people.
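One common way to observe the utilization metric described here is to poll NVIDIA's NVML library. The sketch below uses the pynvml bindings (installed as nvidia-ml-py) and a simple one-second polling loop; both the library choice and the loop are assumptions for illustration, not something drawn from the article itself.

```python
# A minimal sketch of polling GPU utilization via NVML (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on the machine

for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent over the last sample window
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"compute: {util.gpu}%  memory used: {mem.used / mem.total:.0%}")
    time.sleep(1)

pynvml.nvmlShutdown()
```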