Phil Cardy, Scarlet Howes
Articles
-
May 9, 2024 |
baseten.co | Philip Kiely | Matt Howard | Pankaj Gupta | Phil Cardy, Scarlet Howes
Tokens per second (TPS) is one of the most important metrics we track for LLM performance. When comparing performance across two different LLMs, you need to adjust TPS based on the models’ tokenizers. The tokenizer is a small model that takes human-readable input text and turns it into the tokens that an LLM uses for inference. Different LLMs have different tokenizers of varying levels of efficiency: Llama 3 models use a variant of tiktoken, a tokenizer developed by OpenAI and used by GPT-4.
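The TPS adjustment described above can be sketched in a few lines. This is a minimal illustration, not a benchmark: the function name `adjusted_tps` and the token counts are assumptions chosen to show the arithmetic, not measurements of any real model.

```python
# Hedged sketch: normalizing raw tokens-per-second (TPS) so two models
# with different tokenizers can be compared fairly. All numbers below
# are illustrative assumptions, not measurements.

def adjusted_tps(raw_tps: float, tokens_for_text: int, baseline_tokens: int) -> float:
    """Scale raw TPS by tokenizer efficiency relative to a baseline tokenizer.

    If a model's tokenizer needs more tokens to encode the same text,
    each token carries less information, so its raw TPS overstates speed.
    """
    return raw_tps * baseline_tokens / tokens_for_text

# Suppose the same prompt encodes to different token counts under
# two hypothetical tokenizers:
tokens_model_a = 100  # more efficient tokenizer (fewer tokens)
tokens_model_b = 125  # less efficient tokenizer

# Both models report 80 raw TPS; normalize both to model A's tokenizer:
tps_a = adjusted_tps(80.0, tokens_model_a, baseline_tokens=tokens_model_a)  # 80.0
tps_b = adjusted_tps(80.0, tokens_model_b, baseline_tokens=tokens_model_a)  # 64.0
```

On this toy input, model B's effective speed is 20% lower even though both models report the same raw TPS, because its tokenizer spends more tokens on the same text.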
-
Apr 5, 2024 |
baseten.co | Matt Howard | Philip Kiely | Pankaj Gupta | Phil Cardy, Scarlet Howes
Batch inference is essential for serving LLMs and other generative models in production. If you only run one request at a time through a GPU, most of its capacity is sitting idle. Running multiple inputs through the model simultaneously uses more of the GPU’s resources to massively increase the throughput of your model deployment. However, it’s important to choose the right strategy for batch inference to make sure your model still performs well on other important metrics like latency.
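The throughput gain from batching can be sketched with a simple timing model. This is a hedged illustration: the function `decode_throughput` and its millisecond parameters are assumptions for demonstration, not measurements of any real GPU or model.

```python
# Hedged sketch of why batch inference raises GPU throughput. The timing
# numbers (per_step_ms, per_request_ms) are illustrative assumptions.

def decode_throughput(batch_size: int, per_step_ms: float, per_request_ms: float) -> float:
    """Tokens generated per second across a batch for one decoding step.

    Until compute or memory bandwidth saturates, a GPU runs a whole batch
    in roughly the time of a single request plus a small per-request cost,
    so tokens/second grows almost linearly with batch size.
    """
    step_time_ms = per_step_ms + per_request_ms * batch_size
    return batch_size * 1000.0 / step_time_ms

single = decode_throughput(1, per_step_ms=20.0, per_request_ms=1.0)   # ~47.6 tok/s
batched = decode_throughput(8, per_step_ms=20.0, per_request_ms=1.0)  # ~285.7 tok/s
```

In this toy model, batching 8 requests raises throughput 6x while per-step latency grows only from 21 ms to 28 ms, which is the latency/throughput trade-off the excerpt describes.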
-
Mar 6, 2024 |
baseten.co | Pankaj Gupta | Philip Kiely | Phil Cardy, Scarlet Howes | Marius F Killinger
Large language models (LLMs) have billions of parameters, each of which is a number that has to be stored, read, and processed in a consistent format while running the model. There are multiple data formats, or precisions, that you can use for model parameters. By default, most LLM inference uses a data format called FP16, meaning that the model’s components (weights, activations, KV cache) are all expressed as 16-bit floating point numbers.
-
Mar 1, 2024 |
baseten.co | Phil Cardy, Scarlet Howes | Philip Kiely | Marius F Killinger | Abu Qader
Directly optimizing an ML model is essential for high-performance inference, but the infrastructure used to serve that optimized model can have just as much — or even more — impact on how the model performs end-to-end in production.