Phil Cardy, Scarlet Howes
Articles
-
May 9, 2024 |
baseten.co | Philip Kiely | Matt Howard | Pankaj Gupta | Phil Cardy, Scarlet Howes
Tokens per second (TPS) is one of the most important metrics we track for LLM performance. When comparing performance across two different LLMs, you need to adjust TPS based on the models’ tokenizers. The tokenizer is a small model that takes human-readable input text and turns it into the tokens that an LLM uses for inference. Different LLMs have different tokenizers of varying levels of efficiency: Llama 3 models use a variant of tiktoken, a tokenizer developed by OpenAI and used by GPT-4.
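The TPS adjustment described above can be sketched in a few lines. This is a minimal illustration, not a benchmark: the function name `adjusted_tps` and the token counts are assumptions chosen to show the arithmetic, not measurements of any real model.

```python
# Hedged sketch: normalizing raw tokens-per-second (TPS) so two models
# with different tokenizers can be compared fairly. All numbers below
# are illustrative assumptions, not measurements.

def adjusted_tps(raw_tps: float, tokens_for_text: int, baseline_tokens: int) -> float:
    """Scale raw TPS by tokenizer efficiency relative to a baseline tokenizer.

    If a model's tokenizer needs more tokens to encode the same text,
    each token carries less information, so its raw TPS overstates speed.
    """
    return raw_tps * baseline_tokens / tokens_for_text

# Suppose the same prompt encodes to different token counts under
# two hypothetical tokenizers:
tokens_model_a = 100  # more efficient tokenizer (fewer tokens)
tokens_model_b = 125  # less efficient tokenizer

# Both models report 80 raw TPS; normalize both to model A's tokenizer:
tps_a = adjusted_tps(80.0, tokens_model_a, baseline_tokens=tokens_model_a)  # 80.0
tps_b = adjusted_tps(80.0, tokens_model_b, baseline_tokens=tokens_model_a)  # 64.0
```

On this toy input, model B's effective speed is 20% lower even though both models report the same raw TPS, because its tokenizer spends more tokens on the same text.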
-
Apr 5, 2024 |
baseten.co | Matt Howard | Philip Kiely | Pankaj Gupta | Phil Cardy, Scarlet Howes
Batch inference is essential for serving LLMs and other generative models in production. If you only run one request at a time through a GPU, most of its capacity is sitting idle. Running multiple inputs through the model simultaneously uses more of the GPU’s resources to massively increase the throughput of your model deployment. However, it’s important to choose the right strategy for batch inference to make sure your model still performs well on other important metrics like latency.
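The throughput gain from batching can be sketched with a simple timing model. This is a hedged illustration: the function `decode_throughput` and its millisecond parameters are assumptions for demonstration, not measurements of any real GPU or model.

```python
# Hedged sketch of why batch inference raises GPU throughput. The timing
# numbers (per_step_ms, per_request_ms) are illustrative assumptions.

def decode_throughput(batch_size: int, per_step_ms: float, per_request_ms: float) -> float:
    """Tokens generated per second across a batch for one decoding step.

    Until compute or memory bandwidth saturates, a GPU runs a whole batch
    in roughly the time of a single request plus a small per-request cost,
    so tokens/second grows almost linearly with batch size.
    """
    step_time_ms = per_step_ms + per_request_ms * batch_size
    return batch_size * 1000.0 / step_time_ms

single = decode_throughput(1, per_step_ms=20.0, per_request_ms=1.0)   # ~47.6 tok/s
batched = decode_throughput(8, per_step_ms=20.0, per_request_ms=1.0)  # ~285.7 tok/s
```

In this toy model, batching 8 requests raises throughput 6x while per-step latency grows only from 21 ms to 28 ms, which is the latency/throughput trade-off the excerpt describes.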
-
Mar 6, 2024 |
baseten.co | Pankaj Gupta | Philip Kiely | Phil Cardy, Scarlet Howes | Marius F Killinger
Large language models (LLMs) have billions of parameters, each of which is a number that has to be stored, read, and processed in a consistent format while running the model. There are multiple data formats, or precisions, that you can use for model parameters. By default, most LLM inference uses a data format called FP16, meaning that the model’s components (weights, activations, KV cache) are all expressed as 16-bit floating point numbers.
-
Mar 1, 2024 |
baseten.co | Phil Cardy, Scarlet Howes | Philip Kiely | Marius F Killinger | Abu Qader
Directly optimizing an ML model is essential for high-performance inference, but the infrastructure used to serve that optimized model can have just as much — or even more — impact on how the model performs end-to-end in production.