
Philip Kiely
Articles
-
Oct 22, 2024 | baseten.co | Pankaj Gupta | Philip Kiely | Matt Howard
The NVIDIA H200 Tensor Core GPU is a high-end datacenter-grade GPU designed for AI workloads. The big sibling of the popular H100 GPU, the H200 offers more GPU memory and memory bandwidth on an equivalent compute profile. While H200 GPUs are eagerly anticipated for training, fine-tuning, and other long-running AI workloads, we wanted to see how they perform for inference tasks. With our friends at Lambda, we tested Mistral Large, a 123-billion-parameter model, on an 8xH200 GPU cluster.
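A back-of-envelope sketch of why the H200's extra memory matters for a model this size, assuming FP16 weights (141 GB per GPU is the H200's published spec; real deployments also budget memory for the KV cache, activations, and framework overhead):

```python
# Rough memory math for Mistral Large (123B params) on an 8xH200 cluster.
# Assumptions: FP16 weights (2 bytes/param), 141 GB HBM3e per H200.

PARAMS = 123e9
BYTES_PER_PARAM_FP16 = 2
GPUS = 8
H200_MEM_GB = 141

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9  # ~246 GB of weights
cluster_mem_gb = GPUS * H200_MEM_GB               # 1128 GB total
headroom_gb = cluster_mem_gb - weights_gb         # left for KV cache etc.

print(f"Weights: {weights_gb:.0f} GB | cluster: {cluster_mem_gb} GB | "
      f"headroom: {headroom_gb:.0f} GB")
```

The same cluster built from 80 GB H100s would have 640 GB total, leaving far less headroom for KV cache, which is what lets you run larger batches and longer contexts on H200s.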
-
Sep 12, 2024 | baseten.co | Bryce Dubayah | Philip Kiely | Abu Qader | Pankaj Gupta
Today, we announced support for function calling and structured output for LLMs deployed with our TensorRT-LLM Engine Builder. This adds support at the model server level for two key features. Function calling, also known as "tool use," lets you pass a set of defined tools to an LLM as part of the request body; based on the prompt, the model selects and returns the most appropriate function/tool from the provided options.
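A minimal sketch of what a function-calling request can look like, assuming an OpenAI-style tools schema; the endpoint URL, API key, and the get_weather tool are placeholders for illustration, not the exact Engine Builder API:

```python
import requests

# Placeholder endpoint and key; the "tools" field follows the common
# OpenAI-style schema, which is an assumption here.
resp = requests.post(
    "https://model-xxxxxx.api.baseten.co/production/predict",  # placeholder URL
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={
        "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "description": "Look up current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    },
)
# The model should respond with a call to get_weather and {"city": "Paris"},
# which your application code then executes.
print(resp.json())
```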
-
Aug 20, 2024 | baseten.co | Abu Qader | Philip Kiely | Pankaj Gupta
Everyone wants more tokens per second for their LLMs. There are reliable ways to make inference faster, like using an H100 GPU or TensorRT-LLM. Then there are techniques like quantization that are increasingly robust, but you have to be careful not to ruin model output quality in pursuit of speed. After these kinds of fundamental optimizations are in place, more speed requires implementing cutting-edge inference techniques and managing the tradeoffs that come with them.
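To see why quantization buys speed, a rough roofline estimate helps: LLM decoding is typically memory-bandwidth bound, so per-token latency scales with the bytes of weights read each step, and halving weight precision roughly doubles the throughput ceiling. The 70B model size and H100 bandwidth below are illustrative assumptions:

```python
# Roofline estimate: per-token decode latency is approximately
# (bytes of weights read) / (GPU memory bandwidth).
# Illustrative numbers: 70B-param model, H100 SXM HBM3 at ~3.35 TB/s.

PARAMS = 70e9
BANDWIDTH_BYTES_PER_S = 3.35e12

for name, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    weight_bytes = PARAMS * bytes_per_param
    # Upper bound on single-request decode speed; ignores KV cache traffic,
    # batching, and kernel efficiency, all of which lower it in practice.
    tokens_per_s = BANDWIDTH_BYTES_PER_S / weight_bytes
    print(f"{name}: ~{tokens_per_s:.0f} tokens/s ceiling per request")
```

This is exactly why the quality tradeoff matters: the speedup comes from storing less information per weight, so you have to validate output quality after quantizing.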
-
Jul 22, 2024 | baseten.co | Pankaj Gupta | Philip Kiely | Abu Qader | Justin Yi
You can now swap LoRAs during LLM inference with TensorRT-LLM on Baseten. This means you can serve thousands of fine-tuned variants of an LLM from a single GPU while maintaining a low time to first token (TTFT) and a high tokens per second (TPS). Let’s say you’re building a product that requires many fine-tuned models, such as a support system with a specialized LLM per customer or a chatbot that’s adjusted to each user’s preferences.
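A hypothetical request shape for that pattern: the adapter is selected per request, so one base model on one GPU serves many fine-tunes. The lora_adapter field, endpoint, and helper below are illustrative, not Baseten's exact API:

```python
import requests

def ask(prompt: str, customer_id: str) -> str:
    """Query one shared base model, selecting a per-customer LoRA adapter."""
    resp = requests.post(
        "https://model-xxxxxx.api.baseten.co/production/predict",  # placeholder
        headers={"Authorization": "Api-Key YOUR_API_KEY"},
        json={
            "prompt": prompt,
            "lora_adapter": f"support-bot-{customer_id}",  # hypothetical field
            "max_tokens": 256,
        },
    )
    return resp.json()["output"]

# Two customers hit the same GPU, but each gets their own fine-tuned behavior.
print(ask("How do I reset my password?", customer_id="acme"))
print(ask("How do I reset my password?", customer_id="globex"))
```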
-
May 29, 2024 | baseten.co | Colin McGrath | Matt Howard | Philip Kiely | Pankaj Gupta
Building worldwide multi-cloud AI model serving infrastructure requires powerful but flexible abstractions. One of the core abstractions in Baseten's infra architecture is the idea of a control plane and workload planes. Both the control and workload planes are Kubernetes clusters, but they follow a strict separation of concerns: the control plane is a single Kubernetes cluster that serves as the backend for Baseten's user interface and model management API endpoints.
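An illustrative sketch of that separation of concerns; the types and methods here are hypothetical stand-ins, not Baseten's actual code. The control plane holds desired state and user-facing APIs, while each workload plane only runs inference workloads in its own cluster:

```python
from dataclasses import dataclass, field

@dataclass
class WorkloadPlane:
    region: str
    cloud: str  # e.g. "aws" or "gcp" -- multi-cloud by design
    deployments: dict = field(default_factory=dict)

    def reconcile(self, model_id: str, replicas: int) -> None:
        # In a real system this would drive Kubernetes resources
        # inside this cluster; here we just record the desired state.
        self.deployments[model_id] = replicas

@dataclass
class ControlPlane:
    planes: list[WorkloadPlane]

    def deploy(self, model_id: str, replicas: int, region: str) -> None:
        # The control plane never runs inference itself; it accepts the
        # user-facing request and pushes state to the matching workload plane.
        for plane in self.planes:
            if plane.region == region:
                plane.reconcile(model_id, replicas)

cp = ControlPlane([WorkloadPlane("us-east-1", "aws"),
                   WorkloadPlane("europe-west4", "gcp")])
cp.deploy("mistral-large", replicas=2, region="us-east-1")
print(cp.planes[0].deployments)  # {'mistral-large': 2}
```

One payoff of this split is blast-radius control: a workload plane can fail or be upgraded without taking down the API and UI, and new regions or clouds are added by attaching another workload plane.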