
Philip Kiely
Articles
-
Oct 22, 2024 | baseten.co | Pankaj Gupta | Philip Kiely | Matt Howard
The NVIDIA H200 Tensor Core GPU is a high-end datacenter-grade GPU designed for AI workloads. The big sibling of the popular H100 GPU, the H200 offers more GPU memory and memory bandwidth on an equivalent compute profile. While H200 GPUs are eagerly anticipated for training, fine-tuning, and other long-running AI workloads, we wanted to see how they perform for inference tasks. With our friends at Lambda, we tested Mistral Large, a 123-billion-parameter model, on an 8xH200 GPU cluster.
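A back-of-envelope sketch of why the H200's extra memory matters for a model this size, assuming FP16 weights (141 GB per GPU is the H200's published spec; real deployments also budget memory for the KV cache, activations, and framework overhead):

```python
# Rough memory math for Mistral Large (123B params) on an 8xH200 cluster.
# Assumptions: FP16 weights (2 bytes/param), 141 GB HBM3e per H200.

PARAMS = 123e9
BYTES_PER_PARAM_FP16 = 2
GPUS = 8
H200_MEM_GB = 141

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9  # ~246 GB of weights
cluster_mem_gb = GPUS * H200_MEM_GB               # 1128 GB total
headroom_gb = cluster_mem_gb - weights_gb         # left for KV cache etc.

print(f"Weights: {weights_gb:.0f} GB | cluster: {cluster_mem_gb} GB | "
      f"headroom: {headroom_gb:.0f} GB")
```

The same cluster built from 80 GB H100s would have 640 GB total, leaving far less headroom for KV cache, which is what lets you run larger batches and longer contexts on H200s.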
-
Sep 12, 2024 | baseten.co | Bryce Dubayah | Philip Kiely | Abu Qader | Pankaj Gupta
Today, we announced support for function calling and structured output for LLMs deployed with our TensorRT-LLM Engine Builder. This adds support at the model server level for two key features. Function calling, also known as "tool use," lets you pass a set of defined tools to an LLM as part of the request body; based on the prompt, the model selects and returns the most appropriate function/tool from the provided options.
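A minimal sketch of what a function-calling request can look like, assuming an OpenAI-style tools schema; the endpoint URL, API key, and the get_weather tool are placeholders for illustration, not the exact Engine Builder API:

```python
import requests

# Placeholder endpoint and key; the "tools" field follows the common
# OpenAI-style schema, which is an assumption here.
resp = requests.post(
    "https://model-xxxxxx.api.baseten.co/production/predict",  # placeholder URL
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={
        "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "description": "Look up current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    },
)
# The model should respond with a call to get_weather and {"city": "Paris"},
# which your application code then executes.
print(resp.json())
```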
-
Aug 20, 2024 | baseten.co | Abu Qader | Philip Kiely | Pankaj Gupta
Everyone wants more tokens per second for their LLMs. There are reliable ways to make inference faster, like using an H100 GPU or TensorRT-LLM. Then there are techniques like quantization that are increasingly robust, but you have to be careful not to ruin model output quality in pursuit of speed. After these kinds of fundamental optimizations are in place, more speed requires implementing cutting-edge inference techniques and managing the tradeoffs that come with them.
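To see why quantization buys speed, a rough roofline estimate helps: LLM decoding is typically memory-bandwidth bound, so per-token latency scales with the bytes of weights read each step, and halving weight precision roughly doubles the throughput ceiling. The 70B model size and H100 bandwidth below are illustrative assumptions:

```python
# Roofline estimate: per-token decode latency is approximately
# (bytes of weights read) / (GPU memory bandwidth).
# Illustrative numbers: 70B-param model, H100 SXM HBM3 at ~3.35 TB/s.

PARAMS = 70e9
BANDWIDTH_BYTES_PER_S = 3.35e12

for name, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    weight_bytes = PARAMS * bytes_per_param
    # Upper bound on single-request decode speed; ignores KV cache traffic,
    # batching, and kernel efficiency, all of which lower it in practice.
    tokens_per_s = BANDWIDTH_BYTES_PER_S / weight_bytes
    print(f"{name}: ~{tokens_per_s:.0f} tokens/s ceiling per request")
```

This is exactly why the quality tradeoff matters: the speedup comes from storing less information per weight, so you have to validate output quality after quantizing.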
-
Jul 22, 2024 | baseten.co | Pankaj Gupta | Philip Kiely | Abu Qader | Justin Yi
You can now swap LoRAs during LLM inference with TensorRT-LLM on Baseten. This means you can serve thousands of fine-tuned variants of an LLM from a single GPU while maintaining a low time to first token (TTFT) and a high tokens per second (TPS). Let’s say you’re building a product that requires many fine-tuned models, such as a support system with a specialized LLM per customer or a chatbot that’s adjusted to each user’s preferences.
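A hypothetical request shape for that pattern: the adapter is selected per request, so one base model on one GPU serves many fine-tunes. The lora_adapter field, endpoint, and helper below are illustrative, not Baseten's exact API:

```python
import requests

def ask(prompt: str, customer_id: str) -> str:
    """Query one shared base model, selecting a per-customer LoRA adapter."""
    resp = requests.post(
        "https://model-xxxxxx.api.baseten.co/production/predict",  # placeholder
        headers={"Authorization": "Api-Key YOUR_API_KEY"},
        json={
            "prompt": prompt,
            "lora_adapter": f"support-bot-{customer_id}",  # hypothetical field
            "max_tokens": 256,
        },
    )
    return resp.json()["output"]

# Two customers hit the same GPU, but each gets their own fine-tuned behavior.
print(ask("How do I reset my password?", customer_id="acme"))
print(ask("How do I reset my password?", customer_id="globex"))
```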
-
May 29, 2024 | baseten.co | Colin McGrath | Matt Howard | Philip Kiely | Pankaj Gupta
Building worldwide multi-cloud AI model serving infrastructure requires powerful but flexible abstractions. One of the core abstractions in Baseten's infra architecture is the idea of a control plane and workload planes. Both the control and workload planes are Kubernetes clusters, but they follow a strict separation of concerns: the control plane is a single Kubernetes cluster that serves as the backend for Baseten's user interface and model management API endpoints.
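An illustrative sketch of that separation of concerns; the types and methods here are hypothetical stand-ins, not Baseten's actual code. The control plane holds desired state and user-facing APIs, while each workload plane only runs inference workloads in its own cluster:

```python
from dataclasses import dataclass, field

@dataclass
class WorkloadPlane:
    region: str
    cloud: str  # e.g. "aws" or "gcp" -- multi-cloud by design
    deployments: dict = field(default_factory=dict)

    def reconcile(self, model_id: str, replicas: int) -> None:
        # In a real system this would drive Kubernetes resources
        # inside this cluster; here we just record the desired state.
        self.deployments[model_id] = replicas

@dataclass
class ControlPlane:
    planes: list[WorkloadPlane]

    def deploy(self, model_id: str, replicas: int, region: str) -> None:
        # The control plane never runs inference itself; it accepts the
        # user-facing request and pushes state to the matching workload plane.
        for plane in self.planes:
            if plane.region == region:
                plane.reconcile(model_id, replicas)

cp = ControlPlane([WorkloadPlane("us-east-1", "aws"),
                   WorkloadPlane("europe-west4", "gcp")])
cp.deploy("mistral-large", replicas=2, region="us-east-1")
print(cp.planes[0].deployments)  # {'mistral-large': 2}
```

One payoff of this split is blast-radius control: a workload plane can fail or be upgraded without taking down the API and UI, and new regions or clouds are added by attaching another workload plane.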