
Abu Qader

Featured in:

Articles

  • Sep 12, 2024 | baseten.co | Bryce Dubayah | Philip Kiely | Abu Qader | Pankaj Gupta

    Today, we announced support for function calling and structured output for LLMs deployed with our TensorRT-LLM Engine Builder. This adds support at the model server level for two key features. Function calling, also known as “tool use,” lets you pass a set of defined tools to an LLM as part of the request body. Based on the prompt, the model selects and returns the most appropriate function/tool from the provided options.
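
    A minimal sketch of such a request, assuming an OpenAI-compatible chat completions endpoint; the URL, model name, and API key below are placeholders, not Baseten's actual values.

    ```python
    # Hedged sketch: pass a list of tools in the request body; based on the
    # prompt, the model can return a tool call instead of free-form text.
    import requests

    payload = {
        "model": "my-llm",  # placeholder deployment name
        "messages": [{"role": "user", "content": "What's the weather in Chicago?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

    resp = requests.post(
        "https://example.com/v1/chat/completions",  # placeholder endpoint
        headers={"Authorization": "Api-Key YOUR_API_KEY"},
        json=payload,
    )
    # When the model selects a tool, the message carries the chosen function
    # name and its JSON-encoded arguments.
    print(resp.json()["choices"][0]["message"].get("tool_calls"))
    ```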

  • Aug 20, 2024 | baseten.co | Abu Qader | Philip Kiely | Pankaj Gupta

    Everyone wants more tokens per second for their LLMs. There are reliable ways to make inference faster, like using an H100 GPU or TensorRT-LLM. Then there are techniques like quantization that are increasingly robust, but you have to be careful not to ruin model output quality in pursuit of speed. After these kinds of fundamental optimizations are in place, more speed requires implementing cutting-edge inference techniques and managing the tradeoffs that come with them.

  • Jul 22, 2024 | baseten.co | Pankaj Gupta | Philip Kiely | Abu Qader | Justin Yi

    You can now swap LoRAs during LLM inference with TensorRT-LLM on Baseten. This means you can serve thousands of fine-tuned variants of an LLM from a single GPU while maintaining a low time to first token (TTFT) and a high tokens per second (TPS). Let’s say you’re building a product that requires many fine-tuned models, such as a support system with a specialized LLM per customer or a chatbot that’s adjusted to each user’s preferences.
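
    A hypothetical sketch of what per-request adapter selection could look like; the endpoint and field names below are illustrative assumptions, not Baseten's actual schema.

    ```python
    # Hypothetical sketch: each request names a LoRA adapter, so one base
    # model on a single GPU can serve many fine-tuned variants.
    import requests

    def generate(prompt: str, lora_id: str) -> str:
        resp = requests.post(
            "https://example.com/predict",  # placeholder endpoint
            json={"prompt": prompt, "lora_id": lora_id},  # adapter chosen per request
        )
        return resp.json()["text"]

    # The same base model answers with different fine-tuned behavior.
    print(generate("Summarize my open tickets", lora_id="customer-acme"))
    print(generate("Summarize my open tickets", lora_id="customer-globex"))
    ```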

  • Mar 14, 2024 | baseten.co | Abu Qader | Pankaj Gupta | Justin Yi | Philip Kiely

    Baseten has achieved best-in-class performance for key latency and throughput metrics as measured by independent researchers at Artificial Analysis. The Artificial Analysis benchmark measures essential metrics for model performance. Time to first token (TTFT): the time from when a request is sent to the model to when the first token (or chunk) of output is received. Tokens per second (TPS): the average number of tokens per second received during the entire response.
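
    A minimal client-side sketch of how these two metrics can be measured against a streaming endpoint; the URL and the one-token-per-chunk framing are assumptions for illustration.

    ```python
    # Sketch: time to first token (TTFT) and tokens per second (TPS)
    # measured from a streaming HTTP response.
    import time
    import requests

    start = time.perf_counter()
    first_token_at = None
    tokens = 0

    with requests.post(
        "https://example.com/v1/completions",  # placeholder endpoint
        json={"prompt": "Hello", "stream": True},
        stream=True,
    ) as resp:
        for chunk in resp.iter_lines():
            if not chunk:
                continue
            if first_token_at is None:
                first_token_at = time.perf_counter()  # first chunk arrived
            tokens += 1  # assumes one token (or chunk) per line

    elapsed = time.perf_counter() - start
    print(f"TTFT: {first_token_at - start:.3f}s")
    print(f"TPS:  {tokens / elapsed:.1f}")
    ```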

  • Mar 14, 2024 | baseten.co | Pankaj Gupta | Philip Kiely | Abu Qader | Justin Yi

    Quantization is the process of mapping model parameters from one data format (most commonly FP16 for LLMs) to a smaller data format, like INT8 or FP8. Quantizing a model offers faster, less expensive inference.
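
    A worked example of the mapping quantization performs, shown here as symmetric INT8 quantization in NumPy; real quantizers (per-channel scales, FP8 formats) are more involved, and this sketch only illustrates the core idea.

    ```python
    # Symmetric INT8 quantization: one scale factor maps the largest
    # weight magnitude onto 127, then values are rounded into int8.
    import numpy as np

    weights = np.array([0.82, -1.37, 0.05, 2.41], dtype=np.float32)

    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)

    # Dequantize to see the rounding error the smaller format introduces.
    dq = q.astype(np.float32) * scale
    print(q)                     # int8 values, 4x smaller than FP32
    print(np.abs(weights - dq))  # per-weight quantization error
    ```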
