
LLM Inference Deployment Engineer
Posted 5 days ago

This is a fully remote position, open to applicants in the United States.
Responsibilities
• Deploy and optimize trained large language models (LLMs) such as GPT, LLaMA, Mistral, and Falcon, using resources like Hugging Face.
• Leverage inference runtimes including ONNX Runtime and vLLM to ensure efficient execution.
• Improve LLM scalability in real-time applications by optimizing batching, caching, and tensor parallelism.
• Create and sustain high-performance inference pipelines utilizing Docker, Kubernetes, and various inference servers.
Requirements
• A Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related discipline.
• Proven experience in deploying LLM inference, optimizing models, and engineering runtimes.
• Strong proficiency in LLM inference frameworks such as PyTorch, ONNX Runtime, vLLM, TensorRT-LLM, and DeepSpeed.
• Comprehensive knowledge of the Python programming language for model integration and performance enhancements.
• Solid understanding of high-level model representations with experience in implementing optimizations at the framework level for Generative AI applications.
• Familiarity with containerized AI deployments using tools like Docker, Kubernetes, Triton Inference Server, TensorFlow Serving, and TorchServe.
• Extensive knowledge of memory optimization techniques for LLMs in long-context scenarios.
• Experience with real-time LLM applications, including chatbots, code generation, and retrieval-augmented generation.
Benefits
• Competitive salary and performance-based bonuses.
• Comprehensive health, dental, and vision insurance.
• Opportunities for professional development and continuous learning.
• Flexible work hours and remote work options.
Innovative Solutions