This is a fully remote position, open to applicants in the United States.

📋 Description

• Implement and optimize trained large language models (LLMs) such as GPT, LLaMA, Mistral, and Falcon, using tools such as Hugging Face.

• Leverage inference runtimes including ONNX Runtime and vLLM to ensure efficient execution.

• Improve LLM scalability in real-time applications by optimizing batching, caching, and tensor parallelism.

• Create and sustain high-performance inference pipelines utilizing Docker, Kubernetes, and various inference servers.

⛳️ Requirements

• A Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related discipline.

• Proven experience in deploying LLM inference, optimizing models, and engineering runtimes.

• Strong proficiency in LLM inference frameworks such as PyTorch, ONNX Runtime, vLLM, TensorRT-LLM, and DeepSpeed.

• Comprehensive knowledge of the Python programming language for model integration and performance enhancements.

• Solid understanding of high-level model representations, with experience implementing framework-level optimizations for Generative AI applications.

• Familiarity with containerized AI deployments using tools like Docker, Kubernetes, Triton Inference Server, TensorFlow Serving, and TorchServe.

• Extensive knowledge of memory optimization techniques for LLMs in long-context scenarios.

• Experience with real-time LLM applications, including chatbots, code generation, and retrieval-augmented generation.

🏝️ Benefits

• Competitive salary and performance-based bonuses.

• Comprehensive health, dental, and vision insurance.

• Opportunities for professional development and continuous learning.

• Flexible work hours and remote work options.

LLM Inference Deployment Engineer
