
Senior/Principal AI Performance Engineer
Posted May 6

Responsibilities

• Design, implement, and optimize inference pipelines for large language models and other AI workloads to achieve maximum throughput and minimal latency.
• Utilize cutting-edge optimization methods: quantization (INT4/INT8/FP8), model pruning, speculative decoding, continuous batching, and kernel fusion.
• Enhance inference-serving stacks such as vLLM, TensorRT-LLM, and ONNX Runtime for deployment on CIQ’s OS platform.
• Profile and optimize GPU/accelerator utilization throughout the entire inference stack, including model weights, memory bandwidth, CUDA kernels, and driver overhead.
• Establish performance baselines for inference and implement regression detection across CIQ’s AI-driven solutions.
• Design and refine distributed training pipelines for large-scale models, incorporating data, model, tensor, and pipeline parallelism strategies.
• Improve training efficiency through mixed-precision training, gradient checkpointing, activation recomputation, and enhancements at the optimizer level.
• Benchmark training throughput and scaling efficiency across multi-GPU and multi-node configurations on CIQ’s infrastructure.
• Collaborate with infrastructure and performance teams to identify and resolve training bottlenecks within the network (RDMA/InfiniBand), storage, and OS layers.
• Stay current with new model architectures and training methodologies, including MoE models, RLHF pipelines, and emerging post-training techniques.
• Develop and maintain a library of ready-to-use AI workload examples that operate on CIQ’s platform, covering inference serving, fine-tuning, batch processing, RAG pipelines, and agentic workflows.
• Create both internal reference pipelines for CI/testing and customer-facing examples designed for quick productivity on CIQ’s OS and Fuzzball.
• Package workloads using containers to provide portable, reproducible AI environments across HPC and cloud-native settings.
• Develop engaging, well-documented demonstrations and reference architectures that effectively communicate CIQ’s AI capabilities to both technical and business audiences.
• Collaborate with product and customer success teams to translate practical AI use cases into reusable, production-ready examples.
• Build and maintain AI-driven engineering tools, leveraging LLM-based agents, automated analysis pipelines, and AI-assisted code generation to enhance the broader engineering organization.
• Advocate for an AI-first development culture by identifying areas where AI tools can reduce manual effort, accelerate insights, and enhance software quality across CIQ’s products.
• Assess and incorporate emerging AI frameworks, libraries, and hardware as they become relevant to CIQ’s customers and product roadmap.
• Contribute to open-source AI tools and frameworks where applicable, reinforcing CIQ’s technical reputation within the community.
• Acquire in-depth knowledge of CIQ’s Fuzzball platform, its architecture, scheduling model, and workload execution environment.
• Integrate AI training, inference, and pipeline workloads into Fuzzball-based CI/CD and production pipelines.
• Contribute to Fuzzball’s AI workload narrative, ensuring the platform serves as an optimal environment for running AI workloads efficiently and at scale.
• Assist in characterizing and enhancing Fuzzball’s performance for AI-specific access patterns and resource requirements.
• Develop a comprehensive understanding of the complete CIQ product portfolio, including Rocky Linux, RLC (and its variants), Fuzzball, Apptainer, and Warewulf, and how AI workloads interact with each component.
• Work closely with the Performance Engineering team to ensure that AI workloads benefit from and contribute to CIQ’s systems-level optimization initiatives.
Qualifications

• Extensive, hands-on experience in optimizing LLM inference, including serving frameworks (vLLM, TensorRT-LLM, ONNX Runtime), quantization techniques, and GPU memory management.
• Strong background in distributed AI training, with familiarity with frameworks such as PyTorch FSDP, DeepSpeed, Megatron-LM, or JAX/XLA.
• Proven track record in building production AI pipelines and packaging AI environments for reproducible and portable deployment (containers, Apptainer/Singularity, or equivalent).
• Proficiency with GPU/accelerator profiling tools: NVIDIA Nsight, PyTorch Profiler, CUDA performance analysis, and related tools.
• Knowledge of HPC environments, including job schedulers (Slurm, PBS), parallel filesystems, RDMA/InfiniBand, and MPI, along with the intersection of HPC and modern AI workloads.
• Experience in integrating AI workloads into CI/CD pipelines and developing automated testing and benchmarking frameworks.
• Comfortable using and developing with LLM-based tools and agentic frameworks to enhance engineering productivity.
• Exceptional analytical skills, capable of formulating hypotheses, designing experiments, and deriving actionable conclusions from complex profiling data.
• Strong written and verbal communication abilities, able to present findings to both highly technical audiences and business stakeholders.
• A collaborative, humble, and continuously learning mindset, paired with the confidence to advocate for AI engineering as a priority.
Benefits

• Medical, dental, and vision insurance.
• Flexible paid time off.
• Employee stock options.
• Remote work; no travel required for most positions.