
Senior/Principal Performance Engineer
Posted Jun 19

This is a fully remote position, open to applicants in the United States.
• Design, develop, and sustain extensive benchmarking frameworks that encompass OS, kernel, and application layers.
• Analyze workloads across CPU, memory, I/O, network, and accelerator (GPU/NPU) subsystems to pinpoint bottlenecks and areas for optimization.
• Establish and take ownership of performance baselines throughout CIQ's product and solutions portfolio.
• Utilize AI-assisted tools and agentic workflows to expedite profiling, analysis, and identification of root causes.
• Create and manage automated performance regression-detection pipelines that are integrated into CI/CD workflows using Fuzzball.
• Identify, triage, and resolve regressions in user space, kernel space, and application layers with a sense of urgency and thoroughness.
• Collaborate with engineering teams to trace regressions caused by upstream kernel changes, compiler updates, or library modifications.
• Proactively drive performance enhancements, focusing on forward-looking improvements rather than only reactive fixes, to maintain CIQ's competitive edge across all stack layers.
• Oversee core operating system performance, including kernel subsystem tuning (scheduler, memory management, I/O, networking), system call overhead reduction, and optimizations for user space libraries and runtimes.
• Identify and apply kernel-level enhancements, such as patches, configuration changes, and upstream contributions that yield measurable performance improvements for CIQ's customer workloads.
• Optimize workloads for AI inference and training, including LLM serving, model parallelism, and accelerator utilization.
• Fine-tune performance for HPC workloads, including modeling, simulation, and tightly coupled parallel applications (MPI, OpenMP, etc.).
• Enhance general computing and service workloads—including web services, databases, messaging systems, and other production software on CIQ's OS platform.
• Operate at all stack levels: adjusting compiler flags, kernel parameters, scheduler settings, NUMA topology, memory allocation, and application-level algorithmic improvements.
• Advocate for an AI-first engineering philosophy—utilizing AI tools, agents, and automation to enhance both personal productivity and the quality of performance insights.
• Identify and prioritize optimization opportunities that significantly affect AI training throughput and inference latency/cost.
• Remain up to date on cutting-edge techniques in ML system performance, including quantization, batching strategies, kernel fusion, and hardware-software co-design.
• Develop in-depth expertise in CIQ's Fuzzball platform, focusing on its architecture, scheduling, and workload execution model.
• Integrate performance benchmarks, regression tests, and user-facing workloads into Fuzzball-based pipelines.
• Contribute to the performance characterization of Fuzzball, ensuring minimal overhead and efficient scaling of the platform.
• Gain comprehensive familiarity with CIQ's entire product portfolio, including Rocky Linux and RLC (and its variants), Fuzzball, Apptainer (formerly Singularity), and Warewulf, and understand how performance factors interact across them.
• Collaborate extensively with engineering teams behind each product line to highlight, prioritize, and implement performance improvements that benefit customers throughout the CIQ ecosystem.
• Partner with product and customer success teams to translate real-world performance challenges into engineering priorities and measurable outcomes.
• Clearly document and communicate findings—from low-level profiling data to executive-level summaries.
• Contribute to technical publications, conference presentations, and thought leadership that reinforces CIQ's commitment to performance excellence.
• Deep, principled understanding of operating system internals, including the Linux kernel scheduler, memory subsystem, I/O stack, and networking.
• Proven track record in identifying and resolving performance regressions in both kernel and user space within production settings.
• Practical expertise with profiling and tracing tools such as perf, eBPF/bpftrace, flame graphs, VTune, Nsight, strace, ftrace, and others.
• Strong background in AI/ML workload performance, encompassing inference optimization (TensorRT, ONNX, vLLM, or similar), training efficiency, and GPU/accelerator utilization.
• Experience with HPC workloads, including MPI, OpenMP, parallel filesystems, RDMA/InfiniBand, and job schedulers (Slurm, PBS, etc.).
• Familiarity with modern AI-first development workflows and comfort in using LLM-based tools to accelerate engineering tasks.
• Experience in constructing automated performance testing and regression detection pipelines within CI/CD frameworks.
• Exceptional analytical abilities—capable of forming hypotheses, designing experiments, and deriving actionable insights from complex data.
• Strong written and verbal communication skills; adept at presenting findings to both technical audiences and business stakeholders.
• A collaborative, humble mindset with a commitment to continuous learning, paired with the confidence to advocate for performance as a primary engineering concern.
• Medical, dental, and vision insurance.
• Flexible paid time off.
• Employee stock options.
• Remote work; no travel required for most positions.