This is a fully remote position, open to applicants in the United States.

📋 Description

• Design, develop, and sustain extensive benchmarking frameworks that encompass OS, kernel, and application layers.

• Profile workloads across CPU, memory, I/O, network, and accelerator (GPU/NPU) subsystems to uncover bottlenecks and optimization possibilities.

• Establish and take ownership of performance baselines across CIQ's suite of products and solutions.

• Utilize AI-assisted tools and agentic workflows to expedite profiling, analysis, and identification of root causes.

• Construct and manage automated performance regression-detection pipelines that are integrated into CI/CD workflows using Fuzzball.

• Identify, triage, and address regressions in user space, kernel space, and application layers with promptness and diligence.

• Collaborate with engineering teams to determine the root causes of regressions stemming from upstream kernel alterations, compiler updates, or library changes.

• Promote proactive performance enhancements—rather than just reactive fixes—to maintain CIQ solutions' competitive edge across every stack layer.

• Oversee core operating system performance: tuning kernel subsystems (scheduler, memory management, I/O, networking), reducing system call overhead, and optimizing user space libraries and runtimes.

• Identify and enact kernel-level improvements, including patches, configuration adjustments, and contributions to upstream that yield significant performance gains for CIQ's customer workloads.

• Optimize AI inference and training workloads, including LLM serving, model parallelism, and accelerator utilization.

• Fine-tune performance for HPC workloads, including modeling, simulation, and tightly integrated parallel applications (MPI, OpenMP, etc.).

• Enhance general computing and service workloads, covering web services, databases, messaging systems, and other production software operating on CIQ's OS platform.

• Operate at all levels of the stack: compiler flags, kernel parameters, scheduler tuning, NUMA topology, memory allocation, and application-level algorithmic enhancements.

• Advocate for an AI-first engineering philosophy—leveraging AI tools, agents, and automation to boost personal productivity and the quality of performance insights.

• Identify and prioritize optimization avenues that have a direct impact on AI training throughput and inference latency/cost.

• Stay up-to-date with cutting-edge techniques in ML system performance, including quantization, batching strategies, kernel fusion, and hardware-software co-design.

• Cultivate in-depth knowledge of CIQ's Fuzzball platform—its architecture, scheduling, and workload execution model.

• Integrate performance benchmarks, regression tests, and user-facing workloads into Fuzzball-based pipelines.

• Contribute to the performance characterization of Fuzzball itself, ensuring minimal overhead and efficient scalability of the platform.

• Develop a comprehensive understanding of the full CIQ product portfolio, including Rocky Linux and RLC (and its variants), Fuzzball, Apptainer (formerly Singularity), and Warewulf, and of how performance considerations interconnect across these products.

• Collaborate closely with the engineering teams behind each product line to identify, prioritize, and implement performance enhancements that benefit customers throughout the entire CIQ ecosystem.

• Partner with product and customer success teams to translate real-world performance challenges into engineering priorities and measurable results.

• Document and communicate findings effectively—from low-level profiling data to high-level executive summaries.

• Contribute to technical publications, conference presentations, and thought leadership that bolsters CIQ's reputation for performance excellence.

⛳️ Requirements

• A profound, principled understanding of operating system internals—Linux kernel scheduler, memory subsystem, I/O stack, and networking.

• Proven experience in identifying and resolving performance regressions across kernel and user space within production environments.

• Hands-on proficiency with profiling and tracing tools: perf, eBPF/bpftrace, flame graphs, VTune, Nsight, strace, ftrace, and similar.

• Strong foundation in AI/ML workload performance—including inference optimization (TensorRT, ONNX, vLLM, or similar), training efficiency, and GPU/accelerator utilization.

• Experience with HPC workloads: MPI, OpenMP, parallel filesystems, RDMA/InfiniBand, and job schedulers (Slurm, PBS, etc.).

• Familiarity with modern AI-centric development workflows and comfort in using LLM-based tools to enhance engineering productivity.

• Experience in developing automated performance testing and regression detection pipelines within CI/CD settings.

• Exceptional analytical abilities—capable of forming hypotheses, designing experiments, and deriving actionable conclusions from complex datasets.

• Strong verbal and written communication skills; adept at presenting findings to both highly technical audiences and business stakeholders.

• A collaborative, humble, and continuously learning mindset—paired with the confidence to advocate for performance as a paramount engineering concern.

🏝️ Benefits

• Medical, dental, and vision insurance.

• Flexible paid time off.

• Employee stock options.

• Remote work; no travel required for most positions.

Senior/Principal Performance Engineer
