
AI Research Engineer – Kernel & Inference Optimization
Posted May 30

This is a fully remote position, open to applicants in North America.
• Lead the charge in innovating model serving and inference architectures for cutting-edge AI systems.
• Concentrate on enhancing model deployment and inference methodologies.
• Engage with a diverse range of systems, from efficient models to intricate, multi-modal architectures.
• Design, test, and execute innovative serving strategies and inference algorithms.
• Create robust inference pipelines, set performance benchmarks, and address bottlenecks within production settings.
• Deliver scalable AI performance with high throughput, low latency, and a low memory footprint that yields significant value.
• A degree in Computer Science or a related discipline.
• A PhD in NLP, Machine Learning, or a comparable field is preferred, backed by a strong track record in AI research and development (with notable publications at A* conferences).
• Familiarity with Metal Shading Language (MSL) is essential.
• Demonstrated experience in low-level kernel optimizations and inference enhancements on mobile devices is crucial.
• Your work should have resulted in quantifiable improvements in inference latency, throughput, and memory usage for specialized applications, especially on devices with limited resources and edge platforms.
• A thorough understanding of contemporary model serving architectures and inference optimization strategies is required.
• Strong proficiency in developing GPU kernels for mobile devices (e.g., smartphones).
• Hands-on experience in creating and deploying end-to-end inference pipelines, from model optimization for efficient serving to integration on resource-limited devices, is necessary.
• Proven ability to leverage empirical research to tackle challenges in model serving, including latency reduction, computational bottlenecks, and memory limitations.
• Skilled in designing sound evaluation frameworks and refining optimization strategies to continually advance inference performance and system efficiency.
• Experience in Distributed Inference Systems: Designing and refining high-performance inference engines using techniques such as Tensor Parallelism, Pipeline Parallelism, and Expert Parallelism to manage large models on GPU clusters.
• Profound understanding of the mathematics and framework behind Diffusion Models and Vision Transformers.
• Health insurance
• Flexible working hours
• Paid time off
• Professional development opportunities
Tether.to