This is a fully remote position, open to applicants in Switzerland.

📋 Description

• Lead the charge in innovating model serving and inference architectures for cutting-edge AI systems.

• Prioritize the optimization of model deployment and inference strategies to achieve highly responsive, efficient, and scalable performance in real-world applications.

• Engage with a diverse array of systems, ranging from resource-efficient models tailored for limited hardware settings to intricate, multi-modal architectures that combine data types such as text, images, and audio.

• Employ a hands-on, research-oriented methodology to devise, test, and implement groundbreaking serving strategies and inference algorithms.

• Construct robust inference pipelines, set comprehensive performance metrics, and pinpoint and resolve bottlenecks within production environments.

• Deliver high-throughput, low-latency, memory-efficient, and scalable AI performance that provides significant value in dynamic, real-world contexts.

⛳️ Requirements

• A degree in Computer Science or a related discipline.

• Ideally, a PhD in NLP, Machine Learning, or a related field, supported by a strong record in AI R&D (including notable publications in A* conferences).

• Must possess knowledge of Metal Shading Language (MSL).

• Proven experience in low-level kernel optimizations and inference optimization on mobile devices is crucial.

• Your contributions should have resulted in measurable enhancements in inference latency, throughput, and memory footprint for domain-specific applications, particularly on resource-constrained devices and edge platforms.

• A thorough understanding of contemporary model serving architectures and inference optimization strategies is essential.

• Strong expertise in writing GPU kernels for mobile devices (e.g., smartphones) is required.

• Practical experience in the development and deployment of end-to-end inference pipelines, from model optimization for efficient serving to integrating solutions on resource-constrained devices, is necessary.

• Demonstrated capability to apply empirical research to address challenges in model serving, including latency optimization, computational bottlenecks, and memory limitations.

• Proficiency in designing robust evaluation frameworks and refining optimization strategies to continuously improve inference performance and system efficiency.

• Experience with distributed inference systems: designing and optimizing high-performance inference engines using methods such as tensor parallelism, pipeline parallelism, and expert parallelism to serve large models on GPU clusters.

• A deep understanding of the mathematics and structure underlying Diffusion Models and Vision Transformers.

• Familiarity with techniques such as pruning, quantization, FlashAttention, KV caching, and speculative decoding (e.g., EAGLE), among others.

🏝️ Benefits

• Flexible work arrangements.

• Professional development opportunities.

AI Research Engineer – Kernel, Inference Optimization
