This is a fully remote position, open to applicants in Spain.

📋 Description

• Innovate within model serving and inference architectures for cutting-edge AI systems.

• Concentrate on enhancing model deployment and inference techniques to achieve highly responsive, efficient, and scalable performance.

• Engage with a diverse range of systems, from resource-efficient models suited for constrained hardware to intricate, multi-modal architectures.

• Develop resilient inference pipelines, establish thorough performance metrics, and identify and resolve bottlenecks.

• Deliver high-throughput, low-latency, memory-efficient, and scalable AI performance that provides significant value in dynamic, real-world contexts.

• Conceptualize and implement state-of-the-art model serving architectures that ensure high throughput and low latency while maximizing memory efficiency.

• Design, run, and monitor controlled inference tests in both simulated and live production settings.

• Monitor key performance indicators such as response latency, throughput, memory utilization, and error rates.

• Document iterative findings and benchmark results against established standards.

• Identify and curate high-quality test datasets and simulation scenarios that address real-world deployment challenges.

⛳️ Requirements

• A bachelor's degree in Computer Science or a related discipline.

• Preferably, a PhD in NLP, Machine Learning, or a related field, with a strong track record in AI research and development (including notable publications at top-tier conferences).

• Proficiency in Metal Shading Language (MSL).

• Experience with low-level kernel optimizations and inference enhancements on mobile devices is essential.

• Deep familiarity with contemporary model serving architectures and inference optimization methods is required.

• Strong capabilities in writing GPU kernels for mobile devices (e.g., smartphones) alongside an in-depth understanding of model serving frameworks and engines.

• Practical experience in creating and deploying comprehensive inference pipelines.

• Proven aptitude for applying empirical research to tackle challenges in model serving.

• Experience with distributed inference systems: designing and optimizing high-performance inference engines using techniques such as tensor parallelism, pipeline parallelism, and expert parallelism to serve large models on GPU clusters.

• A thorough understanding of the mathematics and structure underlying Diffusion Models and Vision Transformers.

🏝️ Benefits

• Health insurance.

• Flexibility to work from anywhere.

• Opportunities for professional development.

AI Research Engineer, Kernel & Inference Optimization
