This is a fully remote position, open to applicants in Brazil.

📋 Description

• Propel innovation in model serving and inference architectures for cutting-edge AI systems.

• Concentrate on enhancing model deployment and inference strategies to achieve highly responsive, efficient, and scalable performance across real-world applications.

• Engage with a diverse array of systems, from resource-efficient models tailored for limited hardware environments to intricate, multi-modal architectures that incorporate text, images, and audio.

• Embrace a hands-on, research-oriented approach to create, test, and implement novel serving strategies and inference algorithms.

• Construct robust inference pipelines, establish comprehensive performance metrics, and identify and mitigate bottlenecks in production settings.

• Deliver high-throughput, low-latency, memory-efficient, and scalable AI performance that provides tangible benefits in dynamic, real-world contexts.

⛳️ Requirements

• Possess a degree in Computer Science or a related discipline.

• Preferably hold a PhD in NLP, Machine Learning, or a related field, supported by a robust record in AI R&D (with reputable publications in A* conferences).

• Must have expertise in Metal Shading Language (MSL).

• Proven experience in low-level kernel optimizations and inference optimization on mobile devices is crucial.

• Your contributions should have resulted in quantifiable enhancements in inference latency, throughput, and memory footprint for domain-specific applications, particularly on resource-constrained devices and edge platforms.

• A thorough understanding of contemporary model serving architectures and inference optimization techniques is essential.

• Must exhibit strong proficiency in writing GPU kernels for mobile devices (i.e., smartphones) alongside a deep comprehension of model serving frameworks and engines.

• Practical experience in developing and deploying end-to-end inference pipelines, from optimizing models for efficient serving to integrating these solutions on resource-constrained devices, is required.

• Demonstrated capacity to apply empirical research to tackle challenges in model serving, such as optimizing latency, addressing computational bottlenecks, and managing memory constraints.

• Experience designing and optimizing distributed inference systems, employing techniques like Tensor Parallelism, Pipeline Parallelism, and Expert Parallelism to manage large models on GPU clusters.

• Comprehensive understanding of the mathematics and structure underlying Diffusion Models and Vision Transformers.

• Familiarity with Pruning, Quantization, FlashAttention, KV Caching, Speculative Decoding (EAGLE), and similar concepts.

🏝️ Benefits

• Our team is a global talent powerhouse, collaborating remotely from every corner of the world.

AI Research Engineer – Kernel, Inference Optimization
