This is a fully remote position, open to applicants in Ireland.

📋 Description

• Architect and implement advanced model serving frameworks that ensure high throughput and low latency while optimizing memory consumption.

• Guarantee the efficient operation of these pipelines across various environments, including those with limited resources and edge computing platforms.

• Set clear performance targets, such as lower latency, faster token response, and reduced memory usage.

• Construct, execute, and oversee controlled inference tests in both simulated and live production scenarios.

• Monitor key performance metrics such as response latency, throughput, memory usage, and error rates, with a particular focus on metrics pertinent to resource-constrained devices.

• Document iterative findings and compare results against predefined benchmarks to validate performance across different platforms.

• Identify and curate high-quality test datasets and simulation scenarios specifically designed to address real-world deployment challenges, particularly on low-resource devices.

• Establish measurable criteria to ensure that these resources effectively assess model performance, latency, and memory utilization under diverse operational conditions.

• Examine computational efficiency and identify bottlenecks in the serving pipeline by tracking both processing and memory metrics.

• Resolve issues such as inefficient batch processing, network delays, and excessive memory consumption to enhance the serving infrastructure for scalability and reliability on resource-limited systems.

• Collaborate closely with cross-functional teams to integrate optimized serving and inference frameworks into production pipelines aimed at edge and on-device applications.

• Define explicit success metrics, such as improved real-world performance, low error rates, strong scalability, and efficient memory utilization, and ensure ongoing monitoring and iterative refinement.

⛳️ Requirements

• A degree in Computer Science or a related discipline.

• Ideally a PhD in NLP, Machine Learning, or a similar field, supported by a proven track record in AI R&D (with notable publications in A* conferences).

• Must possess knowledge of Metal Shading Language (MSL).

• Experience in low-level kernel optimization and inference refinement on mobile devices is essential.

• Your contributions should have resulted in measurable improvements in inference latency, throughput, and memory footprint for domain-specific applications, especially on resource-constrained devices and edge platforms.

• A comprehensive understanding of contemporary model serving architectures and inference optimization strategies is required.

• Must have significant expertise in crafting GPU kernels for mobile devices (e.g., smartphones) along with a profound understanding of model serving frameworks and engines.

• Practical experience in creating and executing end-to-end inference pipelines, from optimizing models for efficient serving to deploying these solutions on resource-limited devices, is necessary.

• Demonstrated capability to apply empirical research to tackle challenges in model serving, such as latency optimization, computational bottlenecks, and memory limitations.

• Proficient in designing robust evaluation frameworks and iterating on optimization strategies to consistently advance inference performance and system efficiency.

• Experience with distributed inference systems: designing and refining high-performance inference engines using techniques like tensor parallelism, pipeline parallelism, and expert parallelism to manage massive models on GPU clusters.

• A deep understanding of the mathematics and structure behind Diffusion Models and Vision Transformers.

• Familiarity with pruning, quantization, FlashAttention, KV caching, speculative decoding (EAGLE), etc.

🏝️ Benefits

• Health insurance

• 401(k) matching

• Flexible work hours

• Paid time off

AI Research Engineer – Kernel & Inference Optimization
