
AI Research Engineer – Kernel & Inference Optimization
Posted May 21

This is a fully remote position, open to applicants in Ireland.
• Architect and implement advanced model serving frameworks that ensure high throughput and low latency while optimizing memory consumption.
• Guarantee the efficient operation of these pipelines across various environments, including those with limited resources and edge computing platforms.
• Set clear performance targets such as lower latency, faster token response, and reduced memory usage.
• Construct, execute, and oversee controlled inference tests in both simulated and live production scenarios.
• Monitor key performance metrics such as response latency, throughput, memory usage, and error rates, with a particular focus on metrics pertinent to resource-constrained devices.
• Document iterative findings and compare results against predefined benchmarks to validate performance across different platforms.
• Identify and curate high-quality test datasets and simulation scenarios specifically designed to address real-world deployment challenges, particularly on low-resource devices.
• Establish measurable criteria to ensure that these resources effectively assess model performance, latency, and memory utilization under diverse operational conditions.
• Examine computational efficiency and identify bottlenecks in the serving pipeline by tracking both processing and memory metrics.
• Resolve issues such as inefficient batch processing, network delays, and excessive memory consumption to enhance the serving infrastructure for scalability and reliability on resource-limited systems.
• Collaborate closely with cross-functional teams to integrate optimized serving and inference frameworks into production pipelines aimed at edge and on-device applications.
• Define explicit success metrics such as improved real-world performance, low error rates, strong scalability, and efficient memory utilization, with ongoing monitoring and iterative refinement.
• A degree in Computer Science or a related discipline.
• Ideally a PhD in NLP, Machine Learning, or a similar field, supported by a proven history in AI R&D (with notable publications in A* conferences).
• Must possess knowledge of Metal Shading Language (MSL).
• Essential experience in low-level kernel optimizations and inference refinement on mobile devices.
• Your contributions should have resulted in measurable advancements in inference latency, throughput, and memory footprint for domain-specific applications, especially on resource-constrained devices and edge platforms.
• A comprehensive understanding of contemporary model serving architectures and inference optimization strategies is required.
• Must have significant expertise in crafting GPU kernels for mobile devices (e.g., smartphones) along with a profound understanding of model serving frameworks and engines.
• Practical experience in creating and executing end-to-end inference pipelines, from optimizing models for efficient serving to deploying these solutions on resource-limited devices, is necessary.
• Demonstrated capability to apply empirical research to tackle challenges in model serving, such as latency optimization, computational bottlenecks, and memory limitations.
• Proficient in designing robust evaluation frameworks and iterating on optimization strategies to consistently advance inference performance and system efficiency.
• Experience with distributed inference systems: designing and refining high-performance inference engines using techniques such as tensor parallelism, pipeline parallelism, and expert parallelism to serve very large models on GPU clusters.
• A deep understanding of the mathematics and structure behind Diffusion Models and Vision Transformers.
• Familiarity with pruning, quantization, FlashAttention, KV caching, speculative decoding (e.g., EAGLE), etc.
• Health insurance
• 401(k) matching
• Flexible work hours
• Paid time off
Tether.to