This is a fully remote position, open to applicants in the Netherlands.

📋 Description

• Design and implement model serving architectures that achieve high throughput and minimal latency.

• Ensure efficient operation of pipelines across various environments, including resource-limited devices and edge platforms.

• Set clear performance benchmarks for latency and memory utilization.

• Conduct, manage, and oversee controlled inference tests.

• Monitor key performance indicators such as response latency and memory usage.

• Document iterative findings and evaluate results against established benchmarks.

• Assess computational efficiency and identify bottlenecks within the serving pipeline.

• Collaborate with cross-functional teams to integrate optimized frameworks into production systems.

• Establish success metrics aimed at enhancing performance and scalability.

⛳️ Requirements

• A degree in Computer Science or a related discipline.

• A PhD in NLP, Machine Learning, or a related field is preferred, along with a strong track record in AI R&D, including notable publications at top-tier conferences.

• Familiarity with Metal Shading Language (MSL).

• Proficiency in writing custom compute shaders from the ground up.

• Demonstrated experience in low-level kernel optimizations and inference enhancements for mobile devices.

• A record of contributions that improved inference latency, throughput, and memory efficiency for specific applications.

• A comprehensive understanding of contemporary model serving architectures and inference optimization strategies.

• Strong skills in writing GPU kernels for mobile platforms.

• Hands-on experience in developing and deploying end-to-end inference pipelines.

• Ability to apply empirical research to address challenges in model serving.

• Skilled in designing robust evaluation frameworks and refining optimization methods.

• Experience with distributed inference systems that use tensor parallelism, pipeline parallelism, and expert parallelism.

• Knowledge of pruning, quantization, FlashAttention, KV caching, and speculative decoding (e.g., EAGLE).

🏝️ Benefits

• Work remotely from any location around the globe.

• Opportunity to innovate within the fintech sector.

• Collaborate with talent from around the world.

• Competitive compensation packages available.

• Flexible working arrangements offered.

AI Research Engineer – Kernel, Inference Optimization
