Remotery

AI Research Engineer – Kernel, Inference Optimization

Posted Jun 4

This is a fully remote position, open to applicants in the Netherlands.

📋 Description

• Design and implement model serving architectures that achieve high throughput and minimal latency.

• Ensure efficient operation of pipelines across various environments, including resource-limited devices and edge platforms.

• Set clear performance benchmarks for latency and memory utilization.

• Conduct, manage, and oversee controlled inference tests.

• Monitor key performance indicators such as response latency and memory usage.

• Document iterative findings and evaluate results against established benchmarks.

• Assess computational efficiency and identify bottlenecks within the serving pipeline.

• Collaborate with cross-functional teams to integrate optimized frameworks into production systems.

• Establish success metrics aimed at enhancing performance and scalability.


⛳️ Requirements

• A degree in Computer Science or a related discipline.

• Preferably a PhD in NLP, Machine Learning, or a related field, along with a strong record in AI R&D (with notable publications in top-tier conferences).

• Familiarity with Metal Shading Language (MSL).

• Proficient in creating custom compute shaders from the ground up.

• Demonstrated experience in low-level kernel optimizations and inference enhancements for mobile devices.

• Contributions should have led to improvements in inference latency, throughput, and memory efficiency for specific applications.

• A comprehensive understanding of contemporary model serving architectures and inference optimization strategies.

• Strong skills in writing GPU kernels for mobile platforms.

• Hands-on experience in developing and deploying end-to-end inference pipelines.

• Ability to apply empirical research to address challenges in model serving.

• Skilled in designing robust evaluation frameworks and refining optimization methods.

• Experience with distributed inference systems that use tensor parallelism, pipeline parallelism, and expert parallelism.

• Knowledge of pruning, quantization, FlashAttention, KV caching, and speculative decoding (EAGLE).


🏝️ Benefits

• Work remotely from any location around the globe.

• Opportunity to innovate within the fintech sector.

• Collaborate with talent from around the world.

• Competitive compensation packages available.

• Flexible working arrangements offered.

