This is a fully remote position, open to applicants in Brazil.

📋 Description

• Implement low-bit quantization techniques aimed at decreasing model size and inference latency for generative AI models (LLMs, VLMs, multimodal), while preserving accuracy and output quality.

• Use knowledge distillation to transfer capabilities from larger teacher models to smaller student models, enabling efficient multimodal reasoning across text, image, and audio inputs.

• Apply pruning methods to eliminate redundant parameters and attention heads, thus minimizing computational demands without compromising task performance.

• Evaluate the trade-offs between model efficiency (size, latency, memory) and accuracy across quantization, distillation, and pruning techniques, and suggest enhancements based on empirical data.

• Research and implement mixed-precision quantization and other advanced compression strategies (e.g., adaptive pruning schedules, distillation with intermediate feature matching) to improve the trade-off between accuracy and efficiency.

• Stay up-to-date with the latest advancements in model compression, focusing on emerging techniques for multimodal and generative architectures.

• Clearly document methodologies, experiments, and results to ensure reproducibility, facilitate internal collaboration, and enhance stakeholder communication.

• Write and publish technical papers at leading conferences (e.g., NeurIPS, ICML, ICLR, CVPR, ACL, AAAI) to advance model compression in multimodal AI.

⛳️ Requirements

• A degree in Computer Science or a related discipline.

• A PhD in NLP, Machine Learning, or a related field is preferred, along with a strong track record in AI R&D (including notable publications at A* conferences).

• Proficiency in PyTorch or a similar deep learning framework.

• Practical experience with model quantization, including both Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ).

• Research and practical experience in knowledge distillation for compressing large models into smaller, more efficient versions.

• Research and practical experience in model pruning for reducing large models to smaller, efficient alternatives.

• Strong understanding of neural network architectures and training methodologies, including transformers (e.g., LLMs, VLMs), backpropagation, optimization, and fine-tuning techniques.

• Familiarity with C++ is advantageous (especially for implementing low-level quantization kernels or inference optimizations).

🏝️ Benefits

• Flexible working arrangements.

• Opportunities for professional development.

AI Research Engineer, Model Compression, Quantization
