
AI Research Engineer, Model Compression, Quantization
Posted May 30

This is a fully remote position, open to applicants in Brazil.
• Implement low-bit quantization techniques aimed at decreasing model size and inference latency for generative AI models (LLMs, VLMs, multimodal), while preserving accuracy and output quality.
• Use knowledge distillation to transfer capabilities from larger teacher models to smaller student models, enabling efficient multimodal reasoning across text, image, and audio inputs.
• Apply pruning methods to eliminate redundant parameters and attention heads, thus minimizing computational demands without compromising task performance.
• Evaluate the trade-offs between model efficiency (size, latency, memory) and accuracy across quantization, distillation, and pruning techniques, and suggest enhancements based on empirical data.
• Research and implement mixed-precision quantization and other advanced compression strategies (e.g., adaptive pruning schedules, distillation with intermediate feature matching) to improve the trade-off between accuracy and efficiency.
• Stay up-to-date with the latest advancements in model compression, focusing on emerging techniques for multimodal and generative architectures.
• Clearly document methodologies, experiments, and results to ensure reproducibility, facilitate internal collaboration, and enhance stakeholder communication.
• Write and publish technical papers at leading conferences (e.g., NeurIPS, ICML, ICLR, CVPR, ACL, AAAI) to contribute to the advancement of model compression in multimodal AI.
• A degree in Computer Science or a related discipline.
• A PhD in NLP, Machine Learning, or a related field is preferred, supported by a strong record in AI R&D (with notable publications at A* conferences).
• Proficiency in PyTorch or a similar deep learning framework.
• Practical experience with model quantization, including both Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ).
• Research and practical experience in knowledge distillation for compressing large models into smaller, more efficient versions.
• Research and practical experience in model pruning for reducing large models to smaller, more efficient ones.
• Strong understanding of neural network architectures and training methodologies, including transformers (e.g., LLMs, VLMs), backpropagation, optimization, and fine-tuning techniques.
• Familiarity with C++ is advantageous (especially for implementing low-level quantization kernels or inference optimizations).
• Flexible working arrangements.
• Opportunities for professional development.