About Us:
Rivian and Volkswagen Group Technologies is a joint venture between two industry leaders with a clear vision for automotive's next chapter. From operating systems to zonal controllers to cloud and connectivity solutions, we're addressing the challenges of electric vehicles through technology that will set the standards for software-defined vehicles around the world.
The road to the future is uncharted. By combining our expertise across connectivity, AI, security and more, we'll map a new way forward. Working together, we'll create a future that's more connected, more intelligent, more sustainable for everyone.
Role Summary:
Join our dynamic team in the automotive industry as an AI Research Engineer specializing in Video Understanding and Vision Language Models (VLMs). Contribute to developing cutting-edge AI solutions for complex video analysis and multimodal interactions. In this role, you will focus on building and optimizing advanced algorithms for video understanding, VLM development, scene analysis, and real-time processing. Your work will enhance vehicle safety, improve user experiences through rich multimodal interactions, and enable innovative in-vehicle applications that leverage both visual and language data.
Responsibilities:
Develop and implement advanced video understanding algorithms, including action recognition, video captioning, and video question answering, integrating VLM approaches.
Design, train, and fine-tune Vision Language Models (VLMs) using large-scale multimodal datasets, particularly automotive and video-centric data.
Deploy and optimize VLM and video processing models for edge devices and real-time automotive applications, ensuring efficient performance and low latency.
Collaborate with cross-functional teams to integrate video understanding and VLM solutions into in-vehicle systems and infotainment platforms for enhanced user interaction.
Conduct cutting-edge research and stay up to date with the latest advancements in video understanding, VLM technologies, and multimodal learning.
Create scalable, maintainable pipelines for multimodal data processing, VLM training, and deployment.
Document workflows, algorithms, and results for internal and external stakeholders, with a focus on VLM and video analysis techniques.
Qualifications:
MS or PhD in Computer Science, Electrical Engineering, or a related field with a focus on Computer Vision, Natural Language Processing, or Multimodal Learning.
Strong programming skills in Python, with extensive experience in deep learning frameworks such as TensorFlow or PyTorch, including experience with VLM-specific libraries.
Proficiency in computer vision techniques, including CNNs, Transformers, video modeling, and multimodal architectures.
Hands-on experience with large-scale video data and VLM tasks such as video captioning, video question answering, and multimodal retrieval.
Familiarity with image and video processing libraries (e.g., OpenCV, scikit-image) and natural language processing libraries (e.g., Hugging Face Transformers).
Experience with hardware accelerators (e.g., GPUs, NPUs) for training and deploying complex VLM and video models.