Software Engineer - Codec Avatar ML Compute Team

Pittsburgh, Pennsylvania


Employer: Meta
Industry: 
Salary: Competitive
Job type: Full-Time

Reality Labs Research (RL-R) brings together a diverse and highly interdisciplinary team of researchers and engineers to create the future of augmented and virtual reality. On the Codec Avatars ML Compute team, you'll build tools, libraries, and frameworks that help researchers collaborate and advance their research toward the generation of Codec Avatars. Our team cultivates an honest and considerate environment where self-motivated individuals thrive. We encourage a strong sense of ownership and embrace the ambiguity that comes with working on the frontiers of research. In this software engineer role, you will serve as the point of contact for Meta's research GPU superclusters, managing and optimizing compute resources to enable groundbreaking research in relightable avatars, full-body avatars, and generative AI for codec avatars.

Software Engineer - Codec Avatar ML Compute Team Responsibilities


  • Build, scale, and secure the HPC clusters within Meta research labs, a heterogeneous environment containing diverse operating systems and applications

  • Provide on-call support and lead incident root cause analysis through multiple infrastructure layers (compute, storage, network) for HPC clusters and act as a final escalation point

  • Collaborate in a diverse team environment across multiple scientific and engineering disciplines, making the architectural tradeoffs required to rapidly deliver software and infrastructure solutions

  • Find ways to leverage the scale and complexity of the larger Meta production infrastructure to solve problems for Reality Labs researchers

  • Provide guidance to other engineers on best practices to build mature services which are highly available, reliable, secure, and scalable

  • Work independently, manage multiple large projects simultaneously, and prioritize the team roadmap and deliverables by balancing required effort with resulting impact


Minimum Qualifications


  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.

  • Experience in automating the management of infrastructure and services

  • 3+ years of experience in distributed system performance measurement, logging, and optimization

  • 3+ years of experience coding in at least one of the following languages: C++, Python, Rust, or Go

  • Thorough understanding of Linux operating system internals, including the networking subsystem

  • Experience with Python package and environment management tools such as Conda or venv

  • Experience writing system-level infrastructure, libraries, and applications

  • Experience with software development practices such as source control, code reviews, unit testing, debugging, and profiling

  • Proven track record of shipping software

  • Experience in developing performant software and systems


Preferred Qualifications


  • Experience managing HPC schedulers and orchestrators such as Slurm, Kubernetes, or LSF

  • Prior experience building out HPC clusters, including compute, storage, networking, operating systems, schedulers, and stakeholder discussions

  • Prior experience in cluster on-call operations, including troubleshooting server/scheduler/storage errors, maintaining compute/storage environments, libraries, and tools, helping onboard users to the cluster, and answering general questions from users

  • Prior experience in cluster coordination and strategy planning, including collecting and understanding user needs, developing tools to improve user experience, providing guidance on best practices, coordinating the distribution of compute/storage resources, forecasting compute/storage needs, and developing long-term strategies for user experience, compute, and storage

  • Prior experience building tooling for monitoring and telemetry

  • Prior experience supporting configuration management in a multi-region environment

  • Prior experience optimizing multi-tenant HPC clusters for performance and maintenance

  • Prior experience with containerization and virtualization technologies such as Docker or virtual machines

  • Prior experience building services

  • Prior experience building PaaS or internal clouds

  • Prior experience in developing/managing distributed network file systems

  • Prior academic or development experience with machine learning and/or deep learning

  • Prior experience with ML libraries such as PyTorch, TensorFlow, or cuDNN

  • Prior experience in GPGPU development with CUDA, OpenCL or DirectCompute

  • Prior experience in network security

  • Experience in database and data management systems at scale

  • Familiarity with Linux observability tools such as eBPF


Created: 2024-04-29
Reference: 697439102331553
Country: United States
State: Pennsylvania
City: Pittsburgh
ZIP: 15216

