Job ID: JOB_ID_736

Position Summary: SRE for Machine Learning Platforms

We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) specializing in Machine Learning (ML) platforms to join our engineering team in Austin, TX, or Sunnyvale, CA. This role is critical for maintaining the reliability, scalability, and performance of our ML infrastructure. As an SRE, you will work at the intersection of software engineering and systems administration, focusing on the operational health of our MLOps pipelines and the underlying cloud infrastructure. The ideal candidate will have a strong background in Kubernetes, Python, and AWS, with a passion for automating complex workflows and supporting Large Language Models (LLMs).

Core Technical Responsibilities

  • Design, implement, and manage cloud-native solutions on AWS, ensuring high availability and fault tolerance for ML workloads.
  • Orchestrate and optimize Kubernetes clusters to support containerized microservices and ML model training/inference.
  • Develop and maintain robust MLOps pipelines to automate the deployment, monitoring, and scaling of machine learning models.
  • Use Python for advanced automation, tool development, and software engineering tasks within the SRE domain.
  • Administer and tune NoSQL databases like MongoDB and search engines like Apache Solr to ensure data integrity and low-latency access.
  • Build and manage CI/CD pipelines tailored for ML operations, integrating automated testing and security scans.
  • Perform Linux system administration, including performance tuning, kernel optimization, and security hardening.
  • Collaborate with data scientists to understand their tooling requirements and provide a stable, scalable platform for model development.

Operational Excellence and Innovation

In this role, you will define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for the ML platform and be responsible for meeting those SLOs. You will participate in on-call rotations and lead incident response efforts, conducting thorough post-mortems to prevent recurring issues. A significant portion of your time will be dedicated to “toil reduction”: identifying manual processes and replacing them with automated, self-healing systems. You will also stay abreast of the latest trends in AI and ML infrastructure, including the deployment and optimization of LLMs and the integration of modern observability tools like Prometheus and Grafana.

Qualifications and Requirements

Candidates must have at least 6 years of experience in MLOps, SRE, or a related DevOps role. Proficiency in Python and Kubernetes is mandatory. You should have a deep understanding of cloud architecture (AWS preferred; Azure or GCP are acceptable) and experience with CI/CD methodologies. Familiarity with ML frameworks and the software development lifecycle (SDLC) is essential. This is a hybrid role requiring three days per week onsite at our offices in Austin, TX, or Sunnyvale, CA. Strong problem-solving skills and the ability to work effectively in a collaborative, fast-paced environment are required.


Special Requirements

Candidates local to Austin, TX, or Sunnyvale, CA only; no OPT or CPT candidates; hybrid work model requiring 3 days onsite per week.


Compensation & Location

Salary: $65 per year

Location: Austin, TX


Recruiter / Company – Contact Information

Recruiter / Employer: Resource Consulting Services Inc. (R Consulting Inc)

Email: suryangi@rconsultinginc.com

