Job ID: JOB_ID_233

Role Summary

We are looking for a seasoned ML Ops Engineer with a strong background in Site Reliability Engineering (SRE) to join our team in either Austin, TX or Sunnyvale, CA. This is a high-impact, onsite role that requires a deep understanding of how to bridge the gap between machine learning development and production-grade reliability. In the 2026 landscape, the ability to scale Large Language Models (LLMs) and complex ML pipelines is paramount. You will be responsible for designing, building, and maintaining the infrastructure that powers our most advanced AI initiatives, ensuring high availability, scalability, and performance.

Core Responsibilities

  • MLOps Pipeline Development: Design and implement robust MLOps pipelines using frameworks such as Kubeflow, MLflow, or Airflow. You will automate the end-to-end lifecycle of machine learning models, from data ingestion and training to deployment and monitoring in cloud environments (AWS, Azure, or GCP).
  • Kubernetes Orchestration: Manage and optimize large-scale Kubernetes clusters to support containerized ML workloads. You will be responsible for resource allocation, scaling strategies, and ensuring the reliability of microservices architectures.
  • Infrastructure as Code & SRE: Apply SRE principles to ML infrastructure. This includes implementing CI/CD pipelines, managing infrastructure via code, and ensuring that systems are observable, scalable, and resilient. You will handle Linux administration and performance tuning for high-demand environments.
  • Data Systems Management: Work with various database systems, including MongoDB and Apache Solr. You will ensure that data pipelines are efficient and that the underlying storage solutions can handle the throughput required for real-time ML inference.
  • LLM & Model Integration: Support the deployment and scaling of Large Language Models (LLMs). You will collaborate with data scientists to translate business requirements into technical specifications, ensuring that models are integrated seamlessly into the broader software ecosystem.
  • Collaboration & Improvement: Work closely with cross-functional stakeholders to identify bottlenecks in the ML lifecycle. You will propose and implement system improvements that enhance developer productivity and model performance.

Technical Requirements

  • 6+ years of experience in ML Ops or a related DevOps/SRE role.
  • Expert-level proficiency in Python and Kubernetes.
  • Strong experience with AWS cloud services and containerization (Docker).
  • Proven track record of building and maintaining CI/CD pipelines for machine learning.
  • Familiarity with workflow orchestration tools like Argo or Airflow.
  • Solid understanding of Linux administration and network protocols.

Behavioral & Soft Skills

  • Excellent communication skills with the ability to work effectively in a team-oriented environment.
  • Strong problem-solving skills and a proactive approach to system optimization.
  • Ability to manage multiple priorities in a fast-paced, onsite environment.

Qualifications

  • Experience building custom integrations between cloud-based systems using APIs.
  • Experience with open-source ML tools and frameworks.
  • Ability to translate complex business needs into scalable technical architectures.

Special Requirements

Fully onsite, 5 days per week


Compensation & Location

Salary: $170,000 – $230,000 per year (Estimated)

Location: Austin, TX


Recruiter / Company – Contact Information

Recruiter / Employer: Tanisha Systems Inc.

Email: Himanshu.Pandey@tanishasystems.com


Interested in this position?
Apply via Email
