Posted 20 hours ago

Job ID: 3179744

Overview of the Site Reliability Engineer (SRE) ML Platform Role

K-Tek Resourcing is seeking an experienced Site Reliability Engineer (SRE) with a specialized focus on Machine Learning (ML) platforms, often referred to as an MLOps Engineer. This long-term onsite position is available in both Austin, TX, and Sunnyvale, CA, and offers the opportunity to contribute to the reliability, scalability, and performance of cutting-edge ML infrastructure. The successful candidate will help build and operate robust, production-grade ML systems using modern cloud-native technologies and MLOps best practices.

This role requires a strong background in MLOps, cloud architecture, containerization, and continuous integration/continuous deployment (CI/CD) pipelines, along with a solid understanding of various ML models, including Large Language Models (LLMs). You will collaborate closely with data scientists, ML engineers, and other software development teams to design, deploy, and maintain resilient ML pipelines and services that drive our client’s innovative solutions.

Key Responsibilities and Accountabilities

  • Design, deploy, and meticulously maintain scalable ML platforms utilizing industry-leading technologies such as Kubernetes, Docker, and comprehensive cloud services, with a primary focus on AWS.
  • Construct and operate end-to-end MLOps pipelines covering every stage from model training and validation through deployment and continuous monitoring in production.
  • Ensure the highest levels of availability, reliability, and performance for all ML production systems, proactively identifying and resolving potential bottlenecks.
  • Develop sophisticated automation tools and services primarily using Python to streamline operations and enhance efficiency across the ML lifecycle.
  • Implement and manage robust CI/CD pipelines specifically tailored for ML workloads and microservices architectures, ensuring rapid and reliable deployments.
  • Provide expert support for diverse ML workloads, including both traditional ML models and advanced Large Language Models (LLMs).
  • Collaborate effectively with data scientists to successfully productionize their models, optimize existing workflows, and integrate new capabilities.
  • Administer complex Linux systems and expertly troubleshoot infrastructure issues to maintain operational stability.
  • Design and implement cloud-native microservices and APIs specifically for ML applications, ensuring secure and efficient data flow.
  • Manage and integrate various data stores, including MongoDB, and search platforms like Apache Solr, to support ML data requirements.
  • Implement comprehensive monitoring, alerting, logging, and benchmarking solutions for all ML systems to ensure proactive issue detection and performance optimization.
  • Translate intricate business requirements into clear, actionable technical solutions that align with strategic objectives.
  • Actively contribute to the development and enforcement of best practices related to software testing, security protocols, and overall operational excellence within the ML ecosystem.

Required Experience and Technical Expertise

  • A minimum of 6 years of hands-on, progressive experience in MLOps, Site Reliability Engineering (SRE), or Platform Engineering roles.
  • Demonstrated strong proficiency in Python programming, with a focus on scripting, automation, and ML-related development.
  • Extensive practical experience with Kubernetes and other containerized environments, including Docker.
  • Solid foundational and practical knowledge of AWS cloud platforms; experience with Azure or GCP is also highly valued.
  • Proven experience with MongoDB for data storage and management.
  • Strong Linux administration skills, including scripting and troubleshooting.
  • Hands-on experience with microservices architectures and their deployment.
  • Proficiency in implementing and managing CI/CD pipelines for software and ML projects.
  • Working knowledge of various ML models and a practical understanding of Large Language Models (LLMs).
  • Experience in productionizing ML systems built with a variety of open-source tools and frameworks.

Preferred Qualifications

  • Prior experience with workflow orchestration tools such as Kubeflow, Apache Airflow, or Argo Workflows.
  • Experience in building custom integrations between various cloud-based systems using REST APIs.
  • A strong understanding of software testing methodologies, performance benchmarking, and continuous integration principles.
  • Exposure to advanced ML methodology and best practices in model development and deployment.
  • Demonstrated ability to design and implement comprehensive cloud-based ML solutions from concept to production.
  • Excellent communication skills, both written and verbal, with the ability to collaborate effectively across diverse technical and non-technical teams.

This is an exceptional opportunity for a seasoned SRE/MLOps professional to make a significant impact on critical ML infrastructure within a dynamic and innovative environment. Join us in shaping the future of reliable and scalable machine learning platforms.


Special Requirements

Strong hands-on ML platform experience is required. Visa sponsorship is not mentioned; assume US work authorization is required. Interview mode is not specified.


Compensation & Location

Salary: $140,000 – $200,000 per year (Estimated)

Location: Austin, TX


Recruiter / Company – Contact Information

Recruiter / Employer: K-Tek Resourcing LLC

Email: Pradeep.bhondve@ktekresourcing.com


