Job ID: JOB_ID_8350
MLOps L2 Support Engineer Role Details
We are seeking a highly skilled MLOps L2 Support Engineer to provide critical 24/7 production support for our machine learning (ML) and data pipelines. This role demands on-call availability, including weekends, to ensure the continuous high availability and reliability of our ML workflows. The successful candidate will leverage their expertise with Dataiku, AWS, CI/CD pipelines, and containerized deployments to maintain and troubleshoot ML models in a production environment.
Key Responsibilities:
- Incident Management & Support: Provide L2 support for MLOps production environments, ensuring optimal uptime and reliability. Troubleshoot complex ML pipelines, data processing jobs, and API issues. Monitor logs, alerts, and performance metrics using tools like Dataiku, Prometheus, Grafana, or AWS CloudWatch. Perform root cause analysis (RCA) and resolve incidents within defined Service Level Agreements (SLAs). Escalate unresolved issues to L3 engineering teams as necessary.
- Dataiku Platform Management: Manage Dataiku DSS workflows, troubleshoot job failures, and optimize performance. Monitor and support Dataiku plugins, APIs, and automation scenarios. Collaborate effectively with Data Scientists and Data Engineers to debug ML model deployments. Perform version control and CI/CD integration for Dataiku projects.
- Deployment & Automation: Support CI/CD pipelines for ML model deployment using tools such as Bamboo and Bitbucket. Deploy ML models and data pipelines using Docker, Kubernetes, or Dataiku Flow. Automate monitoring and alerting for ML model drift, data quality issues, and performance degradation (see the sketch after this list).
- Cloud & Infrastructure Support: Monitor AWS-based ML workloads, including SageMaker, Lambda, ECS, S3, and RDS. Manage storage and compute resources for ML workflows. Support database connections, data ingestion, and ETL pipelines using technologies like SQL, Spark, and Kafka.
- Security & Compliance: Ensure secure access control for ML models and data pipelines. Support audit, compliance, and governance requirements for Dataiku and MLOps workflows. Respond promptly to security incidents related to ML models and data access.
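To give a feel for the automation this role covers, here is a minimal sketch that triggers a Dataiku scenario through the public dataikuapi client and publishes the outcome as a CloudWatch custom metric that an alarm can page on. The DSS URL, API key, project key, scenario id, and metric namespace are all hypothetical placeholders, and the exact behavior may vary by dataikuapi version.

```python
# Minimal sketch, assuming a reachable Dataiku DSS node and AWS credentials
# available to boto3; the DSS URL, API key, project key, scenario id, and
# CloudWatch namespace below are hypothetical placeholders.
import boto3
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "dss-api-key")
scenario = client.get_project("CHURN_MODEL").get_scenario("retrain_and_score")

# run_and_wait() blocks until the run finishes and raises if it does not succeed.
try:
    scenario.run_and_wait()
    failed = 0.0
except Exception:
    failed = 1.0

# Publish the outcome as a custom metric; a CloudWatch alarm on this metric
# can then page the on-call rotation.
boto3.client("cloudwatch").put_metric_data(
    Namespace="MLOps/DataikuScenarios",
    MetricData=[{
        "MetricName": "ScenarioFailed",
        "Dimensions": [{"Name": "Scenario", "Value": "retrain_and_score"}],
        "Value": failed,
        "Unit": "Count",
    }],
)
```

In practice the API key would be injected from a secret store rather than hard-coded, and the alarm threshold and escalation policy would live in CloudWatch or infrastructure-as-code.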
Required Skills & Experience:
- Experience: 5+ years in MLOps, Data Engineering, or Production Support.
- Dataiku DSS: Strong experience with Dataiku workflows, scenarios, plugins, and APIs.
- Cloud Platforms: Hands-on experience with AWS ML services such as SageMaker, Lambda, S3, RDS, ECS, and IAM.
- CI/CD & Automation: Familiarity with CI/CD tools like GitHub Actions, Jenkins, or Terraform.
- Scripting & Debugging: Proficiency in Python, Bash, and SQL for automation and debugging.
- Monitoring & Logging: Experience with monitoring tools like Prometheus, Grafana, CloudWatch, or the ELK Stack.
- Incident Response: Demonstrated ability to handle on-call support, weekend shifts, and SLA-based issue resolution.
Preferred Qualifications:
- Containerization: Experience with Docker, Kubernetes, or OpenShift.
- ML Model Deployment: Familiarity with TensorFlow Serving, MLflow, or Dataiku Model API.
- Data Engineering: Experience with Spark, Databricks, Kafka, or Snowflake.
- Certifications: ITIL Foundation, AWS ML certifications, or Dataiku certification.
Work Schedule & On-Call Requirements:
This role involves rotational on-call support, including weekends and nights. Shift-based monitoring for ML workflows and Dataiku jobs is required. Flexibility in work schedule is necessary to handle production incidents and critical ML model failures.
Senior Data Engineer Role Details
We are looking for a Senior Data Engineer to join our team and play a hands-on role in designing, building, and operating high-performance batch and streaming data platforms.
Key Responsibilities:
- Design, develop, and maintain large-scale batch and streaming pipelines using PySpark and Python.
- Build real-time and near real-time streaming applications with stateful processing, windowing, and checkpointing (see the first sketch after this list).
- Develop production-grade Python microservices for complex data transformations and business logic.
- Design and manage modern data lake architectures using Apache Iceberg on AWS S3, implementing schema evolution, partitioning, compaction, and time travel (see the second sketch after this list).
- Develop and deploy pipelines across various AWS services including S3, EMR, Glue, Lambda, Athena, Redshift, and Aurora.
- Optimize Spark workloads for performance, scalability, and cost efficiency.
- Implement robust monitoring, logging, alerting, and recovery mechanisms for production operations.
- Contribute to CI/CD pipelines, actively participate in architecture discussions, and uphold engineering best practices.
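As a first sketch of the streaming work described above, the following PySpark Structured Streaming job reads JSON events from a Kafka topic, applies a watermark plus a tumbling-window aggregation (stateful processing), and checkpoints state to S3 so the job can recover after failures. Broker addresses, topic name, schema, and paths are assumed placeholders, not anything specific to our platform.

```python
# Minimal sketch, assuming a Kafka topic "clickstream" with JSON events that
# carry an event_time field, and an S3 checkpoint path; all names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-agg").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder brokers
    .option("subscribe", "clickstream")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Stateful windowed aggregation: 5-minute tumbling windows with a 10-minute
# watermark, so late events are handled and old state can be purged.
counts = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "user_id")
    .count()
)

# Checkpointing to S3 makes the query restartable with consistent state.
query = (
    counts.writeStream.outputMode("update")
    .format("console")  # a real pipeline would write to a sink such as Iceberg
    .option("checkpointLocation", "s3://my-bucket/checkpoints/clickstream")
    .start()
)
query.awaitTermination()
```

And as a second sketch, these Spark SQL statements illustrate the Iceberg operations named above: a table with hidden partitioning, a metadata-only schema evolution, small-file compaction via the rewrite_data_files procedure, and time-travel reads. It assumes a SparkSession named spark already configured with an Iceberg catalog called lake; the database, table, snapshot id, and timestamp are placeholders.

```python
# Minimal sketch of Iceberg table maintenance with Spark SQL, assuming an
# existing SparkSession `spark` with an Iceberg catalog named "lake" on S3.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.orders (
        order_id BIGINT,
        order_ts TIMESTAMP,
        amount   DECIMAL(10, 2)
    )
    USING iceberg
    PARTITIONED BY (days(order_ts))  -- hidden partitioning on event time
""")

# Schema evolution is a metadata-only operation in Iceberg.
spark.sql("ALTER TABLE lake.db.orders ADD COLUMN currency STRING")

# Compact small files by rewriting data files (Iceberg Spark procedure).
spark.sql("CALL lake.system.rewrite_data_files(table => 'db.orders')")

# Time travel: query the table as of an earlier snapshot or timestamp
# (placeholder snapshot id and timestamp).
spark.sql("SELECT * FROM lake.db.orders VERSION AS OF 1234567890123456789").show()
spark.sql("SELECT * FROM lake.db.orders TIMESTAMP AS OF '2024-01-01 00:00:00'").show()
```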
What You’ll Bring:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related discipline.
- 10+ years of experience in IT with strong hands-on expertise in PySpark, Spark SQL, and distributed data processing.
- Advanced proficiency in Python for building scalable, production-grade data solutions and microservices.
- Proven experience building and running Kafka-based streaming applications in production environments.
- Deep understanding of streaming fundamentals, including stateful processing and fault tolerance.
- Hands-on experience with Apache Iceberg in production data lake environments.
- Solid experience with AWS data services (S3, EMR, Glue, Lambda, Redshift, Aurora).
- Advanced SQL skills and strong knowledge of data modeling and modern data lake architectures.
- Strong troubleshooting skills in distributed data systems with a focus on reliability and performance.
Special Requirements
- On-call and weekend support required for the MLOps L2 Support Engineer role.
- Hybrid work model with 3 days work-from-office (WFO) mandatory for the Senior Data Engineer role.
- Local candidates preferred for the Senior Data Engineer role.
Compensation & Location
Salary: $60 – $80 per hour
Location: Reading, PA
Recruiter / Company – Contact Information
Email: anth.kanithi@3sbc.com