Job ID: 3179874
We are seeking a Senior Site Reliability Engineer (SRE) with at least 12 years of experience to join our team. The role requires a deep understanding of how to maintain robust, scalable, and highly available production systems in cloud environments. The ideal candidate has a strong background in AWS, Kubernetes, and Linux, along with extensive Python scripting skills. The position offers the opportunity to work on critical infrastructure, optimize performance, and keep complex distributed systems running reliably.
Key Responsibilities:
- Design, implement, and maintain highly available, scalable, and fault-tolerant production systems on AWS and/or GCP platforms.
- Lead the management, maintenance, and debugging of Kubernetes clusters, ensuring performance and reliability for containerized applications written in Go, Java, and Python.
- Develop and implement automation scripts using Python and other scripting languages to streamline operational tasks, improve system efficiency, and reduce manual intervention.
- Proactively monitor system health, performance, and security using a variety of advanced monitoring solutions such as CloudWatch, Stackdriver, Prometheus, Thanos, Graphite, Grafana, ELK, Alert Logic, and Datadog.
- Implement and manage robust logging service solutions to facilitate effective troubleshooting, root cause analysis, and compliance auditing.
- Drive continuous integration and continuous delivery (CI/CD) practices, leveraging tools like Jenkins, Travis CI, and CircleCI to ensure rapid and reliable software deployments.
- Use infrastructure as code (IaC) tools, including Terraform, AWS Cloud Development Kit (CDK), Google Cloud Deployment Manager, and CloudFormation, to provision and manage infrastructure efficiently.
- Collaborate with development teams to optimize application performance, identify bottlenecks, and implement solutions for improved reliability and scalability.
- Provide expert guidance and support for critical infrastructure components such as Kafka, Spark, Storm, Cassandra, Elasticsearch, PostgreSQL, Redis (ElastiCache), ZooKeeper, Nginx, and object storage (Amazon S3, Google Cloud Storage).
- Participate in on-call rotations to respond to and resolve critical incidents, ensuring minimal downtime and rapid recovery.
- Conduct post-incident reviews to identify areas for improvement and implement preventative measures.
- Stay abreast of emerging technologies and best practices in SRE, DevOps, and cloud computing, continuously seeking opportunities to enhance our infrastructure and processes.
Required Skills and Qualifications:
- Minimum of 12 years of hands-on experience in Site Reliability Engineering or a similar role focused on production operations and infrastructure.
- Deep expertise in AWS services, including EC2, S3, RDS, Lambda, VPC, and IAM. Experience with GCP is a significant plus.
- Extensive experience with Kubernetes for container orchestration, including cluster setup, management, and troubleshooting.
- Proficiency in Linux operating systems, including system administration, shell scripting, and performance tuning.
- Advanced Python scripting skills for automation and tooling development.
- Demonstrated experience maintaining production systems with high availability and disaster recovery strategies.
- Solid understanding of distributed systems, microservices architectures, and their operational challenges.
- Familiarity with data stores, messaging systems, and data processing frameworks such as Kafka, Spark, Storm, Cassandra, Elasticsearch, PostgreSQL, Redis (ElastiCache), and ZooKeeper.
- Experience with web servers and proxies like Nginx.
- Proficiency with Infrastructure as Code tools (Terraform, CloudFormation, Google Cloud Deployment Manager).
- Strong background in CI/CD pipelines and tools (Jenkins, Travis CI, CircleCI).
- Expertise in monitoring and logging solutions (CloudWatch, Stackdriver, Prometheus, Grafana, ELK, Datadog).
- Excellent problem-solving abilities, with a methodical approach to debugging complex issues.
- Strong communication and collaboration skills, with the ability to work effectively in a fast-paced, remote-friendly environment.
- Ability to travel for in-person interviews in Plano, TX, or California as required.
This is a challenging and rewarding opportunity for a seasoned SRE to make a substantial impact on the reliability and performance of our infrastructure.
Special Requirements
In-person interview required in Plano, TX or California.
Compensation & Location
Salary: $140,000 – $190,000 per year (Estimated)
Location: Sunnyvale, CA
Recruiter / Company – Contact Information
Recruiter / Employer: Tanisha Systems Inc
Email: abhi.b@tanishasystems.com