Posted 23 hours ago

Job ID: 3179874

We are seeking a Senior Site Reliability Engineer (SRE) with over 12 years of experience to join our team. The role requires a deep understanding of how to maintain robust, scalable, and highly available production systems in cloud environments. The ideal candidate has a strong background in AWS, Kubernetes, and Linux, along with extensive Python scripting experience. The position offers an opportunity to contribute to critical infrastructure, optimize performance, and keep complex distributed systems running smoothly.

Key Responsibilities:

  • Design, implement, and maintain highly available, scalable, and fault-tolerant production systems on AWS and/or GCP platforms.
  • Lead the management, maintenance, and debugging of Kubernetes clusters, ensuring performance and reliability for containerized applications written in Go, Java, and Python.
  • Develop and implement automation scripts using Python and other scripting languages to streamline operational tasks, improve system efficiency, and reduce manual intervention.
  • Proactively monitor system health, performance, and security using monitoring tools such as CloudWatch, Stackdriver, Prometheus, Thanos, Graphite, Grafana, ELK, Alert Logic, and Datadog.
  • Implement and manage robust logging solutions to support troubleshooting, root cause analysis, and compliance auditing.
  • Drive continuous integration and continuous delivery (CI/CD) practices, leveraging tools like Jenkins, Travis CI, and CircleCI to ensure rapid and reliable software deployments.
  • Use infrastructure as code (IaC) tools, including Terraform, AWS Cloud Development Kit (CDK), Google Cloud Deployment Manager, and CloudFormation, to provision and manage infrastructure efficiently.
  • Collaborate with development teams to optimize application performance, identify bottlenecks, and implement solutions for improved reliability and scalability.
  • Provide expert guidance and support for critical infrastructure components such as Kafka, Spark, Storm, Cassandra, Elasticsearch, PostgreSQL, Redis (ElastiCache), ZooKeeper, Nginx, and object storage (AWS S3, Google Cloud Storage).
  • Participate in on-call rotations to respond to and resolve critical incidents, ensuring minimal downtime and rapid recovery.
  • Conduct post-incident reviews to identify areas for improvement and implement preventative measures.
  • Stay abreast of emerging technologies and best practices in SRE, DevOps, and cloud computing, continuously seeking opportunities to enhance our infrastructure and processes.

Required Skills and Qualifications:

  • Minimum of 12 years of hands-on experience in Site Reliability Engineering or a similar role focused on production operations and infrastructure.
  • Deep expertise in AWS services, including EC2, S3, RDS, Lambda, VPC, and IAM. Experience with GCP is a significant plus.
  • Extensive experience with Kubernetes for container orchestration, including cluster setup, management, and troubleshooting.
  • Proficiency in Linux operating systems, including system administration, shell scripting, and performance tuning.
  • Advanced scripting skills in Python are essential for automation and tooling development.
  • Demonstrated experience maintaining production systems with high availability and disaster recovery strategies.
  • Solid understanding of distributed systems, microservices architectures, and their operational challenges.
  • Familiarity with data stores, messaging systems, and data processing frameworks such as Kafka, Spark, Storm, Cassandra, Elasticsearch, PostgreSQL, Redis (ElastiCache), and ZooKeeper.
  • Experience with web servers and proxies like Nginx.
  • Proficiency with Infrastructure as Code tools (Terraform, CloudFormation, Google Cloud Deployment Manager).
  • Strong background in CI/CD pipelines and tools (Jenkins, Travis CI, CircleCI).
  • Expertise in monitoring and logging solutions (CloudWatch, Stackdriver, Prometheus, Grafana, ELK, Datadog).
  • Excellent problem-solving abilities, with a methodical approach to debugging complex issues.
  • Strong communication and collaboration skills, with the ability to work effectively in a fast-paced, remote-friendly environment.
  • Ability to travel for in-person interviews in Plano, TX, or California as required.

This is a challenging yet rewarding opportunity for a seasoned SRE to contribute to cutting-edge technology and make a substantial impact on our infrastructure’s reliability and performance.


Special Requirements

In-person interview required in Plano, TX or California.


Compensation & Location

Salary: $140,000 – $190,000 per year (Estimated)

Location: Sunnyvale, CA


Recruiter / Company – Contact Information

Recruiter / Employer: Tanisha Systems Inc

Email: abhi.b@tanishasystems.com


