Posted 23 hours ago

Job ID: 3179874

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) with more than 12 years of experience to join our team. This role requires a deep understanding of how to maintain robust, scalable, and highly available production systems across cloud environments. The ideal candidate has a strong background in AWS, Kubernetes, and Linux, along with extensive Python scripting experience. The position offers the opportunity to work on critical infrastructure, optimize performance, and keep complex distributed systems running reliably.

Key Responsibilities:

Design, implement, and maintain highly available, scalable, and fault-tolerant production systems on AWS and/or GCP platforms.
Lead the management, maintenance, and debugging of Kubernetes clusters, ensuring optimal performance and reliability for containerized applications developed in Golang, Java, and Python.
Develop and implement automation scripts using Python and other scripting languages to streamline operational tasks, improve system efficiency, and reduce manual intervention.
Proactively monitor system health, performance, and security using a variety of advanced monitoring solutions such as CloudWatch, Stackdriver, Prometheus, Thanos, Graphite, Grafana, ELK, Alert Logic, and Datadog.
Implement and manage robust logging service solutions to facilitate effective troubleshooting, root cause analysis, and compliance auditing.
Drive continuous integration and continuous delivery (CI/CD) practices, leveraging tools like Jenkins, Travis CI, and CircleCI to ensure rapid and reliable software deployments.
Utilize infrastructure as code (IaC) software, including Terraform, AWS Cloud Development Kit (CDK), Google Cloud Deployment Manager, and CloudFormation, to manage and provision infrastructure efficiently.
Collaborate with development teams to optimize application performance, identify bottlenecks, and implement solutions for improved reliability and scalability.
Provide expert guidance and support for critical infrastructure components such as Kafka, Spark, Storm, Cassandra, Elasticsearch, PostgreSQL, Redis (ElastiCache), ZooKeeper, Nginx, and AWS S3 / Google Cloud Storage.
Participate in on-call rotations to respond to and resolve critical incidents, ensuring minimal downtime and rapid recovery.
Conduct post-incident reviews to identify areas for improvement and implement preventative measures.
Stay abreast of emerging technologies and best practices in SRE, DevOps, and cloud computing, continuously seeking opportunities to enhance our infrastructure and processes.

Required Skills and Qualifications:

Minimum of 12 years of hands-on experience in Site Reliability Engineering or a similar role focused on production operations and infrastructure.
Strong expertise in AWS cloud services, including EC2, S3, RDS, Lambda, VPC, and IAM. Experience with GCP is a significant plus.
Extensive experience with Kubernetes for container orchestration, including cluster setup, management, and troubleshooting.
Proficiency in Linux operating systems, including system administration, shell scripting, and performance tuning.
Advanced scripting skills in Python are essential for automation and tooling development.
Demonstrated experience maintaining production systems with high availability and disaster recovery strategies.
Solid understanding of distributed systems, microservices architectures, and their operational challenges.
Familiarity with data stores, messaging systems, and data processing frameworks such as Kafka, Spark, Storm, Cassandra, Elasticsearch, PostgreSQL, Redis (ElastiCache), and ZooKeeper.
Experience with web servers and proxies like Nginx.
Proficiency with Infrastructure as Code tools (Terraform, CloudFormation, Google Cloud Deployment Manager).
Strong background in CI/CD pipelines and tools (Jenkins, Travis CI, CircleCI).
Expertise in monitoring and logging solutions (CloudWatch, Stackdriver, Prometheus, Grafana, ELK, Datadog).
Excellent problem-solving abilities, with a methodical approach to debugging complex issues.
Strong communication and collaboration skills, with the ability to work effectively in a fast-paced, remote-friendly environment.
Ability to travel for in-person interviews in Plano, TX, or California as required.

This is a challenging yet rewarding opportunity for a seasoned SRE to contribute to cutting-edge technology and make a substantial impact on our infrastructure’s reliability and performance.

Special Requirements

In-person interview required in Plano, TX or California.

Compensation & Location

Salary: $140,000 – $190,000 per year (Estimated)

Location: Sunnyvale, CA

Recruiter / Company – Contact Information

Recruiter / Employer: Tanisha Systems Inc

Email: abhi.b@tanishasystems.com

