Posted 2 hours ago
Job ID: JOB_ID_3365
Key Responsibilities:
- Develop and maintain monitoring and alerting solutions to support the reliability and availability of enterprise applications.
- Enhance existing alerting strategies to ensure alerts are actionable and reduce dependency on continuous manual monitoring.
- Implement proactive detection mechanisms to identify potential performance or stability issues before they affect users.
- Build, enhance, and maintain real-time monitoring dashboards to provide visibility into system health and performance.
- Define, track, and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure service reliability.
- Collaborate with engineering and operations teams to support incident investigation, triage, and resolution.
- Perform root cause analysis following incidents and contribute to remediation and continuous improvement initiatives.
- Identify operational inefficiencies and implement automation solutions to reduce manual tasks and operational toil.
- Contribute to improving observability frameworks, helping engineering teams adopt better monitoring practices.
- Assist in the development of operational runbooks and reliability best practices.
Required Qualifications:
- 5+ years of experience working in Site Reliability Engineering or DevOps.
- Strong experience with monitoring and alerting platforms, particularly Splunk.
- Demonstrated experience implementing automation solutions to improve operational efficiency and system uptime.
- Hands-on experience diagnosing system issues, identifying root causes, and implementing long-term solutions.
- Experience designing and maintaining monitoring dashboards and operational metrics.
- Practical knowledge of SLI/SLO frameworks and reliability engineering principles.
- Experience participating in incident response and service restoration activities.
- Strong scripting or automation skills to support operational improvements.
- Ability to identify opportunities for reducing operational toil through automation and improved monitoring strategies.
Core Competencies:
- Site Reliability Engineering (SRE) practices
- Monitoring, alerting, and observability
- Splunk administration and dashboard development
- Automation and scripting
- Operational management and service reliability
- Risk assessment and mitigation
Special Requirements
Visa: Green Card holders or U.S. citizens (GC/USC) only. A LinkedIn profile is required. Local candidates only.
Compensation & Location
Salary: $120,000 – $160,000 per year (Estimated)
Location: Atlanta, GA
Recruiter / Company – Contact Information
Email: sunny@sourceinfotech.com