NEW - Posted 4 hours ago

Job ID: JOB_ID_3365

Key Responsibilities:

  • Develop and maintain monitoring and alerting solutions to support the reliability and availability of enterprise applications.
  • Enhance existing alerting strategies to ensure alerts are actionable and reduce dependency on continuous manual monitoring.
  • Implement proactive detection mechanisms to identify potential performance or stability issues before they affect users.
  • Build, enhance, and maintain real-time monitoring dashboards to provide visibility into system health and performance.
  • Define, track, and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure service reliability.
  • Collaborate with engineering and operations teams to support incident investigation, triage, and resolution.
  • Perform root cause analysis following incidents and contribute to remediation and continuous improvement initiatives.
  • Identify operational inefficiencies and implement automation solutions to reduce manual tasks and operational toil.
  • Contribute to improving observability frameworks, helping engineering teams adopt better monitoring practices.
  • Assist in the development of operational runbooks and reliability best practices.

Required Qualifications:

  • 5+ years of experience working in Site Reliability Engineering or DevOps.
  • Strong experience with monitoring and alerting platforms, particularly Splunk.
  • Demonstrated experience implementing automation solutions to improve operational efficiency and system uptime.
  • Hands-on experience diagnosing system issues, identifying root causes, and implementing long-term solutions.
  • Experience designing and maintaining monitoring dashboards and operational metrics.
  • Practical knowledge of SLI/SLO frameworks and reliability engineering principles.
  • Experience participating in incident response and service restoration activities.
  • Strong scripting or automation skills to support operational improvements.
  • Ability to identify opportunities for reducing operational toil through automation and improved monitoring strategies.

Core Competencies:

  • Site Reliability Engineering (SRE) practices
  • Monitoring, alerting, and observability
  • Splunk administration and dashboard development
  • Automation and scripting
  • Operational management and service reliability
  • Risk assessment and mitigation

Special Requirements

Visa: GC/USC only. A LinkedIn profile is a must. Local candidates only.


Compensation & Location

Salary: $120,000 – $160,000 per year (Estimated)

Location: Atlanta, GA


Recruiter / Company – Contact Information

Email: sunny@sourceinfotech.com


