Job ID: JOB_ID_5125
About the Role:
We are seeking an Observability Engineer with more than 10 years of dedicated experience to join our client’s team. This role is crucial for ensuring the availability, reliability, and performance of the client’s systems. You will design, implement, and maintain observability solutions built on tools such as Grafana, OpenSearch, and Splunk.
Key Responsibilities:
- Design, implement, and manage comprehensive observability solutions, including monitoring, logging, and tracing.
- Develop and maintain dashboards and visualizations using Grafana, OpenSearch Dashboards, and Splunk to provide real-time insights into system health and performance.
- Write complex queries using OpenSearch Query DSL and Splunk Search Processing Language (SPL) to extract, filter, and aggregate data for analysis and troubleshooting.
- Configure and manage alerting systems in Grafana, Splunk, and OpenSearch to proactively identify and respond to potential issues.
- Tune alert thresholds and notification channels to minimize noise and ensure timely resolution of incidents.
- Develop Python scripts for data extraction, transformation, and automation of monitoring and reporting workflows.
- Create and maintain Airflow DAGs for data pipelines, ensuring efficient data flow from observability platforms.
- Collaborate closely with SRE, DevOps, and platform engineering teams to integrate observability solutions into the development lifecycle.
- Document dashboards, alert logic, and query definitions to ensure knowledge sharing and maintainability.
- Troubleshoot data anomalies, missing fields, and inconsistent indexing patterns to ensure data integrity.
- Interpret and work with time-series data, logs, and event-driven datasets to identify performance bottlenecks and areas for improvement.
- Retrieve data from REST APIs over HTTP and work with JSON data structures.
Mandatory Skills:
- Core Technical Skills: OpenSearch / ELK / Elasticsearch, Python, Grafana, Splunk. Strong understanding of system availability, reliability, and performance metrics. Ability to interpret and work with time-series data, logs, and event-driven datasets. Familiarity with REST APIs, JSON data structures, and basic HTTP-based data retrieval.
- Monitoring & Observability Tools: Experience building interactive dashboards, panels, and visualizations in Grafana. Experience configuring data sources, variables, and templating. Understanding of Grafana’s query editors (PromQL, OpenSearch DSL inside Grafana, etc.). Experience building dashboards, visualizations, and saved queries in OpenSearch. Ability to navigate OpenSearch indices, mappings, and data structures. Proficiency with Splunk Search Processing Language (SPL). Ability to build dashboards and create drill-downs in Splunk.
- Data Querying & Analysis: Ability to write OpenSearch Query DSL to retrieve, filter, and aggregate data. Ability to write complex Splunk SPL queries with pipes, stats, eval, regex, and time functions. Experience writing aggregations (e.g., terms, date histogram, sum, avg, percentiles). Ability to troubleshoot data anomalies, missing fields, or inconsistent indexing patterns.
- Alerting & Threshold Monitoring: Ability to create and configure alerts in Grafana Alerting, Splunk Alerts, and OpenSearch Alerting / Anomaly Detection. Understanding of threshold-based alerting, time-windowed queries, and notification channels (email, webhook, PagerDuty, Slack, etc.). Experience tuning alerts to avoid noise (e.g., using cool-down periods and aggregation windows).
Bonus / Preferred Skills:
- Python Development: Ability to write Python scripts for data extraction, transformation, and automation. Familiarity with the Requests library for API calls, OpenSearch/Elasticsearch client libraries, JSON responses, and pandas DataFrames. Experience automating monitoring and reporting workflows.
- Airflow Experience: Creating and maintaining Airflow DAGs. Understanding of Airflow components: Operators, Tasks, Scheduling, Dependencies. Ability to build pipelines that extract data from OpenSearch or other observability platforms.
Soft Skills & Work Practices:
- Strong analytical and troubleshooting skills.
- Ability to document dashboards, alert logic, and query definitions.
- Ability to collaborate with SRE, DevOps, and platform engineering teams.
- Comfortable working in a fast-paced environment with multiple tools and data sources.
Special Requirements:
Local candidates highly preferred. Multiple locations: Phoenix, AZ / Salt Lake City, UT / Sunrise, FL.
Compensation & Location:
Salary: $120,000 – $160,000 per year (Estimated)
Location: Phoenix, AZ
Recruiter / Company – Contact Information
Email: aresan@smarttechlink.com