NEW - Posted 1 hour ago

Job ID: JOB_ID_9249

Role: Resiliency and Recovery Engineer – Tech Lead

We are looking for a senior, hands-on Resiliency and Recovery Engineer to join our team. The role focuses on improving production resiliency and recovery outcomes across critical services and payment rails. The ideal candidate has a strong background in high-availability environments and a proven track record of driving measurable improvements in service recovery, alert coverage, automation, and release safety.

Key Responsibilities:

  • Work across all MMC payment rails to develop faster, more repeatable resiliency and recovery processes.
  • Identify resiliency gaps based on incident patterns and recurring failures; turn findings into prioritized remediation work.
  • Build and strengthen monitoring, alerting, and dashboards that engineers and leadership actually use.
  • Create runbooks and automate recovery actions to reduce manual toil and human error during incidents.
  • Improve release safety and rollback/fallback readiness (clear, repeatable cutback procedures).
  • Support SQL reliability efforts (SQL Server 2022 focus) in partnership with DB/infrastructure teams.
  • Own the backlog, prioritization, design reviews, and cross-team coordination (Ops/Product/Tech).
  • Run the weekly standup and prepare the bi-weekly executive readout.
  • Integrate resilience testing into CI/CD pipelines and DevOps workflows.
  • Conduct chaos engineering experiments (failure injections, game days) to proactively uncover system weaknesses and validate recovery processes.
  • Document and share resiliency best practices; mentor and train engineering teams.

Must-Have Qualifications:

  • Proven experience in high-availability, high-transaction environments (preferably payments or financial services).
  • Strong background in production resiliency and recovery (recovery execution, runbooks/playbooks, RCA mindset).
  • Experience with incident pattern analysis, MTTR baselines (P2 Major/Minor), and recurring-failure taxonomy.
  • Senior-level observability expertise: dashboards, monitors, and alerts (Datadog preferred).
  • Experience with Splunk, Datadog, SQL, JQL (Jira Query Language), and GitLab.
  • Experience with CI/CD metrics and with generating executive reports on code quality, changes, and test automation from GitLab.
  • Ability to assess the quality of stories, metrics, and monitoring, and to gather data that demonstrates deficiencies.
  • Senior CI/CD experience: pipeline design/operation, release safety patterns, and rollback readiness.
  • Experience using metrics and monitoring data to identify and communicate deficiencies.
  • Automation skills: Python and/or PowerShell (or equivalent) for building repeatable recovery workflows and operational tooling.
  • Kubernetes/container platform production troubleshooting.
  • Experience with identity, credential, certificate, and secret-rotation resilience.
  • Batch/scheduler/job-execution reliability.
  • Distributed integration failure-handling (timeouts, retries, backpressure, idempotency, duplicate prevention, and reconciliation).

Nice-to-have (differentiators):

  • Experience with SRE-style reliability practices (SLO/SLI thinking, error budgets, operational metrics).
  • Experience with failover / DC flip / active-active or active-passive recovery concepts and scenario-based runbooks.
  • Cloud Engineering (Azure, AWS).
  • DevOps tools expertise (Jenkins, Terraform, SonarQube, Helm charts).
  • Network & traffic-management incident triage.

Contract Details:

This is a 6-month contract role located onsite in Charlotte, NC.


Special Requirements

Location: Onsite, Charlotte, NC. Duration: 6 months. Employment Type: Contract.


Compensation & Location

Salary: $130,000 – $170,000 per year (Estimated)

Location: Charlotte, NC


Recruiter / Company – Contact Information

Email: habh.s@smartitframe.co

