Job ID: JOB_ID_9249

Role: Resiliency and Recovery Engineer – Tech Lead

We are looking for a senior, hands-on Resiliency and Recovery Engineer to join our team. This role is central to improving production resiliency and recovery outcomes across critical services and payment rails. The ideal candidate has a strong background in high-availability environments and a proven track record of driving measurable improvements in service recovery, alert coverage, automation, and release safety.

Key Responsibilities:

Work across all MMC payment rails to develop faster, more repeatable resiliency and recovery processes.
Identify resiliency gaps based on incident patterns and recurring failures; turn findings into prioritized remediation work.
Build/strengthen monitoring, alerting, and dashboards that are actually used by engineers and leadership.
Create runbooks and automate recovery actions to reduce manual toil and human error during incidents.
Improve release safety and rollback/fallback readiness (clear, repeatable cutback procedures).
Support SQL reliability efforts (SQL Server 2022 focus) in partnership with DB/infrastructure teams.
Own the backlog, prioritization, design reviews, and cross-team coordination (Ops/Product/Tech).
Run the weekly standup and prepare the bi-weekly executive readout.
Integrate resilience testing into CI/CD pipelines and DevOps workflows.
Conduct chaos engineering experiments (failure injections, game days) to proactively uncover system weaknesses and validate recovery processes.
Document and share resiliency best practices; mentor and train engineering teams.

Must-Have Qualifications:

Proven experience in high-availability, high-transaction environments (preferably payments or financial services).
Strong background in production resiliency and recovery (recovery execution, runbooks/playbooks, RCA mindset).
Experience with incident pattern analysis, MTTR baselines (P2 Major/Minor), and recurring-failure taxonomy.
Senior-level observability expertise: dashboards, monitors, and alerts (Datadog preferred).
Experience with Splunk, Datadog, SQL, JQL (Jira Query Language), and GitLab.
Experience with CI/CD metrics and with generating executive reports on code quality, changes, and test automation from GitLab.
Understanding of story quality, metrics, and monitoring practices, with the ability to gather data that exposes deficiencies.
Senior CI/CD experience: pipeline design/operation, release safety patterns, and rollback readiness.
Experience using metrics and monitoring data to identify and communicate deficiencies.
Automation skills: Python and/or PowerShell (or equivalent) for building repeatable recovery workflows and operational tooling.
Kubernetes/container platform production troubleshooting.
Experience with identity, credential, certificate, and secret-rotation resilience.
Batch/scheduler/job-execution reliability.
Distributed integration failure-handling (timeouts, retries, backpressure, idempotency, duplicate prevention, and reconciliation).

Nice-to-have (differentiators):

Experience with SRE-style reliability practices (SLO/SLI thinking, error budgets, operational metrics).
Experience with failover / DC flip / active-active or active-passive recovery concepts and scenario-based runbooks.
Cloud Engineering (Azure, AWS).
DevOps tools expertise (Jenkins, Terraform, SonarQube, Helm charts).
Network & traffic-management incident triage.

Contract Details:

This is a 6-month contract role located onsite in Charlotte, NC.

Special Requirements

Location: Onsite, Charlotte, NC. Duration: 6 months. Employment Type: Contract.

Compensation & Location

Salary: $130,000 – $170,000 per year (Estimated)

Location: Charlotte, NC

Recruiter / Company – Contact Information

Email: habh.s@smartitframe.co