Job ID: JOB_ID_9249
Role: Resiliency and Recovery Engineer – Tech Lead
We are looking for a senior, hands-on Resiliency and Recovery Engineer to join our team. This role is key to improving production resiliency and recovery outcomes across critical services and payment rails. The ideal candidate will have a strong background in high-availability environments and a proven track record of driving measurable improvements in service recovery, alert coverage, automation, and release safety.
Key Responsibilities:
- Work across all MMC payment rails to develop faster, more repeatable resiliency and recovery processes.
- Identify resiliency gaps based on incident patterns and recurring failures; turn findings into prioritized remediation work.
- Build/strengthen monitoring, alerting, and dashboards that are actually used by engineers and leadership.
- Create runbooks and automate recovery actions to reduce manual toil and human error during incidents.
- Improve release safety and rollback/fallback readiness (clear, repeatable cutback procedures).
- Support SQL reliability efforts (SQL Server 2022 focus) in partnership with DB/infrastructure teams.
- Own the backlog, prioritization, design reviews, and cross-team coordination (Ops/Product/Tech).
- Run the weekly standup and prepare the bi-weekly executive readout.
- Integrate resilience testing into CI/CD pipelines and DevOps workflows.
- Conduct chaos engineering experiments (failure injections, game days) to proactively uncover system weaknesses and validate recovery processes.
- Document and share resiliency best practices; mentor and train engineering teams.
Must-Have Qualifications:
- Proven experience in high-availability, high-transaction environments (preferably payments or financial services).
- Strong background in production resiliency and recovery (recovery execution, runbooks/playbooks, RCA mindset).
- Experience with incident pattern analysis, MTTR baselines (P2 Major/Minor), and recurring-failure taxonomy.
- Senior-level observability expertise: dashboards, monitors, and alerts (Datadog preferred).
- Experience with Splunk, Datadog, SQL, JQL (Jira Query Language), and GitLab.
- Experience with CI/CD metrics and with generating executive reports on code quality, changes, and test automation from GitLab.
- Ability to assess the quality of stories, metrics, and monitoring, and to gather data that showcases deficiencies.
- Senior CI/CD experience: pipeline design/operation, release safety patterns, and rollback readiness.
- Experience using metrics and monitoring data to identify and communicate deficiencies.
- Automation skills: Python and/or PowerShell (or equivalent) for building repeatable recovery workflows and operational tooling.
- Kubernetes/container platform production troubleshooting.
- Experience with identity/credentials/certificate & secret-rotation resilience.
- Batch/scheduler/job-execution reliability.
- Distributed integration failure-handling (timeouts, retries, backpressure, idempotency, duplicate prevention, and reconciliation).
Nice-to-have (differentiators):
- Experience with SRE-style reliability practices (SLO/SLI thinking, error budgets, operational metrics).
- Experience with failover / DC flip / active-active or active-passive recovery concepts and scenario-based runbooks.
- Cloud Engineering (Azure, AWS).
- DevOps tools expertise (Jenkins, Terraform, SonarQube, Helm charts).
- Network & traffic-management incident triage.
Contract Details:
This is a 6-month contract role located onsite in Charlotte, NC.
Special Requirements
Location: Onsite, Charlotte, NC. Duration: 6 months. Employment Type: Contract.
Compensation & Location
Salary: $130,000 – $170,000 per year (Estimated)
Location: Charlotte, NC
Recruiter / Company – Contact Information
Email: habh.s@smartitframe.co