Job ID: JOB_ID_8945
Job Overview:
The Kafka Tier 3 Support Engineer is a senior technical role responsible for expert-level support, advanced troubleshooting, performance engineering, and platform stabilization of enterprise Apache Kafka environments. This role functions as the final technical escalation point for Kafka-related production incidents and is accountable for root cause analysis (RCA), complex remediation, and long-term prevention. The engineer works closely with Tier 2 operations, Platform Engineering, SRE teams, application teams, and vendor support (AWS MSK, Confluent, or other cloud providers) to ensure Kafka remains a highly reliable, scalable, and secure streaming backbone.
Key Responsibilities:
- Tier 3 Incident Management & Escalation Support: Act as the highest technical escalation point for Kafka production incidents (Sev1 / Sev2). Lead deep troubleshooting across broker instability, controller elections, ISR shrinkage, under-replicated partitions, leader imbalance, producer/consumer failures, lag spikes, rebalance storms, and disk, network, JVM, and request handler saturation. Provide hands-on remediation for complex issues, including partition reassignment, leader rebalance, broker configuration tuning, and throttle/quota strategies. Coordinate with vendor support during service incidents, providing logs, metrics, and forensic details. Guide Tier 2 teams during major incidents and validate restoration actions.
- Kafka Performance Engineering & Optimization: Analyze Kafka workloads for performance and scalability risks, including partition skew, hot partitions, inefficient producer batching/compression, consumer lag root cause analysis, and thread pool, I/O, and network bottlenecks. Recommend and validate topic design (partition count, replication factor, retention, compaction), producer and consumer configuration best practices, and quotas/enforcement/multitenant controls. Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.
- Platform Stability, Reliability & Resilience: Diagnose and resolve systemic Kafka stability issues such as repeated broker failures, controller instability (ZooKeeper or KRaft), and recovery issues following failovers or maintenance events. Support resilience initiatives including Multi-AZ cluster health validation, replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns), and failover testing. Define and improve Kafka SLOs for availability, durability, and latency.
- Change, Upgrade & Configuration Leadership: Lead medium to high-risk Kafka changes, including broker and cluster configuration changes, partition expansion, large-scale reassignment, and topic policy changes. Support and plan Kafka version upgrades, MSK/Confluent upgrade cycles, and client compatibility/rollout strategies. Participate in CAB reviews, assess risk, and design rollback and validation plans.
- Root Cause Analysis & Continuous Improvement: Own RCA documentation for major incidents with clear corrective and preventive actions (CAPA). Identify recurring failure patterns and architectural gaps. Recommend platform-level improvements such as automation opportunities, guardrails, standards, and monitoring/alerting enhancements. Contribute to the continuous improvement of runbooks, knowledge base articles, and operational playbooks.
- Mentorship & Collaboration: Provide technical guidance and mentoring to Tier 2 Kafka support teams. Collaborate with application teams on Kafka client usage and best practices, Platform and SRE teams on capacity planning and reliability engineering, and Security teams on access control, encryption, and compliance requirements. Act as a subject matter expert for Kafka within the organization.
Required Technical Skills:
- Kafka & Streaming: Strong hands-on experience with Apache Kafka. Experience supporting at least one of: AWS MSK, Confluent Platform/Cloud, or self-managed Kafka (VM or Kubernetes). Deep understanding of brokers, partitions, replication, ISR, leader election, consumer groups, rebalancing, and producer/consumer internals and failure modes.
- Operations & Performance: Expertise in diagnosing consumer lag, throughput bottlenecks, broker disk/network/JVM performance, and metadata/controller instability. Experience with monitoring and observability tools (Kafka metrics, CloudWatch, Prometheus, Grafana, etc.).
- Security & Governance: Knowledge of Kafka security concepts including TLS, authentication (IAM, SASL/SCRAM), ACLs/RBAC, and the principle of least privilege. Experience supporting regulated or multitenant environments.
Preferred / Nice-to-Have Skills:
- Experience with Kafka Connect, Schema Registry, or streaming frameworks.
- Exposure to KRaft-based Kafka deployments.
- Cloud platforms (AWS preferred; Azure/GCP beneficial).
- Automation and IaC experience for Kafka operations.
- Experience in SRE or DevOps-aligned environments.
Professional Attributes:
- Strong analytical and structured problem-solving skills.
- Ability to remain calm and decisive during Sev1 incidents.
- Clear written and verbal communication skills, including executive-level RCA reporting.
- Strong collaboration and stakeholder management skills.
- Proactive mindset focused on prevention and reliability.
Education & Experience:
- Bachelor's degree in Computer Science, Engineering, or equivalent experience.
- 8-12+ years of overall IT experience.
- 4+ years of hands-on Kafka production support experience.
- Proven experience supporting business-critical streaming platforms.
Success Measures:
- Mean Time to Resolution (MTTR) for Sev1 and Sev2 Kafka incidents.
- Reduction in repeat Kafka incidents.
- Kafka availability and durability SLO attainment.
- Stability during upgrades and major changes.
- Quality and effectiveness of RCAs and preventive actions.
Special Requirements
Experience supporting regulated or multitenant environments. Experience with AWS MSK, Confluent Platform/Cloud, or self-managed Kafka. Experience with Kafka Connect, Schema Registry, or streaming frameworks is preferred. Exposure to KRaft-based Kafka deployments is preferred. Experience with cloud platforms (AWS preferred; Azure/GCP beneficial). Automation and IaC experience for Kafka operations is preferred. Experience in SRE or DevOps-aligned environments is preferred.
Compensation & Location
Salary: $166,400 – $249,600 per year
Location: Canton, MA
Recruiter / Company – Contact Information
Email: ith.a@itechus.net