SoCode Recruitment

Senior Site Reliability Engineer

Cambridge

£70k – £90k/yr

Posted 4 days ago

Hybrid

Full-time

Senior Level

Senior Site Reliability Engineer (SRE) | £70,000-£90,000 | Cambridge | 2/3 days per week | No Sponsorship Option

We’re hiring a Senior Site Reliability Engineer to help strengthen the reliability, observability and operational maturity of a cloud-native SaaS platform operating within a regulated environment. This is a hands-on role focused on production systems, monitoring, incident response, automation and operational excellence across a Kubernetes-based AWS platform.

You’ll work closely with Platform Engineering and Application teams to improve system health, reduce operational risk and build scalable reliability practices as the business continues to grow.

Key responsibilities:
• Building and improving observability across metrics, logs and traces
• Developing actionable dashboards, alerts, runbooks and operational tooling
• Supporting production systems, incident response and root cause analysis
• Improving reliability, resilience, deployment feedback loops and operational readiness
• Identifying operational inefficiencies and automating repetitive toil
• Driving post-incident reviews and long-term corrective improvements
• Helping define SLOs, SLIs and reliability standards across customer-critical services

Tech environment includes: AWS | Kubernetes / EKS | Observability | Prometheus | Grafana | OpenTelemetry | GitOps | Argo CD | CI/CD | Cloud Operations

We’re looking for someone with:
✔ Strong experience supporting Kubernetes-based production environments
✔ Practical AWS and cloud-native infrastructure knowledge
✔ Deep troubleshooting skills across distributed systems
✔ Experience with observability, monitoring and incident management
✔ Strong scripting or automation capability (Python, Go, Bash, TypeScript etc.)
✔ Calm, pragmatic thinking during live operational incidents
✔ Passion for improving reliability and reducing operational noise

Experience within SaaS, fintech or regulated environments would be highly beneficial.

This is an excellent opportunity for an engineer who enjoys solving real production challenges, improving operational resilience and building mature SRE practices within a scaling engineering organisation.

Skills

Kubernetes

AWS

Observability

Monitoring

Incident Management

Scripting

Automation

Production Systems

Cloud-Native Infrastructure

Troubleshooting

Operational Excellence

Root Cause Analysis

Post-Incident Reviews

Reliability

Deployment Feedback

Operational Readiness