Senior Site Reliability Engineer
Cambridge
£70k – £90k/yr
Hybrid
Full-time
Senior Level
Senior Site Reliability Engineer (SRE) | £70,000-£90,000 | Cambridge (hybrid, 2-3 days per week) | No sponsorship available
We’re hiring a Senior Site Reliability Engineer to help strengthen the reliability, observability and operational maturity of a cloud-native SaaS platform operating within a regulated environment. This is a hands-on role focused on production systems, monitoring, incident response, automation and operational excellence across a Kubernetes-based AWS platform.
You’ll work closely with Platform Engineering and Application teams to improve system health, reduce operational risk and build scalable reliability practices as the business continues to grow.
Key responsibilities:
• Building and improving observability across metrics, logs and traces
• Developing actionable dashboards, alerts, runbooks and operational tooling
• Supporting production systems, incident response and root cause analysis
• Improving reliability, resilience, deployment feedback loops and operational readiness
• Identifying operational inefficiencies and automating repetitive toil
• Driving post-incident reviews and long-term corrective improvements
• Helping define SLOs, SLIs and reliability standards across customer-critical services
Tech environment includes: AWS | Kubernetes / EKS | Observability | Prometheus | Grafana | OpenTelemetry | GitOps | Argo CD | CI/CD | Cloud Operations
We’re looking for someone with:
✔ Strong experience supporting Kubernetes-based production environments
✔ Practical AWS and cloud-native infrastructure knowledge
✔ Deep troubleshooting skills across distributed systems
✔ Experience with observability, monitoring and incident management
✔ Strong scripting or automation capability (Python, Go, Bash, TypeScript, etc.)
✔ Calm, pragmatic thinking during live operational incidents
✔ Passion for improving reliability and reducing operational noise
Experience within SaaS, fintech or regulated environments would be highly beneficial.
This is an excellent opportunity for an engineer who enjoys solving real production challenges, improving operational resilience and building mature SRE practices within a scaling engineering organisation.
Skills
Kubernetes
AWS
Observability
Monitoring
Incident Management
Scripting
Automation
Production Systems
Cloud-Native Infrastructure
Troubleshooting
Operational Excellence
Root Cause Analysis
Post-Incident Reviews
Reliability
Deployment Feedback
Operational Readiness