
TGS International Group

Site Reliability Engineer - GPU

London
Posted 2 days ago

Senior Site Reliability Engineer – Data Centres (UK-Based, Remote)

UK-Based – Remote (with occasional travel across Europe)
Competitive Salary

Role Description

As a Senior Site Reliability Engineer, you own the stability, availability, and performance of large-scale platforms running across our partners' data centre sites.

You serve as the L3 escalation point for the platform, developing and optimising scalable infrastructure, automating workflows, and implementing monitoring, logging, and alerting solutions.

Core Responsibilities

  • Analyse and resolve the most complex incidents, including those that no one else has been able to resolve

  • Conduct root-cause analyses, plan capacity, and collaborate with development and operations to maintain service reliability, security, and efficiency

  • Stay hands-on with:

    • Provisioning automation (Ansible, PXE/iPXE)
    • Infrastructure as Code (Terraform/OpenTofu)
    • Production-grade Python tooling for bare-metal, GPU, and storage fleets
  • Focus on bleeding-edge AI infrastructure, including:

    • NVIDIA GPU systems, BlueField DPUs, and Host-Based Networking (HBN)
    • InfiniBand fabrics and high-performance storage (VAST, Pure Storage, Lustre, GPFS/Spectrum Scale, etc.)
  • Debug poorly documented hardware and firmware across GPU, DPU, fabric, kernel, and storage layers

  • Participate in a 24×7 follow-the-sun on-call rotation with clear handover processes

  • Work in a remote-first (Germany/UK-based) environment with minimal European travel

  • Shape processes, tools, and best practices, contributing ideas to improve reliability, security, and cost efficiency

  • Act as a technical sparring partner for colleagues


Qualifications

Required Experience

  • GPU, DPU & Low-Latency Fabrics

    • Hands-on production experience with:
      • NVIDIA GPU systems and BlueField DPUs
      • Host-Based Networking (HBN)
      • InfiniBand fabrics (subnet management, topology, congestion, fault isolation)
      • RoCE (RDMA over Converged Ethernet)
    • Confidence debugging bare-metal Linux environments, including firmware and hardware issues, at the kernel and fabric level
  • High-Performance Storage

    • Real-world operational experience with high-performance storage at scale (e.g., VAST, Pure Storage, Lustre, GPFS/Spectrum Scale, or WekaFS)
    • Ability to diagnose performance and data-path issues across storage and network stacks
  • Site Reliability & Infrastructure Engineering

    • 5+ years in Site Reliability, Platform, or Infrastructure Engineering operating highly available production systems
    • Proficient in infrastructure-as-code (Terraform/OpenTofu, Ansible) in bare-metal environments
    • Strong Linux systems administration expertise (kernel parameters, networking, storage, systemd)

  • Systems Administration & Troubleshooting

    • Deep hands-on expertise in bare-metal Linux environments (Debian/Ubuntu)
    • Networking fundamentals and security best practices
    • Proven track record in structured L3 problem-solving and root-cause analysis
    • Data centre hardware experience (BMC/iLO/iDRAC, racking)
  • Software Development & Automation

    • Production-grade Python (or Go) for internal tooling and automation
    • Strong Bash scripting for operational workflows
    • Comfortable with CI/CD pipelines and network-based provisioning (PXE, iPXE, preseed, kickstart)
  • Monitoring & Operational Processes

    • Hands-on experience with:
      • Monitoring/observability (Prometheus, VictoriaMetrics, Grafana, OpenTelemetry, ELK)
      • Incident/Change/Problem Management in a 24×7 on-call rotation
  • Collaboration & Communication

    • Technical sparring partner for development and operations teams
    • Excellent written and spoken English (team operates fully in English)
    • Based in the UK or Germany

Bonus Experience

  • NVIDIA DOCA and firmware-level troubleshooting
  • HPC/AI-ML cluster operations (SLURM)
  • Kubernetes (ingress-nginx, cert-manager, ArgoCD, Kustomize)
  • Harbor, OpenBao, Portworx
  • DCIM tools (Nautobot, NetBox)
  • Virtualisation (Proxmox VE, OpenShift Virtualization/KubeVirt)

Skills

Site Reliability Engineering
NVIDIA GPU Systems
BlueField DPUs
Host-Based Networking
InfiniBand Fabrics
High-Performance Storage
Infrastructure as Code
Terraform
Ansible
Linux
Python
Bash
Monitoring
Observability
CI/CD
Troubleshooting

Location

London, England, United Kingdom
