Jobgether

Senior Support Engineer

United Kingdom

Posted 1 day ago

Senior Support Engineer – Cloud & AI Infrastructure (UK-based) – "Critical Role in AI/ML & Distributed Systems Support"

About the Role

This position is available through a partner company based in the United Kingdom. As a Senior Support Engineer, you’ll operate at the heart of production-grade cloud environments, focusing on AI, distributed computing, and GPU workloads.

This hands-on role demands deep technical expertise in diagnosing, escalating, and resolving complex infrastructure issues across:

Linux systems
Kubernetes and containerised environments
Networking, storage, and GPU-based architectures

You’ll act as the lead escalation point for critical incidents, collaborating directly with engineering and customers to restore system stability, conduct root cause analysis, and drive permanent improvements. Beyond traditional support, you’ll contribute to:
- Enhancing observability and monitoring tools
- Automating troubleshooting workflows
- Optimising operational maturity across large-scale cloud platforms

This role suits individuals who thrive in dynamic environments and enjoy the challenge of ambiguous, high-stakes technical problem-solving.

Key Accountabilities

Diagnosis & Resolution

Investigate, troubleshoot, and resolve high-impact production issues with root cause analysis as a top priority.
Debug multi-layered systems, including:
- Linux environments (performance, logging, misconfigurations)
- Kubernetes clusters (node performance, pod behaviour, scaling)
- Networking layers (latency, packet loss, distributed traffic issues)
- Storage systems (I/O bottlenecks, cluster replication)
- GPU-accelerated workloads (driver issues, resource contention)

Escalation & Collaboration

Serve as the senior escalation point for critical incidents, ensuring rapid resolutions.
Work closely with engineering teams to:
- Reproduce and mitigate issues
- Identify systemic dependencies and drive long-term fixes
Support customer-facing incidents, including AI/ML pipelines and inference/training workloads.

Tooling & Automation

Develop and improve internal tools (automation scripts, dashboards) in:
- Python, Bash, Go, or equivalent languages.
Enhance scripting efficiency for repetitive troubleshooting tasks.

Observability & Reliability

Contribute to operational excellence by:
- Advocating for better monitoring and alerting mechanisms.
- Streamlining debugging workflows through structured post-incident reviews.
Improve platform reliability and scalability.

Incident Response

Participate in 24/7 incident-response rotations, including weekend on-call shifts.

Requirements

Technical Expertise

Strong Linux administration skills (RHEL, Ubuntu, kernel and log debugging).
Kubernetes expertise, including:
- Deep knowledge of resource scheduling, networking (CNI), and cluster management.
- Experience with both self-hosted and cloud-managed clusters.
Cloud infrastructure proficiency:
- AWS, GCP, Azure, or OpenStack.
- Hands-on experience with orchestration, scaling, and cross-service dependencies.
Networking fundamentals, with skills in:
- Troubleshooting complex distributed networks (routing, DNS, load balancing).
- Latency/connectivity patterns in containerised and cloud-native environments.

Troubleshooting & Debugging

Ability to successfully reproduce and correct incidents under pressure.
Capability to ascertain root causes from logs, metrics, and distributed tracing.
Strong collaboration in cross-team scenarios (DevOps, SRE, platform teams).

Automation & Scripting

Scripting in Python/Bash/Go for:
- Automating repetitive tasks (diagnostics, infrastructure validation).
- Building lightweight monitoring or alerting scripts.

AI & GPU Experience (Highly Desirable)

Prior experience with GPU-based computing (CUDA, drivers, resource contention).
Experience working on AI/ML workloads, including:
- Model training pipelines and inference.
- Debugging end-to-end distributed AI models.

Soft Skills

Analytical thinking for ambiguous, high-pressure scenarios.
Clear communication of technical issues to non-technical audiences.
Ability to document findings for internal learning and improvement.

Benefits & Perks

Competitive compensation (aligned with experience and expertise).
Career growth opportunities, with emphasis on learning and technical development.
Flexible working, high autonomy, and ownership over systems.
Exposure to cutting-edge technology, including:
- Large-scale distributed systems.
- AI-driven infrastructure.
Collaborative environment with skilled, internationally diverse engineering teams.
Impact-oriented role, operating at the intersection of modern cloud and AI infrastructure.
Inclusive and innovation-driven culture, focused on:
- Continuous improvement.
- Best practices, not process bureaucracy.

How Jobgether Works

This position is managed through an AI-matching process for fair and efficient candidate reviews. Applicants are screened for technical fit, then shortlisted directly for the partner company’s internal hiring team, who handle interview scheduling, assessments, and next steps.

Skills

Linux

Kubernetes

Cloud Infrastructure

Networking

Scripting

Debugging

AI/ML Workloads

Observability

Automation

Containerized Applications

Distributed Systems

GPU-Based Systems

Technical Communication

Root Cause Analysis

Operational Improvements

Incident Response

Location

United Kingdom