The Senior Site Reliability Engineer ensures the reliability, stability, and performance of Ericsson's software platforms by implementing monitoring systems, optimizing CI/CD workflows, and enhancing incident management processes.
Position Summary
The Site Reliability Engineer (SRE) plays a pivotal role in ensuring the reliability, stability, and performance of Ericsson's critical software platforms. This position combines software engineering expertise with a focus on operational efficiency to minimize downtime and drive reliability across large-scale developer platforms. The ideal candidate will leverage their experience to enhance monitoring systems, streamline incident management processes, and improve resilience, enabling Ericsson's global engineering teams to innovate without disruption. A proven ability to implement enterprise-level reliability solutions and best practices is essential for this high-impact role.
• Manage the reliability and scalability of Ericsson's global multi-node GitLab infrastructure, supporting thousands of repositories. Responsibilities include implementing automated failover and redundancy, deploying robust 24/7 monitoring with industry-standard tools, and ensuring seamless CI/CD workflows through proactive performance optimization and rapid incident response.
• Drive operational excellence for developer platforms such as GitLab, Backstage, and other enterprise-scale systems by advancing platform health and facilitating uninterrupted software delivery.
• Design and implement tools, processes, and frameworks for effective incident management, performance optimization, and fault tolerance, ensuring consistent reliability across platforms.
• Lead automation and monitoring initiatives to reinforce the resilience of Ericsson's platforms in diverse operational settings, supporting continuous improvement and scalability.
Core Responsibilities
1. Enhance Platform Reliability and Stability
a. Identify and mitigate reliability risks across Ericsson's developer platforms, with a focus on proactively improving platform resilience.
b. Architect solutions for fault tolerance and failure recovery within hybrid, cloud-native, or distributed infrastructures.
c. Monitor and optimize system performance metrics such as error rates, latency, and uptime to ensure seamless operations.
2. Develop Comprehensive Monitoring and Observability Frameworks
a. Implement real-time monitoring systems (e.g., Prometheus, Grafana, ELK, OpenTelemetry) to track system performance and uncover reliability patterns.
b. Create dashboards and telemetry tools to provide deep insights into platform health, facilitating proactive issue resolution and scaling decisions.
3. Incident Response and Post-Mortem Management
a. Establish scalable incident management protocols, including real-time responses, root cause analyses (RCA), and structured post-mortems.
b. Develop automation pipelines for detection, diagnosis, and recovery from system disruptions or outages.
c. Minimize mean time to resolution (MTTR) for key platforms while maintaining a high standard for post-incident reporting and improvement planning.
4. Optimize Performance at Scale
a. Implement capacity planning and system hardening strategies that support increasing developer activity and workload demand.
b. Design resilient architectures capable of scaling while maintaining adherence to Ericsson's stringent performance benchmarks.
5. Promote Collaboration to Achieve Operational Excellence
a. Partner with platform engineers to integrate reliability principles into internal developer platform (IDP) workflows, ensuring CI/CD pipelines are fault-tolerant and scalable.
b. Mentor engineering teams and contribute to shaping a culture of continuous improvement in reliability practices.
c. Collaborate across functional teams (Engineering, DevOps, Infrastructure) to align platform reliability efforts with Ericsson's strategic goals.
Preferred Qualifications
• Direct Experience:
o 5-8 years in SRE or platform engineering roles, demonstrating success in scaling reliability solutions across large, global platforms.
o Proven contributions to reducing downtime, improving platform performance metrics, and establishing operational governance practices.
o Documented achievements in leading reliability-focused initiatives within complex organizational environments.
• Technical Expertise:
o Advanced expertise in tooling for monitoring and observability, including Prometheus, Grafana, Splunk, ELK Stack, or OpenTelemetry.
o Proficiency in automation and infrastructure tools such as Kubernetes, Terraform, or Ansible.
o Strong programming skills (e.g., Python, Go, or Bash) for developing system optimizations and observability tools.
o Understanding of CI/CD frameworks and platform governance principles, specifically in enterprise solutions like GitLab.
• Interpersonal Skills:
o Proven ability to lead cross-functional initiatives and mentor teams on reliability best practices.
o Excellent problem-solving skills with a proactive approach to turning reliability needs into actionable solutions.
o Results-driven focus on measurable impact and continuous improvement.
Impact of the Role
• Strategic Enhancement:
o By ensuring the reliability of core developer platforms such as GitLab and Backstage, the SRE contributes to Ericsson's ability to deliver high-quality software faster and with consistent performance.
o Reliability efforts empower Ericsson's global engineering teams to innovate freely while maintaining operational stability.
• High-Impact Leadership:
o This role demands systems-level leadership to shape a robust site reliability culture and drive architectural innovations that prioritize resilience at scale.
o Implementing effective reliability frameworks will directly benefit Ericsson's competitiveness in the telecom industry by enabling reliable delivery across high-demand development ecosystems.
Why join Ericsson?
At Ericsson, you'll have an outstanding opportunity: the chance to use your skills and imagination to push the boundaries of what's possible, and to build solutions never seen before to some of the world's toughest problems. You'll be challenged, but you won't be alone. You'll be joining a team of diverse innovators, all driven to go beyond the status quo to craft what comes next.
Encouraging a diverse and inclusive organization is core to our values at Ericsson; that's why we champion it in everything we do. We truly believe that by collaborating with people with different experiences we drive innovation, which is essential for our future growth. We encourage people from all backgrounds to apply and realize their full potential as part of our Ericsson team. Ericsson is proud to be an Equal Opportunity Employer.
Primary country and city: Ireland (IE) || Athlone
Req ID: 780375
Top Skills
Ansible
Bash
ELK Stack
GitLab
Go
Grafana
Kubernetes
OpenTelemetry
Prometheus
Python
Terraform

