
Lead Reliability Architect - Property & Casualty IT
- Bratislava, Bratislava Region
- Permanent employment
- Full-time
- Own and shape the reliability strategy for our Property & Casualty IT landscape, ensuring alignment with Swiss Re's broader technology and business objectives.
- Oversee the reliability and resilience characteristics of our business-critical application portfolio and drive their continuous improvement.
- Define and maintain blueprints, guidelines, and best practices for resilience, high availability, disaster recovery, and fault tolerance - ensuring they are practical, actionable, and consistently applied across all development teams.
- Work directly with application development teams to support the implementation of these blueprints and architectural principles across the whole Software Development Lifecycle.
- Define and govern the monitoring & alerting baseline for our applications, which includes defining golden signals, SLIs, and SLOs across the whole system landscape.
- Drive the adoption of the OpenTelemetry framework in our observability stack - across applications, platforms, and shared infrastructure.
- Partner closely with Operations (Run) teams to analyze operational incidents and derive actionable insights for improving system reliability and fault response capabilities.
- Act as a bridge between engineering and operations, fostering a culture of reliability, accountability, and continuous improvement.
- Mentor teams and advocate for SRE practices, ensuring a consistent understanding and application of resilience and observability standards across our engineering workforce.
- Well-established track record and senior-level hands-on background in software and reliability engineering with a focus on distributed systems and high-availability architectures in public cloud environments (ideally Azure).
- Deep expertise in reliability and resilience engineering, including concepts like redundancy and failover, fault tolerance and graceful degradation, circuit breakers, retry patterns, chaos engineering, and auto-healing.
- Solid experience in operating applications at scale, ideally within regulated or mission-critical environments.
- Familiarity with Google's Site Reliability Engineering (SRE) practices, especially around SLIs and SLOs, error budgets, and operational readiness.
- Strong background in monitoring, telemetry, and observability, with a focus on defining effective metrics and alerts that reduce noise and improve incident detection.
- Hands-on experience with OpenTelemetry and related observability tools (e.g., Prometheus, Grafana, Jaeger, Elastic) would be a plus.
- Experience collaborating in DevOps and hybrid cloud environments, ideally with exposure to containerized and microservices architectures.
- Strong thought leadership and influencing skills; ability to challenge the status quo and advocate for meaningful change.
- Architectural mindset, with a structured approach to problem-solving and strong planning and design capabilities.
- High personal integrity, accountability, and a proactive approach to ownership and decision-making.
- Excellent collaboration and communication skills, able to build trusted relationships across teams, functions, and geographies.
- Team player with the ability to work across disciplines and bring people together around shared goals.
- Demonstrated ability to foster understanding between application development and operations teams - serving as a translator and facilitator between the two worlds.
- Fluent in English, both written and spoken.
- the requirements, scope, complexity and responsibilities of the role,
- the applicant's own profile including education/qualifications, expertise, specialisation, skills and experience.
Reference Code: 134631