Site Reliability Engineer
Posted 7 days 4 hours ago by HCL Technologies B.V. Netherland
Permanent
Noord-Holland, Netherlands
Job Description
Role - Site Reliability Engineer
Location - Amsterdam, Netherlands
- Ensure Service Level Objective (SLO) levels are set and met
- Drive an Always Available mindset and behavior. Recognize shortcomings in knowledge and expertise, and deliver the necessary resources, skills, guidance, and training to DevOps teams where needed.
- Define and enhance standards for logging, monitoring, and alerting, and actively monitor end-to-end platform performance through white-box and black-box monitoring tools.
- Improve incident response practices and be actively engaged in the response to escalated and critical incidents. On-call duty is currently not part of the job, but you should have no objection to it if and when it is required.
- Participate in Root Cause Analysis (RCA). Prioritize and implement the RCA recommendations through improvement plans with the responsible Squads / DevOps teams.
- Drive Continuous improvement on all services in the EPI Platform through analysis of the current level of service, functional and technical setup, code, dev/ops practices and the underlying causes of incidents, underperformance, etc.
- Organize and coordinate platform tests such as DDoS, DR, Ceiling/Break, and Penetration tests.
- Set up and maintain automatic reporting and feedback loops.
- Contribute to automating Build, Test and Deployment practices through the CI/CD pipeline
- Contribute to tuning application resources and updating highly available deployment patterns of (mostly) container- and VM-based environments.
- Initiate and contribute to new SRE initiatives like AI Ops, Chaos Engineering, migrations to Public Cloud, and Error Budgeting
- Initiate and participate in experiments with new tools and concepts, and evaluate their value against set goals.
Background
- Operations expert: 5+ years of experience working with Agile DevOps principles
- Solid understanding of how technology setup and ITSM processes relate to service level objectives like Availability (time based, successful call rate, response times), MTTR, and MTBF.
- Good understanding of microservices architecture and related high availability / resilience patterns, and experience building systems with multiple layers of redundancy to withstand failures in software, hardware, and network infrastructure.
- Proven experience:
  - working as a Site Reliability Engineer or DevOps engineer
  - scripting in at least one of the following: Ruby, Python, Bash, PowerShell
  - setting up Build and Deployment pipelines in Azure DevOps (ADO)
  - setting up white-box monitoring and formulating meaningful metrics for monitoring and reporting
  - eliminating toil through automation and process optimization