Site Reliability Engineer

Posted 7 days 4 hours ago by HCL Technologies B.V. Netherland

Permanent

Not Specified

Other

Noord-Holland, Netherlands

Job Description

Role - Site Reliability Engineer

Location - Amsterdam, Netherlands

Ensure Service Level Objective (SLO) levels are set and met
Drive Always Available mindset and behavior. Be able to recognize shortcomings in knowledge and expertise, and deliver the necessary resources, skills, guidance and training to DevOps teams where needed.
Define and enhance standards for logging, monitoring, and alerting, and actively monitor end-to-end platform performance through white-box and black-box monitoring tools.
Improve incident response practices and be actively engaged in the response to escalated and critical incidents. On-call duty is currently not part of the job, but should not be an objection if and when required.
Participate in Root Cause Analysis (RCA). Prioritize and implement the RCA recommendations through improvement plans with the responsible Squads / DevOps teams.
Drive continuous improvement on all services in the EPI Platform through analysis of the current level of service, the functional and technical setup, code, dev/ops practices, and the underlying causes of incidents, underperformance, etc.
Organize and coordinate platform tests such as DDoS, DR, ceiling/break, and penetration tests.
Set up and maintain automatic reporting and feedback loops
Contribute to automating Build, Test and Deployment practices through the CI/CD pipeline
Contribute to tuning application resources and updating highly available deployment patterns of (mostly) container and VM based environments.
Initiate and contribute to new SRE initiatives like AI Ops, Chaos Engineering, migrations to Public Cloud, and Error Budgeting

Participate in and initiate experiments with new tools and concepts, and evaluate their value against set goals

Background

Operations expert: 5+ years of experience working with Agile DevOps principles
Solid understanding of how technology setup and ITSM processes relate to service level objectives like Availability (time based, successful call rate, response times), MTTR, and MTBF.
Good understanding of microservices architecture and related high availability / resilience patterns, and experience building systems with multiple layers of redundancy to withstand failures in software, hardware, and network infrastructure.
Proven experience:
working as a Site Reliability Engineer or DevOps engineer
scripting in at least one of the following: Ruby, Python, Bash, PowerShell
setting up Build and Deployment pipelines in Azure DevOps (ADO)
setting up white-box monitoring and formulating meaningful metrics for monitoring and reporting
eliminating toil through automation and process optimization
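
As an illustration of how the service level measures named above relate to each other, the following is a minimal sketch in Python. It assumes the commonly used steady-state formula, availability = MTBF / (MTBF + MTTR), and a time-based error budget over a 30-day window. The function names and the example figures are hypothetical and are not taken from the EPI Platform.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by a time-based availability SLO, in minutes."""
    return (1.0 - slo) * window_days * 24 * 60


# Hypothetical figures: one failure every 500 hours, 30 minutes to recover.
print(f"availability: {availability(500, 0.5):.4%}")
print(f"30-day budget at 99.9% SLO: {error_budget_minutes(0.999):.1f} min")
```

Under these assumptions, lowering MTTR or raising MTBF both move availability toward the SLO, and the error budget expresses how much downtime remains available for releases and experiments within the window.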