Senior Site Reliability Engineer
Posted 1 day 1 hour ago by Dormont Manufacturing Co
As a Senior Site Reliability Engineer, you will be part of a team that is passionate about automating everything possible to make Guidewire systems run more efficiently. The Platform team is dedicated full-time to creating and running software that improves the reliability of systems in production, serving hundreds of customers and supporting millions of transactions each day. You will ensure the reliability of Guidewire's flagship cloud platform and InsuranceSuite products and build tooling to help ensure efficient operations and optimal availability of all multi-tenant SaaS and customer-focused systems. Platform SREs collaborate closely with Guidewire's core product developers to ensure that the Guidewire core cloud products address functional and non-functional requirements such as availability, performance, observability, and maintainability.
Responsibilities
- Take a purist SRE approach to shared multi-tenant infrastructure for resilient, microservice-based, containerized SaaS systems, in addition to customer-centric application environments
- Oversee and automate the team's growing presence in AWS
- Contribute to core infrastructure systems development with features, bug fixes, reliability improvements, etc
- Provide platform reliability engineering for a complex SAML/OAuth-based single sign-on central authentication platform
- Creatively build and develop tooling to help drive 24x7x365 follow-the-sun operations of critical production systems
- Automate deployment tasks for core product and infrastructure tools and maintain automation infrastructure
- Create system documentation and training materials to empower and educate our fellow team members
- Build and maintain observability tooling, metrics, and dashboarding for a global platform product infrastructure
- Improve our incident management lifecycle to identify, mitigate, and learn from reliability risks and issues
- Enhance platform observability by helping create a self-healing approach to platform reliability
- Collaborate with engineering teams, providing product feedback and, where necessary, contributing code to the product
Qualifications
- Bachelor's degree in Computer Science or related field
- Software engineering and task automation skills with Bash, Python, and/or Go are a must.
- Solid understanding of agile software development methodologies (Scrum, Kanban, etc.)
- Deep background with Linux systems and engineering
- Highly experienced with engineering and automating on Amazon Web Services (AWS)
- Experience supporting web applications running on Java / Apache / Tomcat in a live production environment
- Prior experience with IaC tools like Terraform/Terragrunt/Terraspace
- Prior experience with DevOps/GitOps tools (Git, Bitbucket, Flux CD, TeamCity) for gate promotions
- Background supporting production at scale in a heavily microservice-based environment
- Hands-on engineering and ops expertise in containerization (Docker, Helm, Kubernetes/EKS, CNI and Ingress networking)
- Strong understanding of single sign-on (SSO), SAML, and OAuth (bonus for hands-on experience with Okta)
- Seasoned expertise in X.509 certificate technology and basic concepts of encryption
- Experience working with relational databases such as Aurora Postgres and/or Oracle RDS
- Advanced exposure to application development, web UI (design and development), JSON, application architecture
- Extensive experience with observability tools (logging/APM) such as Datadog, CloudWatch, and PagerDuty
- Familiarity with event store/stream-processing technologies like Kafka or AWS SQS
- Understanding of Open Application Model systems such as KubeVela or Crossplane
- Demonstrated ability to embrace AI and apply it to your role as well as use data driven insights to drive innovation, productivity, and continuous improvement.
- You greatly prefer writing code to clicking through a GUI
- You enjoy teaching, being a mentor to others, and working across boundaries
- Outstanding troubleshooting skills; ability to think critically and display an aptitude for problem solving
- Strong analytical mind with a penchant for process development and enhancement
- A highly positive, can-do attitude and a desire to be a team player
- Great communication skills and ability to explain complex technical concepts to a varied audience
- Demonstrated follow-through, a strong work ethic, and a record of consistently keeping and meeting commitments
- Ability to champion a culture of reliability within the product team, promoting practices like blameless postmortems, SLO tracking, and continuous learning from incidents.
- Ability to read, write, and speak English
- We provide 24x7 support to our customers, so we expect you to take turns with your teammates being on call for weekend production emergencies or to provide rotating weekend operational support
- Candidates who can work primarily during PST hours are preferred
- Travel - Expect occasional travel (less than 5%) to other Guidewire offices for training and team meetings
Guidewire Software, Inc. is proud to be an equal opportunity and affirmative action employer. We are committed to an inclusive workplace, and believe that a diversity of perspectives, abilities, and cultures is key to our success. Qualified applicants will receive consideration without regard to race, color, ancestry, religion, sex, national origin, citizenship, marital status, age, sexual orientation, gender identity, gender expression, veteran status, or disability. All offers are contingent upon passing a criminal history check and other background checks where applicable to the position.