As a Site Reliability Engineer on our team, you will be responsible for the architecture, maintenance, and reliability of our applications. In this role you will work closely with engineering teams across disciplines such as Production Operations, Development, and DevOps. The goal of the SRE team is to aid the adoption of cloud-native solutions, build tools, and automate mundane tasks, which helps us manage our systems and reduce toil.
What You’ll Do:
- Apply hands-on SRE best practices to support continuous improvement, automate deployments, and aid in architectural design
- Collaborate with diverse agile teams (e.g. Infrastructure, Information Security, System/Application Development teams, and external partners/vendors) to assist with the implementation of complex, innovative features
- Identify and diagnose deficiencies related to systems, coding, and infrastructure, and recommend creative solutions for mitigation
- Determine optimal configurations for application software, application servers, and database connections. Configure, tune and troubleshoot multi-tiered systems to achieve optimal application performance, stability and availability
- Manage a highly scalable, high-availability platform, monitoring and maintaining service performance and availability metrics
What You Bring:
- A blend of Development and SRE mindsets, with experience developing operational SLOs and KPIs aligned with the principles of ITIL Incident Management, Problem Management, and Change Management
- Understanding of CI/CD and associated tools with development experience in scripting languages such as Python, Groovy or Go
- Working knowledge of cloud platforms such as Azure and/or GCP, as well as Infrastructure as Code (IaC) tools (Ansible, Terraform) and container orchestration platforms such as Kubernetes/OpenShift
- Experience with monitoring and alerting systems (Nagios, Dynatrace, Prometheus)