Role Overview
The Data Platform team is looking for an experienced Site Reliability Engineer to help us develop a state-of-the-art SRE function. The function will serve as a cornerstone for the growth, reliability, resiliency, availability, and scaling of data onboarding, processing, and analytics. This is a new team, which provides the opportunity to innovate and develop a greenfield product within an established, top-notch engineering organization. The ideal candidate will have strong software engineering and SRE skills, a passion for automation, evangelism, and mentorship, and an entrepreneurial spirit.
Responsibilities
· Help design, develop, and evolve a highly scalable and reliable infrastructure foundation for the application teams
· Automate application and infrastructure deployments by developing and maintaining CI/CD pipelines
· Engineer solutions to significantly reduce the number of issues in production and troubleshoot time-sensitive production issues
· Perform stress tests and disaster recovery (DR) tests, and enable seamless failovers, to prove production readiness
· Work closely with the Operations team to resolve issues and identify opportunities to improve the customer experience
Technology Stack
· Infrastructure: AWS, Azure, Kubernetes and on-prem virtualization platform
· CI/CD: Jenkins, TeamCity, Octopus, Azure DevOps
· Infrastructure as Code: Terraform
· OS: Linux and Windows
· Monitoring/Logging: CloudWatch, Prometheus, Splunk, Grafana, OpenTelemetry
· Programming Languages: Python, C++, Rust, JavaScript
Qualifications
· 3+ years of hands-on experience architecting and implementing automation pipelines, monitoring solutions and Infrastructure as Code across an organization
· 3+ years of experience working with immutable infrastructure and automation, using tools like Terraform (or AWS CloudFormation) to deploy a complete infrastructure stack
· 3+ years of experience deploying and managing containerized applications in a multi-tenant Kubernetes environment
· 2+ years of experience implementing observability patterns/frameworks and using monitoring tools like Prometheus, Grafana, and Datadog
· Strong knowledge of CI/CD tools such as Jenkins, Spinnaker, and Azure DevOps
· Strong knowledge of Chaos Engineering, containerization technologies (Podman), and configuration management tools (Ansible/Chef/Puppet)
· Experience using managed Kubernetes platforms like EKS, AKS, or GKE
· Experience with big data systems (e.g. Hadoop, Spark, Snowflake, K8s)