Site Reliability Engineer (SRE) (11844)

Location: Vancouver, BC, Canada
Date Posted: 10-05-2018
Description:

We are seeking a Site Reliability Engineer (SRE) who excels at exploring, planning, and executing in a fast-paced, technical environment. You derive deep satisfaction from developing elegant and robust automated systems that remove complexity and toil from all things operational. We need your support to help our client ensure that customer-facing systems have the utmost reliability while balancing the risk tolerance required to motivate change and innovation. You will also be a key influencer on operational strategy and an advocate for enterprise-level change, with a passion for digital transformation.

You embrace an engineering mindset in your approach to problem-solving. You bring your own creative solutions to operational constraints. Much of your development focus will be on optimizing or changing existing systems while eliminating manual work through software engineering. You will support the operations department with problem management and update processes to complement the solutions you design. We need creative thinkers and gritty executors who are responsible for understanding the complex systems architecture of a telecom, while using a breadth of tools and approaches to solve a broad spectrum of issues. Help us limit the time spent on operational work, find opportunities to enhance system quality, and promote safe increases in feature velocity.

Additionally, you will be an ambassador of technology with a penchant for complex solutions architecture, demonstrating your advanced skills on our blended platforms.

We urge you to think big and take risks. We promote self-direction and organization, while providing you the support and mentorship needed to thrive.

What you’ll be responsible for: 
Engage in and improve the entire lifecycle of services—from inception and design, through deployment, operation and refinement 

Develop self-service tooling so that our development teams are not bottlenecked by operations support

Identify problems in critical services, develop automated processes for eliminating future occurrences where possible, and propose changes to existing configurations to form the base for automation 

Support services before they go live through activities such as system design consulting and launch reviews 

Maintain services once they are live by considering all aspects of supportability, reliability and performance 

Practice sustainable incident response and blameless postmortems to drive continuous improvement 

Assist with migrating complex, multi-tier applications to cloud environments 

Design and deploy enterprise-wide scalable operations on mixed architectures 

What you’ll need to be successful: 
Proven ability to write programs in JavaScript

Experience running large, diverse architectures with configuration management systems such as Ansible (preferred), Puppet, Chef, or Salt

Deep understanding of the Linux Operating System, including Kernel, Memory, Process, Threads, Static / Shared Libraries, IPC, Signals 

Understanding of standard networking protocols and components such as: HTTP, DNS, TCP/IP, ICMP, the OSI Model, Subnetting and Load Balancing 

Familiarity with distributed systems paradigms such as the CAP Theorem, Microservices, and the Twelve Factor App 

Familiarity with the Operational Excellence pillar of the AWS Well-Architected Framework

Passion for eliminating repetitive manual processes using automation 

Systematic problem-solving approach, coupled with strong communication skills, ownership and drive 

Ability to debug and optimize code and automate routine tasks 

Experience with continuous integration and continuous delivery pipelines 

Interest in designing, analyzing and optimizing large-scale distributed systems 

Able to design and implement simple, secure solutions for complex problems in distributed systems 

Strong sense of ownership, customer service, and integrity proven through clear communication 

Bachelor's or Master's degree, or equivalent experience

Must-have skills:
1. Experience running large, diverse architectures with configuration management systems such as Ansible (preferred), Puppet, Chef, or Salt, and proven ability to write programs in JavaScript
2. Deep understanding of the Linux Operating System, including Kernel, Memory, Process, Threads, Static / Shared Libraries, IPC, Signals, along with an understanding of standard networking protocols and components such as: HTTP, DNS, TCP/IP, ICMP, the OSI Model, Subnetting and Load Balancing
3. Familiarity with distributed systems paradigms such as the CAP Theorem, Microservices, and the Twelve Factor App, as well as the Operational Excellence pillar of the AWS Well-Architected Framework

Nice-to-have skills:
1. Interest in designing, analyzing and optimizing large-scale distributed systems
2. Digital experience