Production Management | Level 3 (cad) Rsr Msp

Montreal, QC, Canada

Job Description

Job Title: Site Reliability Engineer
Experience Level: Level 3 (senior): 5-7 years
Location: Montreal (Day 1 onboarding onsite / in-office presence 3x per week)
The client is seeking a Site Reliability Engineer to drive reliability engineering, operational support, and customer consultation services for several key products. The team is part of the Application Infrastructure organization and is responsible for shaping the SDLC within the client by implementing the tools, systems, and processes used by 17,000+ developers in the Firm for software development and deployment.
Job Responsibilities

  • Building and maintaining front-to-back knowledge of the development environment
  • Maximizing the availability and performance of supported systems through optimized and automated plant management, ongoing problem management, and architecture reviews with dev-side peers
  • Reduction of the cost of support (hours of effort) through the elimination of operational issues, optimization and automation of tasks, development of operational tools and driving client self-service to minimize constraints
  • Identification and prioritization of technical debt that is impacting client (i.e. software developers) productivity, system reliability or the efficiency of the Ops team
  • Collaboration with other SREs in Application Infrastructure to share solutions
  • Complex troubleshooting of front-to-back development environment issues
  • Maximizing Ops team product knowledge and support capabilities to minimize the escalation rate to the department's feature engineers/developers
  • Consulting with clients (the Firm's development community) to maximize productivity, including troubleshooting their issues with using MSDE's solutions
  • Experimentation with new tools and techniques
  • Being operationally responsive, including sharing on-call rotation with the rest of the global team (with a time-off in lieu system)
Required Qualifications / Skills
  • Strong Linux troubleshooting skills
  • Task automation experience in any programming language, preferably Python
  • Practical experience of implementing monitoring / observability solutions using Prometheus and Grafana
  • Experience using version control (Bitbucket, GitHub), issue tracking (Jira), continuous integration (Jenkins, Azure DevOps, GitHub Actions), automated testing, or deployment automation
  • Excellent communication skills to work with peers / third-party vendors
  • Confident collaboration skills
Desired Skills
  • Experience with site reliability engineering practices, such as service level objectives (SLOs), error budgets, blameless postmortems, and toil reduction
  • Experience with Docker / Kubernetes
EEO Employer
Minorities/ Females/ Disabled/ Veterans/ Gender Identity/ Sexual Orientation


Job Detail

  • Job Id
    JD2552459
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Montreal, QC, Canada
  • Education
    Not mentioned