Senior Site Reliability Engineer

Montréal, QC, Canada

Job Description

Senior Site Reliability Engineer (SRE):



As a Senior Site Reliability Engineer, you will lead the design, implementation, and maintenance of highly reliable, scalable, and efficient infrastructure and services. You will collaborate closely with development teams to ensure system reliability, performance, and availability while driving automation and operational excellence across the platform.



Primary Responsibilities:

  • Lead the design, deployment, and operation of large-scale, fault-tolerant systems to ensure high availability and performance.
  • Develop and implement automation to streamline deployment, monitoring, and incident response processes.
  • Monitor system health, analyze metrics, and proactively identify and resolve reliability, scalability, and performance issues.
  • Collaborate with software engineering teams to improve system design, deployment pipelines, and operational practices.
  • Manage incident response, conduct root cause analysis, and implement corrective actions to prevent recurrence.
  • Drive continuous improvement in infrastructure efficiency, reliability, and scalability through innovative solutions.
  • Document system architecture, operational procedures, and best practices to support knowledge sharing and operational consistency.
  • Mentor and provide technical leadership to junior SREs and cross-functional teams.
  • Participate in on-call rotations to ensure 24/7 system reliability and rapid incident resolution.
  • Engage with stakeholders to align SRE practices with business goals and technical strategies.


Key Skills and Qualifications:

  • Extensive experience in site reliability engineering, systems engineering, or related roles, typically 5+ years.
  • Strong proficiency with cloud platforms (AWS, Azure, Google Cloud) and with container and orchestration tools (Docker, Kubernetes).
  • Expertise in Linux system administration, networking, and security best practices.
  • Proficiency in programming and scripting languages such as Python, Go, Bash, or similar for automation.
  • Experience with infrastructure as code (Terraform, Ansible, CloudFormation) and CI/CD pipelines.
  • Deep understanding of monitoring, logging, and alerting tools (Prometheus, Grafana, ELK stack).
  • Proven ability to design and maintain scalable, distributed systems and fault-tolerant architectures.
  • Strong problem-solving skills and the ability to handle complex technical challenges independently.
  • Excellent communication skills to collaborate effectively across teams and with external vendors.
  • Familiarity with incident management frameworks, service-level objectives (SLOs), and service-level agreements (SLAs).


Preferred Qualifications:

  • Bachelor's degree in Computer Science, Engineering, or a related technical field.
  • Certifications in cloud technologies (AWS Certified Solutions Architect, Google Professional Cloud Architect, etc.).
  • Experience with financial services, large-scale SaaS platforms, or enterprise IT environments.
  • Knowledge of security compliance and regulatory requirements relevant to infrastructure.


Challenges and Impact:

  • Balancing rapid feature delivery with system reliability and operational stability.
  • Managing complex, multi-platform, geographically distributed environments.
  • Driving automation and efficiency in a constantly evolving technical landscape.
  • Building strong relationships with stakeholders to ensure alignment and seamless service delivery.


LANGUAGE:

French, English


The ability to communicate in English, both orally and in writing, is required because the person in this position will collaborate regularly with colleagues and partners in the United States.






Job Detail

  • Job Id
    JD2430287
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type:
    Contract
  • Salary:
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Montréal, QC, Canada
  • Education
    Not mentioned