Site Reliability Engineer

Remote, CA, Canada

Job Description

About Rentsync:

Rentsync is a fast-growing company offering robust software solutions for the multifamily housing industry. Our platforms (Rentsync, Rentals.ca Network, and more) help property management companies streamline operations, improve tenant experience, and reach residents across Canada.

About the Role:

We're looking for a Site Reliability Engineer (SRE) to own the reliability, performance, and scalability of our production systems. You'll work across teams to ensure our platforms run smoothly, securely, and with deep visibility. As an SRE, you will:

o Operate & right-size live applications - measure performance, tune capacity, and optimize cloud spend without sacrificing user experience.
o Lead incidents - act as incident commander, coordinate speedy resolution, run blameless post-mortems, and ensure follow-through on action items.
o Build production-grade observability - architect and maintain high-availability clusters of Prometheus/Mimir, Loki, Tempo, and Grafana to give engineers actionable insights.
o Integrate intelligent alerting - design low-noise, high-signal alert policies and on-call rotations in PagerDuty.
o Define reliability targets - partner with product teams to establish SLOs, SLIs, and error budgets that balance velocity with stability.
o Automate everything - codify infrastructure with Terraform and Kubernetes, embed tests into CI/CD, and continuously remove manual toil.
o Champion security & compliance - weave secrets management, IAM least-privilege, and vulnerability scanning into every layer of the stack.
o Improve system documentation - ensure key processes and services are clearly documented, up to date, and accessible.

Responsibilities:

o Maintain and optimize cloud infrastructure on AWS & GCP, ensuring performance, scalability, and cost-efficiency.
o Deploy, scale, and upgrade observability platforms (Prometheus/Mimir, Loki, Tempo, Grafana) in multi-cluster environments.
o Own on-call, incident response, and PagerDuty administration, driving MTTR and MTTD improvements.
o Collaborate with development teams to troubleshoot live issues, implement performance fixes, and guide capacity planning.
o Automate provisioning and configuration with Terraform and Ansible, maintaining high infrastructure-as-code coverage.
o Perform patching, upgrades, and vulnerability management as part of a secure infrastructure lifecycle.
o Document runbooks, processes, post-mortems, and architecture decisions to foster knowledge-sharing across teams.
o Collaborate with the entire team on Kubernetes deployments and configuration.
o Champion best practices and coach engineers on resilient design, capacity planning, and failure-mode analysis.
o Collaborate on infrastructure migration projects across various products and systems.

Essential Qualifications:

o 3+ years in Site Reliability, DevOps, or Production Engineering supporting high-traffic web apps.
o Proven experience running large-scale Prometheus/Loki (or Cortex/Mimir) observability stacks.
o Hands-on PagerDuty setup and on-call experience.
o Proficiency with AWS or GCP, Kubernetes (EKS/GKE), and Terraform.
o Strong Linux, networking, and container runtime fundamentals.
o Ability to write reliable automation in Bash, Python, or JavaScript.
o Excellent communication skills for cross-team collaboration and incident coordination.

Additional Preferred Qualifications:

o Cloudflare WAF/Workers/Zero Trust exposure.
o Database operations for MySQL/Postgres, and caching layers such as Redis or Memcached.
o Cost-optimisation and capacity-planning experience at scale.
o Software development experience.
o Experience supporting many different tech stacks across different teams.
o Familiarity with SIEM platforms like Splunk/Elastic Security.

Technologies You'll Work With:

AWS & GCP (EKS/GKE, EC2, RDS, S3, ALB/NLB, CloudWatch), Terraform & Ansible, GitHub Actions/GitLab CI, PagerDuty, LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus), OpenTelemetry, Zabbix & Opsgenie, Cloudflare, Ubuntu & Amazon Linux 2, Kubernetes, MySQL & PostgreSQL, Redis & Memcached, NGINX & Traefik, Bash, Python, TypeScript, PHP, Rust.



Rentsync is an equal opportunity employer. If you are selected to participate in the interview process and require unique accommodations, please don't hesitate to let us know.

Successful candidates may be required to complete a criminal background check in the final phase of the interview process.


Job Detail

  • Job Id
    JD2516129
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type:
    Full Time
  • Salary:
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Remote, CA, Canada
  • Education
    Not mentioned