Site Reliability Engineer (remote Canada)

Toronto, ON, Canada

Job Description



Why We Need You:
We are looking for an experienced Site Reliability Engineer to join Uptime.com and help us build reliable, robust software solutions for our customers. As a Site Reliability Engineer, you will be responsible for monitoring system performance, troubleshooting technical issues, deploying code changes, and collaborating with other teams to ensure the best possible customer experience. The ideal candidate has extensive experience in cloud infrastructure, distributed systems engineering, scripting, and automation, along with container tooling such as Docker and Kubernetes. You should also have the skills needed to manage service outages and to ensure system availability by writing scalable software.

What You Will Do:

  • Monitor system health metrics to proactively identify potential bottlenecks or errors
  • Develop strategies for resolving performance issues and identify areas of improvement
  • Manage monitoring tools such as Grafana and Prometheus, including deploying them and optimizing their use
  • Deploy releases of applications and services in collaboration with developers
  • Troubleshoot production outages and implement fault tolerance solutions
  • Maintain documentation related to system operation procedures
  • Document and test game-day scenarios
  • Develop and support automation that allows for continuous testing of software created by the team
  • Design and assist in the setup and maintenance of application monitoring and alerting
  • Assist in designing and deploying HA/DR architecture for mission critical workloads
  • Collaborate with other teams to ensure optimal performance of the system and its dependent resources
  • Participate in on-call duty rotation

Requirements

What You Will Need:

  • Bachelor's degree in Computer Science or relevant field preferred
  • 5+ years of experience in SRE/DevOps roles
  • Strong communication skills, with the ability to clearly articulate complex issues and technologies
  • Expertise in Linux server administration and scripting languages (Python)
  • Knowledge of containerization technologies like Docker & Kubernetes
  • Proficiency in a modern programming language such as Go or Python
  • Deep understanding of modern microservices-based architectures and operations
  • Experience with defensive coding practices and patterns for high availability
  • Familiarity with configuration management tools
  • Excellent problem-solving skills and strong collaboration abilities
  • Comfort working in a fast-paced agile environment, where requirements change quickly and the team must constantly adapt to moving targets

Benefits

How We Will Support Your Growth and Success:

  • Partner with executives, leadership, and cross-functional teams, including engineering, marketing, and business operations
  • Professional development opportunities to further skills and knowledge
  • Discover the exciting world of monitoring, observability, and SRE while becoming an advocate and driving innovation in the industry
  • A supportive team of passionate and dedicated individuals all focused on building the best monitoring service in the world.
  • Health Care Plan (Medical, Dental & Vision)
  • Paid Time Off (Vacation, Sick & Public Holidays)
  • Family Leave (Maternity, Paternity)
  • Training & Development
  • Work From Home


Job Detail

  • Job Id
    JD2190485
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type:
    Full Time
  • Salary:
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Toronto, ON, Canada
  • Education
    Not mentioned