The purpose of the position in relation to the company as a whole:
The Site Reliability Engineer (SRE) plays a pivotal role in ensuring the availability, scalability, and security of our production environment. This is a hands-on engineering role that combines software development with operational responsibilities. You will spend a significant portion of your time writing code and implementing architectural improvements that directly enhance performance, reliability, and security. In addition, you will assist with managing and improving our monitoring and alerting systems, contribute to AWS infrastructure management, and implement other changes that strengthen platform security and scalability. You will also collaborate with our support team to assist in investigating and resolving complex issues in high-demand environments.
The role requires a balance of proactive development and operational problem-solving, making it ideal for a senior-level software engineer who is passionate about building scalable systems, applying strong architectural judgment, and driving measurable improvements in platform reliability.
Critical qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
4+ years of experience in an SRE, Software Engineering, or Platform Engineering role.
Strong programming skills in at least one server-side language (Python, Java, Go, PHP, etc.), with the ability to write production-ready code.
Demonstrated experience with scalability, performance tuning, and architectural design.
Strong knowledge of cloud infrastructure (AWS preferred).
Experience with Datadog or similar monitoring and observability tools.
Excellent analytical and problem-solving skills, with the ability to diagnose and resolve issues.
Awareness of, and commitment to, security best practices.
Additional desirable qualifications:
Experience working with performance tuning strategies in high-volume transactional systems.
Familiarity with Infrastructure-as-Code (Terraform, CloudFormation, or CDK) and managing cloud resources in AWS.
Knowledge of serverless environments (Lambda, API Gateway, etc.) and related monitoring/debugging practices.
Experience configuring CI/CD pipelines (e.g., GitHub Actions, Bitbucket Pipelines, or similar) and deployment automation.
Familiarity with security operations tasks, such as IAM permissions reviews, least privilege enforcement, and security group audits.
Familiarity with compliance frameworks such as SOC 2 or PCI DSS.
Absolute minimum years of relevant experience required:
4 years
Does the role include supervisory responsibilities? If yes, provide details:
No
Duties and responsibilities of the role:
Platform Scalability and Development:
Collaborate with engineering leadership to design and implement platform improvements that support long-term growth.
Write high-quality code to improve system stability and efficiency.
Lead initiatives to improve the scalability, reliability, and performance of our platform.
Proactively investigate and track down root causes of performance bottlenecks before they impact customers.
Conduct load and stress testing to validate scalability and identify weak points.
Deliver enhancements, bug fixes, and stability-focused features directly in the platform codebase.
Monitoring & Observability:
Manage and improve monitoring, alerting, and logging systems to ensure early detection of anomalies.
Define and maintain KPIs, SLIs, and SLOs for platform health and user experience.
Incident Response & L3 Support:
Partner with L3 support to investigate and resolve complex production issues.
Collaborate with cross-functional teams (development, security, product) to diagnose and address the root causes of platform incidents.
Lead post-incident reviews and drive remediation strategies to prevent recurrence.
Infrastructure Management:
Actively contribute to the design, implementation, and evolution of AWS-based infrastructure.
Champion automation, infrastructure-as-code, and continuous delivery practices.
Manage capacity planning and tuning efforts to support platform scalability and cost efficiency.
Security:
Assist in implementing and maintaining security controls and best practices across the platform, including access management, data encryption, and network segmentation.
Collaborate with the security team to implement incident response plans and participate in security incident investigations and resolution.
Work environment:
Physical demands of the position:
While performing the duties of this job, the employee may be regularly required to stand, sit, talk, hear, reach, stoop, kneel, and use hands and fingers to operate a computer, telephone, and keyboard.
Job Type: Full-time
Pay: $100,000.00-$120,000.00 per year
Ability to commute/relocate:
Vancouver, BC (V7X): reliably commute or plan to relocate before starting work (preferred)
Experience:
relevant: 4 years (required)
Location:
Vancouver, BC (V7X) (preferred)
Work Location: In person