Lead Site Reliability Engineer (SRE)

Remote, CA, Canada

Job Description

At EPAM, we're not just building software -- we're engineering excellence.


We're looking for a Lead Site Reliability Engineer (SRE) with a passion for performance, precision, and proactive problem-solving to join a high-impact team supporting a leading sell-side trading environment.


This role is ideal for someone who thrives in fast-paced financial systems, has a passion for working with data and monitoring tools, and wants to shape the reliability and efficiency of next-generation trading platforms.


The Site Reliability Engineer will focus on ensuring stable connectivity to external partners within a SaaS environment. The ideal candidate will have expertise in financial systems, especially within trading ecosystems, and the ability to proactively drive performance enhancements and improve data usage and analysis. By identifying areas of opportunity, they will help deliver improved service and systems for end users.


Additionally, the candidate will help proactively identify system issues, implement changes and resolutions, and ensure the stability of business-critical applications. They will collaborate to build actionable plans, execute strategies, and lead initiatives to enhance system reliability.

Responsibilities



  • Provide a strategic vision for trading portfolio performance, covering network connectivity, traffic throughput, and applications
  • Define, configure, and set up alerting and monitoring frameworks for critical applications
  • Monitor application and platform performance using APM and monitoring tools to diagnose and resolve performance issues
  • Collaborate with Azure Cloud environments and contribute to a 24x7x365 support team to diagnose and address system challenges
  • Assess environmental and incident priorities, investigate issues swiftly, and execute efficient resolutions
  • Troubleshoot mission-critical systems and implement preventative problem management solutions
  • Lead on promoting observability, scalability, and resiliency best practices across development and operations teams
  • Analyze, design, and implement solutions to meet application performance and reliability goals
  • Collaborate with cross-functional teams to ensure smooth and unified troubleshooting and resolution processes across departments
  • Craft and maintain SLA/SLO dashboards to monitor system health and performance
  • Define and maintain SLIs, SLOs, and error budgets for applications and infrastructure to drive service improvement
  • Automate operational processes to enhance service offerings and system reliability

Requirements



  • 5+ years of experience in site reliability engineering, production support, or related roles in fast-paced environments
  • Showcase of leadership or mentoring experience (minimum of 1 year) in guiding cross-functional teams on system reliability
  • Knowledge of monitoring and observability tools such as AppDynamics, New Relic, Prometheus, or Grafana
  • Background in Azure Cloud services, CI/CD pipelines, and container orchestration (Kubernetes or Docker)
  • Proficiency in scripting with Python, Bash, or PowerShell for automation and efficiency gains
  • Understanding of network protocols (TCP/IP, DNS, HTTP) and troubleshooting tools such as Wireshark or tcpdump
  • Capability to analyze complex system issues and performance bottlenecks using APM and log analysis
  • Familiarity with implementing SLA/SLO metrics and monitoring for production systems
  • Combined skills in high-availability systems and database performance optimization

Nice to have



  • Expertise in SaaS solutions and APIs with a focus on handling external trading partners
  • Knowledge of disaster recovery strategies and business continuity planning
  • Background in trading platforms or buy-side/sell-side financial environments

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our clients, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Engineer the Future with a Career at EPAM


EPAM Canada welcomes and encourages applications from candidates with disabilities. Please contact WFA Human Resource CA WFAHRCA@epam.com if you have questions in this regard, or if you require an accommodation to complete the application process. Click here to review EPAM's Accessibility for Ontarians with Disabilities Accessibility Policies and Multi-Year Access.

Job Detail

  • Job Id
    JD3045714
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Remote, CA, Canada
  • Education
    Not mentioned