We are seeking a Site Reliability Engineer Team Lead to oversee and drive the availability, performance, and scalability of our systems and services. In this leadership role, you will guide and mentor a team of SREs while collaborating closely with development teams to design and implement robust, reliable solutions. You will lead efforts to enhance our automation infrastructure, spearhead automation initiatives that streamline operations, and continuously optimize system performance to meet evolving needs. You will also play a crucial role in strategic planning to ensure our infrastructure can support growth and adapt to changing technological demands.
Role:
Lead a team of SREs in maintaining and enhancing system reliability and performance. Develop strategic plans to meet and exceed established SLAs and SLOs, and drive initiatives that align with business objectives.
Oversee the implementation and optimization of monitoring systems to detect anomalies and deviations in real time. Ensure the team continuously reviews metrics and trends to address emerging issues before they affect users.
Champion the development and implementation of automation tools and processes. Drive efforts to improve operational efficiency, minimize manual intervention, and eliminate repetitive tasks across the team.
Lead capacity planning and performance tuning initiatives. Oversee resource utilization monitoring and work with your team to forecast future needs so that systems can handle anticipated loads.
Collaborate closely with development teams to embed reliability best practices into system design and feature implementation. Provide expert guidance on system architecture, deployment strategies, and reliability engineering principles.
Identify and spearhead opportunities for system and process improvements. Promote initiatives that enhance system reliability, scalability, and performance, ensuring that the team is always pushing the boundaries of excellence.
Requirements:
5+ years of experience in Site Reliability Engineering, DevOps, or related roles, with a proven track record of leading teams and projects.
Deep proficiency in AWS, Kubernetes, and scripting languages (e.g., Python, Bash). Extensive experience with system administration, configuration management tools (e.g., Ansible, Puppet, Chef), and monitoring/logging tools (e.g., Prometheus, Grafana, ELK stack).
Strong understanding of incident management processes and best practices, with experience leading incident response and resolution efforts.
Expertise in automation tools and practices for deployment and infrastructure management. Demonstrated ability to implement and advocate for effective automation strategies.
Exceptional communication and collaboration skills. Proven ability to lead, mentor, and work effectively within a team environment, driving a culture of teamwork and continuous learning.
Strong analytical and problem-solving abilities. Capable of troubleshooting complex issues and guiding the team through resolution.
Preferred Qualifications:
Relevant certifications such as AWS Certified Solutions Architect or Google Professional Data Engineer.
Familiarity with advanced topics like distributed systems, microservices architecture, and network protocols.
This position is open to all candidates.












