Site Reliability Engineer Lead
Dropsuite
Location
Bandung
Work type
Hybrid
Salary
-
Job description
Nice to Meet You! We are Dropsuite, a NinjaOne Company!
As an SRE Lead, you will be responsible for leading a team of Site Reliability Engineers in designing, implementing, and maintaining highly scalable and reliable systems. You will collaborate with cross-functional teams, including software engineers and system administrators, to ensure the seamless operation of critical production services.
Work Arrangement
- Full-time position
- Hybrid work model (2 days per week in the office)
- Monday to Friday, 5-day work week (flexible work schedule)
- Must be eligible to reside and work in Bandung (Indonesian citizens only)
Key Accountabilities
Team Leadership
- Lead, mentor, and guide a team of Site Reliability Engineers.
- Foster a collaborative and innovative team culture.
- Provide technical guidance and support for team members across all activities of the SiteOps team.
- Be accountable for the work outcomes of the SRE team, including production uptime and optimization projects.
- Own operational projects and the Network Operations Centre (NOC). Implement SOPs for the NOC to maximize coverage.
System Architecture and Design
- Collaborate with software engineering teams to design and implement scalable and reliable systems.
- Participate in code reviews to ensure adherence to DevOps/SRE best practices.
- Work closely with system administrators, network engineers, and security teams to ensure a holistic approach to system reliability.
- Manage infrastructure configurations and keep them under version control.
Automation and Tooling
- Develop and maintain automation scripts and tools to streamline operational tasks.
- Develop and maintain automation tools for infrastructure provisioning, configuration management, and deployment (Terraform or Ansible).
- Implement monitoring and alerting solutions to proactively identify and address potential issues.
- Evaluate, implement, and manage DevOps/SRE-related tools for configuration management, monitoring, and logging.
Incident Management
- Lead incident response efforts, ensuring timely resolution of production issues.
- Conduct post-incident reviews and implement improvements to prevent future incidents.
Performance Optimization
- Analyse system performance and implement optimizations to enhance reliability and efficiency.
- Work on capacity planning to accommodate future growth.
Security and Compliance
- Work with security teams to implement and enhance security measures in the DevOps pipeline.
- Ensure compliance with industry standards and regulatory requirements.
- Ensure adherence to established security standards across all environments, working with both the engineering and SiteOps teams.
Documentation
- Maintain and update documentation related to system architecture, processes, and best practices.
On-call Support
- Participate in an on-call rotation schedule to provide 24/7 support for production systems.
Qualifications
Qualifications and Competencies
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- Proven experience as a Site Reliability Engineer / DevOps leader or in a similar role.
- In-depth knowledge of cloud computing platforms (e.g., AWS, Azure, GCP).
- Strong leadership and communication skills.
- Proficiency in programming/scripting languages (e.g., Python, Shell, Ruby).
- Proficiency in system administration of production servers.
- Experience with containerization and container orchestration tools (e.g., Docker, Kubernetes).
- Familiarity with infrastructure as code (e.g., Terraform, Ansible).
- Expertise in build and release strategies, with the ability to implement the right strategy for the team.
- Expertise in monitoring and logging tools (e.g., Datadog, Zabbix, Prometheus, Grafana).
- Solid understanding of networking concepts and protocols.
- Understanding of security best practices in DevOps processes.
- Excellent problem-solving and communication skills.
- Ability to work independently and collaboratively in a team environment.