Site Reliability Engineer Lead

Dropsuite

Location

Bandung

Work type

Hybrid

Salary

Job description

Nice to Meet You! We are Dropsuite, a NinjaOne Company!

As an SRE Lead, you will be responsible for leading a team of Site Reliability Engineers in designing, implementing, and maintaining highly scalable and reliable systems. You will collaborate with cross-functional teams, including software engineers and system administrators, to ensure the seamless operation of critical production services.

Work Arrangement

Full-time position
Hybrid work model (2 days per week in the office)
Monday to Friday, 5-day work week (flexible work schedule)
Must be eligible to reside and work in Bandung (Indonesian citizens only)

Key Accountabilities

Team Leadership

Lead, mentor, and guide a team of Site Reliability Engineers.
Foster a collaborative and innovative team culture.
Provide technical guidance and support for team members across all activities of the SiteOps team.
Be accountable for the work outcomes of the SRE team, including production uptime and optimization projects.
Own operational projects and the Network Operations Centre (NOC). Implement standard operating procedures (SOPs) for the NOC to maximize coverage.

System Architecture and Design

Collaborate with software engineering teams to design and implement scalable and reliable systems.
Participate in code reviews to ensure adherence to DevOps / SRE best practices.
Work closely with system administrators, network engineers, and security teams to ensure a holistic approach to system reliability.
Manage infrastructure configurations and keep them under version control.

Automation and Tooling

Develop and maintain automation scripts and tools to streamline operational tasks.
Develop and maintain automation tools for infrastructure provisioning, configuration management, and deployment (Terraform or Ansible).
Implement monitoring and alerting solutions to proactively identify and address potential issues.
Evaluate, implement, and manage DevOps/SRE-related tools for configuration management, monitoring, and logging.

Incident Management

Lead incident response efforts, ensuring timely resolution of production issues.
Conduct post-incident reviews and implement improvements to prevent future incidents.

Performance Optimization

Analyse system performance and implement optimizations to enhance reliability and efficiency.
Work on capacity planning to accommodate future growth.

Security and Compliance

Work with security teams to implement and enhance security measures in the DevOps pipeline.
Ensure compliance with industry standards and regulatory requirements.
Ensure adherence to established security standards across all environments, working with both the engineering and SiteOps teams.

Documentation

Maintain and update documentation related to system architecture, processes, and best practices.

On-call Support

Participate in an on-call rotation schedule to provide 24/7 support for production systems.

Qualifications

Qualifications and Competencies

Bachelor’s degree in Computer Science, Information Technology, or a related field.
Proven experience as a Site Reliability Engineer / DevOps leader or in a similar role.
In-depth knowledge of cloud computing platforms (e.g., AWS, Azure, GCP).
Strong leadership and communication skills.
Proficiency in programming/scripting languages (e.g., Python, Shell, Ruby).
Proficiency in system administration of production servers.
Experience with containers and container orchestration tools (e.g., Docker, Kubernetes).
Familiarity with infrastructure as code (e.g., Terraform, Ansible).
Expertise in build and release strategies, with the ability to implement the right strategy for the team.
Expertise in monitoring and logging tools (e.g., Datadog, Zabbix, Prometheus, Grafana).
Solid understanding of networking concepts and protocols.
Understanding of security best practices in DevOps processes.
Excellent problem-solving and communication skills.
Ability to work independently and collaboratively in a team environment.

Skills

Infrastructure & DevOps
DevOps Tools
DevOps & Cloud
DevOps
Cloud & DevOps
