Role: Senior Manager, Cloud Infrastructure & DevOps
Reports to: Chief Technology Officer (CTO)
Company: Coconut Software, https://www.coconutsoftware.com/
Location: REMOTE - can be located anywhere in Canada
Type: Full Time / Permanent
We are looking for an experienced Infrastructure & DevOps Manager to lead our Cloud Infrastructure, SRE (Site Reliability Engineering) & DevOps Team. We’re looking for a strong DevOps & Infrastructure professional who has experience leading and managing a team, coaching and delegating to direct reports, and has the ability to develop and own the strategic plan for infrastructure. This position is about optimizing efficiency and collaborating with the engineering team to ensure the development engine runs smoothly. You enjoy being a player/coach at a high-growth, fast-paced company.
As the Senior DevOps Manager, you will bring a breadth of technical knowledge, experience defining and implementing CloudOps, and knowledge of DevOps and SRE best practices. You will be responsible for managing and optimizing Coconut’s DevOps and infrastructure (based on AWS, Kubernetes, and Terraform) and building tools to automate the management of a stable, efficient, observable, and resilient technology environment.
Reporting to the CTO, this role functions with a high level of autonomy. It requires a proactive individual who loves to dive deep into projects while understanding the importance of open and clear communication within and between teams.
Responsibilities:
Team Leadership:
- Lead, mentor, coach and inspire a team of DevOps, Infrastructure and SRE professionals
- Foster a collaborative and high-performance work environment
- Hire and train team members
- Own the Infrastructure and DevOps roadmap
Infrastructure Management:
- Design, implement, and manage a cloud-based infrastructure ecosystem for scalability and reliability
- Implement best practices for infrastructure as code (IaC) and configuration management
- Work closely with development teams to implement new tools and ensure a manageable migration into a secure and reliable product environment
- Set strategic plans and priorities for this function
- Participate in escalations as part of our "On-Call" team rotation roster
Automation and CI/CD:
- Develop and maintain automated deployment pipelines
- Promote continuous integration and delivery (CI/CD) practices
Monitoring and Alerting:
- Implement robust monitoring to proactively identify and resolve issues
- Configure and manage alerting systems for real-time status and incident response
- Design and develop automation and processes to enable teams to deploy, manage, configure, scale and monitor applications
Reliability Engineering:
- Define and measure service level objectives (SLOs) and service level indicators (SLIs) to ensure system reliability
- Lead incident response and post-incident reviews to improve system resilience
- Develop innovative and technical tooling to improve production stability and enable faster recovery
Security and Compliance:
- Collaborate with security teams to implement best practices for securing infrastructure and applications
- Ensure compliance with industry standards and regulations
Resource Optimization:
- Optimize resource utilization to reduce costs while maintaining performance and reliability
- Monitor & report on hosting & tooling costs
Documentation and Training:
- Maintain comprehensive documentation of systems, processes, and procedures
- Provide training and knowledge sharing within the team
Requirements:
- Bachelor’s degree in Computer Science, Information Technology, or a related field
- Proven experience managing/leading a DevOps/SRE/Infrastructure team in a fast-paced environment
- Strong expertise in cloud platforms and infrastructure management, preferably AWS and Kubernetes
- Experience with provisioning, vendor management, and monitoring resources in a cloud-based environment
- Experience configuring and managing data sources, such as MySQL, Postgres, Redis
- System configuration experience with automation tools such as Puppet, Chef, Ansible
- Proficiency in automation and CI/CD tools such as Spinnaker, CircleCI, Travis CI, or GitLab CI/CD
- Experience with containerization and orchestration techniques and tools (e.g., Docker, Kubernetes)
- Experience with infrastructure as code tools, such as HashiCorp Terraform or AWS CloudFormation
- Expertise leading & analyzing complex application, database,