Sr. Engineering Manager - SRE
About Us:
Tata Digital is a future-ready company that focuses on creating consumer-centric, high-engagement digital products. By creating a holistic presence across various touchpoints, we aim to be the trusted partner of every consumer and delight them by powering a rewarding life. The company's debut offering, Tata Neu, is a super-app that provides an integrated rewards experience across various consumer categories like groceries, fashion and electronics, travel and hospitality, health and fitness, entertainment, and financial services on a single platform. Founded in March 2019, Tata Digital Private Limited is a wholly owned subsidiary of Tata Sons Private Limited.
Our Culture:
We cultivate a culture of innovation and inclusion for all employees, and we respect their individual strengths, views, and experiences. We thrive on the diversity of our talent in all forms and see it as a strength in building high-performance teams across brands. As we rewrite commerce in India, change is the only constant in our day-to-day lives.
Role Overview:
We are looking for a Sr. Engineering Manager - SRE to oversee the stability, scalability, and delivery of our production environment, leveraging software engineering principles and automation to improve cloud infrastructure management and reduce operational costs. This role will play a key part in transitioning from manual processes to automated solutions by leading our current DevOps teams:
- Cloud Infra Lifecycle Management Team: Focused on automated provisioning, capacity planning, and maintenance across all cloud platforms for production applications.
- Cloud Infra Support Team: Responsible for supporting internal users with production and development environment requests, with a long-term goal of eliminating manual intervention through automation.
This role is ideal for a leader with a deep understanding of Azure cloud environments, SRE best practices, and a strong background in building automation-first operational models.
Key Responsibilities:
Stability, Scalability & Availability:
- Lead the design and implementation of strategies to ensure high availability, reliability, and performance of production systems.
- Apply lifecycle management techniques, including monitoring, capacity planning, and automated scaling, to cloud environments.
- Establish Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for critical applications.
Cloud Lifecycle Management:
- Oversee the Cloud Infra Lifecycle Management Team to build scalable, automated cloud provisioning workflows and optimize capacity.
- Implement infrastructure-as-code (IaC) practices using tools like Terraform, PowerShell, and Azure Resource Manager (ARM) templates.
- Ensure efficient cloud resource utilization and cost management strategies.
Cloud Support Operations:
- Manage the Cloud Infra Support Team responsible for handling internal user requests related to production and development environments.
- Develop efficient workflows for incident response and request resolution, with automation as the default approach.
- Work towards eliminating the need for manual support teams by creating self-service solutions for internal users.
Automation & Transformation:
- Lead the transition of manual processes to cloud automation through training, upskilling, and process reengineering.
- Champion the use of automation to handle repetitive operational tasks, including monitoring, remediation, and deployments.
- Foster a "first principles thinking" culture focused on engineering excellence and process simplification.
Monitoring & Incident Response:
- Build robust monitoring systems using Azure Monitor, Log Analytics, and Application Insights for proactive performance management.
- Oversee incident response processes, ensuring rapid recovery and root cause analysis for production disruptions.
- Implement disaster recovery and high-availability strategies across environments.
Security & Compliance:
- Ensure all environments follow cloud security best practices, regulatory compliance requirements, and corporate governance policies.
- Manage identity and access controls, network security, and risk mitigation strategies.
Continuous Improvement:
- Drive ongoing improvements in system resilience, operational efficiency, and service quality through automation and best practices.
- Conduct regular system performance reviews and capacity planning exercises to maintain optimal system health.
Team Leadership & Development:
- Provide coaching and mentorship to the SRE team, fostering a culture of continuous learning and technical excellence.
- Lead efforts to upskill the team in cloud scripting, automation development, and site reliability best practices.
Reporting & Metrics:
- Maintain detailed operational documentation and generate regular reports on system performance, reliability improvements, and cost efficiency efforts.
Basic Qualifications:
- 10+ years of experience in cloud operations or SRE, with a strong focus on Azure environments.
- Extensive experience in managing and optimizing Azure services like Virtual Machines, App Services, SQL Database, Networking, and Storage.
- Hands-on expertise with cloud automation and IaC tools (Terraform, PowerShell, ARM templates, or Azure Automation).
- Strong understanding of SRE principles, including error budgets, SLOs, SLIs, and incident management practices.
- Proficiency with Azure DevOps and CI/CD pipeline management.
- Expertise in cloud cost management and optimization.
- Familiarity with monitoring, logging, and observability tools (e.g., Azure Monitor, Log Analytics, Security Center).
- Knowledge of Azure security practices, including identity and access management, firewalls, and compliance requirements.
Preferred Qualifications:
- Microsoft Certified: Azure Solutions Architect Expert or Azure Administrator Associate.
- Experience managing hybrid or multi-cloud environments.
- Experience implementing self-service workflows and internal user support automation.
Soft Skills:
- Strong leadership and team management abilities.
- Excellent communication and client engagement skills.
- Analytical mindset with a proactive approach to problem-solving.
- Ability to handle high-pressure situations with professionalism.