•
Managing and optimizing infrastructure across AWS and Azure, with a focus on security, scalability, and cost efficiency
•
Designing and implementing robust monitoring, alerting, and remediation frameworks to achieve industry-leading uptime
•
Automating infrastructure and platform operations with Infrastructure as Code (Terraform), CI/CD pipelines (Azure DevOps, Jenkins), and scripting to reduce manual effort and mean time to recovery (MTTR)
•
Partnering with development, QA, and operations teams to embed reliability and security into service design from code release through production
•
Leading on-call rotations and post-incident reviews with a focus on root-cause analysis and continuous improvement of service health
•
Configuring monitoring and logging solutions such as Prometheus, Grafana, Dynatrace, and Loggly for proactive alerting and visibility
•
Maintaining documentation for systems, standards, troubleshooting steps, and infrastructure operations
•
Mentoring junior SREs, defining team standards, and driving improvements in reliability engineering practices as the platform evolves