Responsibilities:
- Build fully automated cloud infrastructure using Terraform
- Design self-healing and auto-scaling systems to reduce downtime and operational overhead
- Support and unblock development teams by resolving pipeline and Helm issues in a timely manner
- Collaborate with the CCoE and Kong teams to troubleshoot and resolve issues, ensuring timely restoration of services and minimal disruption.
- Conduct training sessions to help developers analyze alerts and effectively monitor their workloads
- Collaborate with cross-functional teams to design, configure, and maintain network policies that enable secure and reliable connectivity to on-premises services
Innovation
- Adopt policy-as-code frameworks (e.g., OPA) to automate GCP compliance checks and governance enforcement.
- Implement serverless architecture patterns (e.g., Cloud Functions) to modernize legacy workflows and reduce operational overhead.
- Conduct technical proofs of concept (POCs), evaluate and recommend suitable GCP products, and enable development teams to implement cloud-native solutions.
Cost Optimization & Governance
- Decommission unused AWS resources and migrate DataHub export workflows from Amazon S3 to Google Cloud Storage, resulting in optimized storage costs and a more unified data infrastructure.
- Decommission legacy GCP projects (casa-preproduction & casa-production) by migrating data lake exports, streamlining resource management and eliminating redundant costs.
Business Impact
Black Friday Readiness:
- Drive load testing efforts to replicate peak traffic scenarios, analyze alerts and performance metrics, and tailor infrastructure configurations to ensure optimal scalability and reliability
Resource Optimization:
- Proactively manage workloads by right-sizing memory, CPU, and replica configurations to ensure high availability and performance under peak traffic conditions.
- Configure cluster autoscaling to dynamically adjust infrastructure capacity based on demand, improving efficiency and reducing operational overhead.
Observability Enhancements:
- Leverage Dynatrace Synthetics and user session metrics to monitor real-user performance and identify potential bottlenecks.
- Implement custom alert configurations to detect anomalies and reduce mean time to resolution (MTTR).
GCP Monitoring & Alerting:
- Develop custom log-based alerts integrated with Slack and ITSM tools, enabling automated incident creation and streamlined response workflows.
Security & Compliance
- Architect secure environments leveraging IAM policies, VPC configurations, and encryption standards in alignment with industry regulations
- Build and maintain centralized GitHub Actions pipelines, enabling smooth, compliant production deployments.
- Conduct security reviews and integrate DevSecOps practices into CI/CD workflows to shift left on security.