Job Description
We are seeking a proactive Site Reliability Engineer (SRE) to drive reliability, performance, and efficiency across our systems and platforms. You'll work closely with Application Development, QA, Product, and Data Engineering teams to champion a DevOps/SRE culture rooted in automation, observability, and continuous improvement.
Key Responsibilities:
- Collaborate cross-functionally to promote SRE and DevSecOps best practices across the organization.
- Build and maintain reliable, scalable systems with a focus on availability, performance, and resiliency.
- Establish and monitor SLOs/SLIs, and develop comprehensive dashboards to support decision-making from both technical and business perspectives.
- Lead efforts to reduce toil through automation, self-healing systems, and advanced monitoring (e.g., synthetic monitoring and real user monitoring (RUM)).
- Apply observability and reliability testing practices across the full lifecycle, from architecture through operations, within Agile and product-based delivery models.
- Drive the adoption of cutting-edge tools in observability, automation, platform engineering, AIOps, and MLOps.
- Contribute to and lead Communities of Practice (CoP) and SRE Office Hours to foster knowledge sharing and continuous improvement.
Qualifications:
SRE & DevOps Expertise:
- Strong experience in observability, toil reduction, incident response, and performance optimization.
- Proficient with monitoring tools such as Dynatrace, CloudWatch, and Azure Monitor.
- Skilled in Infrastructure as Code (IaC), Configuration as Code (CaC), JSON, and scripting with Python, Node.js, Ruby, PowerShell, and Shell.
- Deep understanding of advanced Dynatrace features: DT Guardian, RUM, Synthetic Monitoring, and AI-based event correlation.
Cloud & Automation:
- Expert in AWS Cloud services: CDK, Lambda, CloudWatch, EKS, EC2, ELB, S3, SSM.
- Experience with log ingestion pipelines (AWS Firehose, Dynatrace OpenPipeline) and operational dashboards.
- Hands-on experience with Ansible Tower, AWS SSM, Bitbucket/GitHub, and CI/CD workflows.
Orchestration & Data:
- Familiarity with orchestration tools such as Step Functions and Apache Airflow, and with container platforms.
- Knowledge of data pipelines, data lakes, and databases (Redshift, RDS, Aurora, PostgreSQL, SQL Server, Oracle).
Leadership & Communication:
- Strong problem-solving and knowledge management skills.
- Effective communicator who bridges technical and business teams.
- Collaborative, inclusive leader who builds high-performing teams and fosters a culture of growth and recognition.