Site Reliability Engineer

Dallas, Texas


Employer: Apex Systems
Salary: Competitive
Job type: Part-Time

Demonstrates extensive abilities and/or a proven record of success in the following areas:

Providing SRE support for multiple distributed software applications (client-facing - internal & external);

Managing and continually improving platform infrastructure and applications for high reliability, resiliency, performance, and quality, and for faster time-to-market, taking a holistic view of system health into account;

Gathering and analyzing metrics from both systems and applications for performance tuning and fault finding;

Partnering with development teams to improve services through rigorous testing and release procedures meeting security, compliance & performance requirements;

Participating in systems design, platform management, and capacity planning, ensuring that platforms are designed with "operability" in mind;

Pursuing the discovery of system faults throughout the application lifecycle - before & after release;

Defining, implementing, and being accountable for velocity and reliability (SLIs, SLOs, error budgets);

Creating and supporting sustainable systems and services through automation (automation that eliminates problems, not automation for its own sake) and uplifts for infrastructure, testing, failover solutions, failure mitigation, etc.;

Writing, updating, and using documentation, including runbooks/playbooks; and,

Using Chaos Engineering to test the robustness of the systems and applications.

Qualifications

5+ years professional experience with various flavors of Linux and/or Windows

5+ years experience supporting and troubleshooting full-stack applications (monolithic and microservices), infrastructure, and legacy applications, including root cause analysis (identifying, analyzing, and remediating service performance and availability issues to maximize uptime)

5+ years experience with cloud computing technology and its concepts (Azure, AWS, GCP)

3+ years experience in balancing service reliability, metrics, sustainability, technical debt, and operational toil for live services running at scale

3+ years experience with container technologies and orchestration (Docker; Kubernetes on AKS, EKS, or GKE)

3+ years experience implementing DevOps practices at scale

Demonstrates extensive abilities and/or a proven record of success in the following areas:

Experience in one or more of the following: Go, Python, Ruby, Java, Perl, Shell, or PowerShell;

Experience with the CI/CD toolchain: Git, Jenkins, Azure DevOps, Veracode, SonarQube, JFrog Artifactory;

Experience with IaC using Terraform, ARM templates, and/or AWS CloudFormation templates;

Experience with configuration management tools like Ansible, Puppet and/or Chef;

Experience with DBaaS/managed cloud database technologies such as Cosmos DB, DynamoDB, managed SQL (RDS, Azure SQL Database), and in-memory stores (Azure Cache for Redis, ElastiCache);

Experience with application performance monitoring tools (AppDynamics, Azure Application Insights, Dynatrace, or Datadog) and log management tools (Azure Monitor Log Analytics, Elastic Stack, and/or Splunk), including defining, creating, and configuring metrics for dashboards and alerts;

Experience with distributed storage technologies like Azure Storage (Blob, Files, Tables), S3, NFS, HDFS;

Experience with web server technologies: HTTP, Nginx, Apache, Tomcat;

Experience with Kafka, Azure Event Hubs, or similar message queue technologies;

Experience with service mesh platforms such as Istio and HashiCorp Consul;

Experience with secrets lifecycle management (Azure Key Vault, HashiCorp Vault);

Experience with minimal or near-zero-downtime deployments such as blue-green, canary, rolling upgrades, etc.;

Experience defining and implementing HA, DR, and rollback strategies together with the product and build teams;

Proficiency in networking concepts (HTTP/S, TCP/IP, DNS, virtual networks (VNet, VPC), subnets, routing, firewalls, network security, triaging packet loss, etc.) and knowledge of RESTful APIs;

Experience with 24x7x365 monitoring, incident response, and on-call support;

Experience in troubleshooting that spans systems, network, and code;

Experience determining and negotiating error budgets, SLIs, SLOs, and SLAs with product owners;

A systematic problem-solving approach, coupled with solid communication skills;

The ability to work independently and as a member of a greater team, including cross-team activities; and,

Experience working with Agile methodologies (Scrum, Kanban) across the SDLC.

Preferred Qualifications

Demonstrates extensive abilities and/or a proven record of success in the following areas:

Demonstrating experience within development of the complete application stack, inclusive of software engineering and systems engineering responsibilities (e.g., full-stack development);

Requirement gathering, validation, fulfillment, and change management;

Infrastructure operations experience, including self-healing autonomy;

Working within regulatory frameworks such as SOX, SOC2, etc.;

Experience in Chaos Engineering;

Experience with integration technologies like SnapLogic; and,

Experience with a variety of databases and basic DBA skills (MySQL, SQL Server, Oracle, Postgres, Redis, Couchbase, and/or Cassandra).

*It's not expected that the candidate would have expertise across all of these areas. We're looking for candidates who are particularly strong in most areas and have some interest and capabilities in others.

EEO Employer

Apex Systems is an equal opportunity employer. We do not discriminate or allow discrimination on the basis of race, color, religion, creed, sex (including pregnancy, childbirth, breastfeeding, or related medical conditions), age, sexual orientation, gender identity, national origin, ancestry, citizenship, genetic information, registered domestic partner status, marital status, disability, status as a crime victim, protected veteran status, political affiliation, union membership, or any other characteristic protected by law. Apex will consider qualified applicants with criminal histories in a manner consistent with the requirements of applicable law. If you have visited our website in search of information on employment opportunities or to apply for a position, and you require an accommodation in using our website for a search or application, please contact our Employee Services Department at [email protected] or 844-463-6178.

Apex Systems is a world-class IT services company that serves thousands of clients across the globe. When you join Apex, you become part of a team that values innovation, collaboration, and continuous learning. We offer quality career resources, training, certifications, development opportunities, and a comprehensive benefits package. Our commitment to excellence is reflected in many awards, including ClearlyRated's Best of Staffing® in Talent Satisfaction in the United States and Great Place to Work® in the United Kingdom and Mexico.

Created: 2024-06-28
Reference: 1281922
Country: United States
State: Texas
City: Dallas
ZIP: 75287

