Careers: Help Build the Open Cloud

Site Reliability Engineer (SRE)

San Francisco, CA, US
Remote

At Joyent, engineers at every level directly influence our business and services. Our Site Reliability Engineers are a hybrid of software and systems engineers responsible for reliability, scalability, and automation while keeping an eye on latency, performance, and capacity.

Responsibilities

Change Management
  • Plan and execute software updates on Joyent-managed clouds
  • Manage configuration changes and perform regular audits
  • Scale up/out services based on monitoring feedback to prevent overload
  • Plan and manage new service deployments; determine sizing and infrastructure resource needs
Day-to-day Monitoring Improvements
  • Identify monitoring gaps based on incident root-cause analysis
  • Identify monitoring needs for new/modified services
  • Work with monitoring system owners to implement new metrics, dashboards, and alarms
  • Review alarm thresholds periodically with the NOC team and fine-tune settings as appropriate
Standard Operating Procedures
  • Create/maintain SOPs for recurring processes (alarm handling, audit, scaling)
  • Contribute to the product operator guide and debugging documentation
On-call Support (24x7)
  • Act as the first escalation point for application issues
  • Work with the incident response team to restore services
  • Work rotating shifts and weekend schedules as required
Root Cause Analysis
  • Write RCA reports and work with engineering and operations teams to produce action plans
  • Assist with and participate in the action plans, with a focus on mitigations for any systemic resiliency and reliability issues; champion improvement initiatives
  • Ensure RCA action plans are followed through

Operator Tooling
  • Identify “toil” and look for automation opportunities
  • Create/maintain operator tools; create product enhancement requests to replace workarounds

Capacity/Usage Monitoring
  • Work with product teams on capacity usage thresholds and future planning
  • Rectify underlying system issues that may have contributed to capacity problems

Qualifications

You’ll be a natural fit if you are:
  • Passionate about building best-in-class services in the cloud computing market
  • Not afraid to disagree with others when the service quality does not meet your standards
  • Not content to point out issues for others to solve, but eager to get involved in solving them
  • Able to perform and make sound decisions in highly stressful situations
  • Actively involved in the open source community
Additionally, you should have most (or all!) of the following:
  • 3+ years of experience in one of the following areas: software development, DevOps, SRE, cloud infrastructure QA
  • Experience with building and running a cloud platform in a production environment
  • Deep hands-on technical expertise in designing and deploying global Linux- or Unix-based systems
  • Hands-on experience with CI/CD pipeline implementation
  • Hands-on experience with monitoring solutions such as Prometheus and InfluxDB
  • Working knowledge of all aspects of cloud infrastructure services (compute, storage, network)

About Joyent

Joyent, a wholly owned subsidiary of Samsung, is the open cloud company. With its Triton Kubernetes services and support, Joyent helps its customers build and operate modern cloud-native applications across multiple clouds. Joyent’s Triton Private Regions provide low-cost, dedicated cloud infrastructure that gives its customers the ability to own their data and control their cloud costs.

To apply, please submit a brief introduction, a copy of your resume, and a link to your GitHub or LinkedIn profile to jobs@joyent.com with “Site Reliability Engineer (SRE)” in the subject line. Qualified applicants with criminal histories will be considered for the position in a manner consistent with the Fair Chance Ordinance.
