Senior Site Reliability Engineer, Storage
At Joyent, engineers at every level directly influence our business and services. Our Site Reliability Engineering team is a hybrid of software and systems engineers responsible for reliability, scalability, and automation of our platforms.
A senior engineer wants:
To architect infrastructure and applications behind customer-facing APIs that are highly available, reliable, and scalable, and that deliver a great customer experience.
To build systems that:
- Are low toil for operators
- Have elegant interfaces for users
- Deal with the complexities of rigorous business logic
- Supply adequate informational context for engineers to make good decisions
- Transparently handle failures
To participate and thrive in a delivery-oriented, goal-centric culture.
- Contribute to the design and creation of CI/CD systems
- Scale services up/out based on monitoring feedback to prevent overload
- Plan and manage new service deployments
Day-to-day Monitoring Improvements:
- Identify monitoring gaps based on incident root-cause analysis
- Identify monitoring needs for new/modified services
- Work with monitoring system owners to implement new metrics, dashboards and alarms
Standard Operating Procedures:
- Create/maintain runbooks and Standard Operating Procedures for recurring processes (alarm handling, audit, scaling)
- Contribute to product operator guide and debugging documentation
On-call Support (24x7):
- Act as the first escalation point for application issues
- Work with the incident response team to restore services
- Work rotating shifts and weekend schedules as required
Root Cause Analysis:
- Write RCA reports; work with engineering, operations, and customer support to produce action plans
- Assist and participate in action plans, with a focus on mitigations for any systemic resiliency and reliability issues; champion improvement initiatives
- Ensure RCA action plans are followed through
- Identify “toil” and look for automation opportunities
- Create/maintain operator tools; create product enhancement requests to replace workarounds
- Work with product teams on capacity usage thresholds and future planning
- Rectify underlying system issues that may have contributed to capacity problems
You’ll be a great fit if you are:
- Great to work with, with strong communication and people skills
- Passionate about building services in the cloud computing market
- Not afraid to respectfully disagree with others when quality does not meet standards
- Able to perform and make sound decisions under pressure
Additionally, you should have most (or all!) of the following:
- 5+ years of experience in one of the following areas: software development, DevOps, SRE, cloud infrastructure, QA
- Experience with building and running a cloud platform in a production environment
- Deep hands-on technical expertise in designing and deploying Linux/Unix-based systems
- Experience with building and operating CI/CD pipelines
- Experience with building and maintaining authentication systems
- Experience with monitoring tools such as Circonus, Prometheus, and InfluxDB/Telegraf/Kapacitor
- Working knowledge of all aspects of cloud infrastructure services (compute, storage, network, authentication)
Joyent, a wholly-owned subsidiary of Samsung, is the open cloud company. With its Triton Kubernetes services and support, Joyent helps its customers build and operate modern cloud native applications across multiple clouds. Joyent’s Triton Private Regions provide low cost, dedicated cloud infrastructure that gives its customers the ability to own their data and control their cloud costs.
To apply, please submit a brief introduction, a copy of your resume, and a link to your GitHub or LinkedIn profile to email@example.com with "Senior Site Reliability Engineer, Storage" in the subject line. Qualified applicants with criminal histories will be considered for the position in a manner consistent with the Fair Chance Ordinance.