As a Site Reliability Engineer (SRE), you’ll help build a meaningful engineering discipline, combining software and systems to develop creative engineering solutions to operations problems. Much of our support and software development focuses on optimizing existing systems, building infrastructure, and reducing work through automation. You’ll join a team of curious problem solvers with a diverse set of perspectives who are thinking big and taking risks. In this environment, you’ll take the lead on relevant projects, backed by an organization that provides the support and mentorship you need to learn and grow. As an SRE, you’ll be focused on running better production applications and systems.
This Senior Site Reliability Engineer (SRE) position will help the JPMC Artificial Intelligence / Machine Learning (AI/ML) team with production support in the public cloud. In this role, you’ll work with AI/ML and cloud engineers to build the platform, pipeline, and monitoring systems that ensure the application landscape is designed to take full advantage of JPMC’s global cloud solution.
* Design, code, test, and deliver software to automate manual operational work
* Troubleshoot priority incidents, facilitate blameless post-mortems and ensure permanent closure of incidents
* Engage with development teams throughout the life cycle to help build software for reliability and scale, minimizing later refactoring or changes
* Identify application patterns and apply analytics in support of better service level objectives (SLOs)
* Design self-healing and resiliency patterns
* Design automated software and product upgrades, change management, and release management solutions
* Coach or manage teams as applicable
* Participate in the 24×7 support coverage as needed
* Implement SRE frameworks to support global multi-cloud environments, and ensure the highest SLA levels through operational excellence
* Provide failure analysis and root cause analysis when required
* Help develop and improve the quality of technical engineering documentation
* Help drive the maturity of the software development lifecycle
* Provide quality control of engineering deliverables
* Provide technical consultation to product management
* Perform deployment, administration, management, configuration, testing, and integration tasks related to the AI/ML platforms in cloud environments
* Help develop new cloud engineering strategies and implementations for the firm
* Champion a DevOps model so that services are automated and elastic across all platforms
* Help coach and mentor less experienced team members
* Write operational documentation and maintain a knowledge base of known issues and their solutions
* Work with Architecture to design reusable patterns for applications to adopt, provide governance around adoption, and influence application development teams’ roadmaps and designs
* Partner with Infrastructure and application development (AD) teams to identify and implement automation opportunities that drive down toil and reduce technical debt
* Apply cloud compliance standards to application design to achieve reliability
* Bachelor’s degree or equivalent experience in a software engineering discipline
* Expertise in designing, coding, testing, and delivering software in at least one technology stack
* Proficiency in one or more technology domains; ideally a cross-domain expert able to solve complex, mission-critical problems within a business or across the firm
* Working knowledge of infrastructure components (e.g. routers, load balancers, cloud products, container systems, compute, storage, and networks)
* Excellent debugging and troubleshooting skills
* Demonstrated enterprise cloud infrastructure experience (AWS, Azure, GCP) in a mission-critical environment
* Familiarity with each step of the AI model development life cycle: data collection, model development, model training, model deployment, and inference
* Familiarity with AI/ML frameworks, statistical packages, and libraries such as TensorFlow, Amazon Machine Learning, Apache Spark, PyTorch, and scikit-learn
* In-depth OS experience (RHEL, Ubuntu, Windows Server) with strong debugging, troubleshooting, and problem-solving skills
* Experience building automation and tooling in a large enterprise environment, including engineering productivity tools such as CI/CD pipelines, Jenkins, and code coverage
* Site reliability engineering experience in one or more of the following languages: Python, Java, PowerShell, shell scripting, or Go
* Hands-on experience with cloud-based technologies and tools, especially for deployment, monitoring, and operations, such as Datadog, Prometheus, Splunk, Elasticsearch, and Grafana
* Strong working knowledge of modern development practices and tools such as Agile, CI/CD, Git, Terraform, and Jenkins
* Deep knowledge of Internet protocols and web services technologies such as HTTP, DNS, TCP/UDP, SOAP, JSON and REST
* Good understanding of networking protocols and cybersecurity best practices in cloud environments
* AWS certification is highly desirable
* Deep understanding of SRE philosophy, technologies, platforms and tools, SLA management, incident resolution, and automation
* Mastery of application, data and infrastructure architecture disciplines
* Command of architecture, design, and business processes
* Keen understanding of financial control and budget management
* Expertise in working in partnership with colleagues throughout the firm, and in leading collaborative teams to achieve common goals
* Hands-on experience managing operations of large-scale, internet-centric production environments for application or infrastructure services serving tens to millions of end users
* Prior experience at large-scale internet companies or with internet-scale technologies, where uptime and continuous availability were core to the business
* Understanding of networking and cloud technologies, for example security, load balancing, and network routing protocols
JPMorgan Chase & Co., one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world’s most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans over 200 years and today we are a leader in investment banking, consumer and small business banking, commercial banking, financial transaction processing and asset management.
We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. In accordance with applicable law, we make reasonable accommodations for applicants’ and employees’ religious practices and beliefs, as well as any mental health or physical disability needs.
Equal Opportunity Employer/Disability/Veterans