Senior Site Reliability Engineer, Incident Response

2 weeks ago

London, Greater London, United Kingdom Box Full time

WHAT IS BOX?

Box is the market leader for Cloud Content Management. Our mission is to power how the world works together. Box is partnering with enterprise organizations to accelerate their digital transformation by creating a single platform for secure content management, collaboration and workflow. We have an amazing opportunity to further establish ourselves as leaders in the space, and we need strong advocates to help us achieve that goal.

By joining Box, you will have the unique opportunity to help capture a majority of this developing market and define what content management looks like for the digital enterprise. Today, Box powers over 97,000 businesses, including 70% of the Fortune 500 who trust Box to manage their content in the cloud.

WHY BOX NEEDS YOU

Box is looking for a dynamic Global Senior Site Reliability Engineer to help lead our Global Technical Operations and oversee the continuous health, availability, and reliability of our industry-leading platforms and SaaS offerings. It is the responsibility of the TDO team to lead 24x7 GTOC teams in preventing, monitoring, identifying, troubleshooting, mitigating, and resolving issues that affect the availability and quality of Box's platforms and services.

This role is an integral shift-based leader and the single point of technical escalation within the GTOC organization, assuming accountability for overall production site health and the performance of core customer-facing journeys. You will help maintain total site awareness by detecting metric and service deviations, providing the final level of change approval, and proactively identifying potential issues and resolving them before they escalate to customer-impacting incidents.

We are building a world-class Operations Center and need the best talent possible to get us there. That's where you come in.

WHAT YOU'LL DO

Own and direct live-site Major Incident Management from detection, identification, escalation, mitigation, and recovery.
Triage, refine, and verify the Problem Statement, notify and coordinate the efforts of all appropriate SME resources, and lead cross-functional Incident Bridges to quickly identify and mitigate the problem and restore service. You'll be evaluated on how well you are able to reduce MTTD to MTTR.
Ensure accurate, valid and timely communication to key stakeholders and business entities.
Lead daily Incident and Change ticket reviews, coordinate and monitor change windows, and coordinate with Problem Management on TopOps Issues and action items.
Operate across organizational boundaries (Business, Dev, Ops, CS) to protect our customers, their data, and the availability of all Box services, from internal and external security threats, unanticipated volume surges, and significant performance issues.
Troubleshoot and identify critical problems in a SOA/API-based, global hybrid cloud, distributed edge architecture on multiple enterprise and public cloud regions.
Provide day-to-day technical expertise and experience to the organization to address issues in globally diverse, high-velocity 24x7 environments - from policy and procedural decisions to key architectural and tooling insights to improve Box's Incident, Change, and Problem Management engineering capabilities.
Lead daily reviews of planned changes (CAB) in Jira; accountable for reviewing and minimizing change risk, ensuring adequate and appropriate change timing and duration, and complete rollout, validation, and rollback plans that are optimized to prevent site or service impact.
Ensure all customer-impacting Incident tickets are completely and correctly documented and augmented with appropriate metrics, timelines, actions taken, and actions still pending.
Contribute to and review Incident postmortems to ensure adequate documentation and appropriate prioritization of action items related to reducing MTTI, MTTM and MTTR.
Participate in Problem Management scrums and Postmortems to identify leading organizational and company-wide technical issues, threats, and trends that block the ability of the organization or teams to perform their roles and provide services optimally and reliably.
Lead projects to improve tools and processes related to overall site and service manageability, observability, and resiliency.
Coordinate regularly with Infosec, Customer Success, Platform and Dev leaders to continuously assess new security and customer on-boarding threats and known issues.
Continuously mentor and train Global NOC and system engineers.

WHO YOU ARE

You have 5+ years of large-scale production/platform operations experience in a large SaaS provider environment, preferably as a Major Incident Manager, SRE team leader or Infrastructure (IaaS) or Platform (PaaS) Architecture SME in a Managed Service Provider environment.
Experience in bare metal, OpenStack, and K8s architectures supporting a large number of SOA/API-based services.
Exposure to Open Source Service Meshes, Proxies, Caching, Message Buses (Kafka, MQS), NoSQL (HBase, Hadoop), MySQL clusters, and Search environments (Solr, ES).
You should be competent in debugging global, distributed Web/API sites based on Linux systems (Ubuntu, RHEL, CentOS), BGP, iBGP, and IP Anycast networking in multi-vendor virtualized, Edge and hybrid public cloud architectures.
You are not expected to be an expert in all areas, but you should be familiar with common terminologies, processes, and architectures in Linux Open Source environments, and have a thorough understanding of Virtualization, Containers, and Kubernetes.
You are confident and comfortable communicating and interacting with everyone from individual contributors to C-level executives, across multiple countries, ethnicities, and backgrounds.
You have a rock solid command presence and are calm and collected in highly stressful situations, such as a major service outage.
You're driven to continuously learn new skills and technologies.
Bachelor's degree in Computer Science or Information Systems or equivalent technical field, or similar work experience in a large-scale 24/7 production environment supporting critical, real-time applications.
Flexibility to work different shifts and provide weekend coverage depending on need.

Required Skills

Solid understanding of ITILv4 Service Lifecycle Management, Service Delivery KPIs, SLIs, SLOs, and Incident, Change, and Problem Management framework, terminology, tools (ServiceNow, Remedy, Jira Service Desk), and processes
Solid knowledge and understanding of security standards and best practices, such as: OWASP, W3C, ISO 27001, SOC1-2, PCI, and SOX
Ability to troubleshoot secured protocols such as: SSH, SSO, TLS, FTPS, WebDAV, HTTPS
Solid understanding and debugging skills in TCP/IP, BGP, IP Anycast, and distributed internal and external DNS
Two years of working experience with, and knowledge of, multi-regional public cloud providers
Experience with observability tools and distributed tracing in large scale environments (Splunk, Datadog, Wavefront, Catchpoint, ThousandEyes, Sensu, SignalFx RUM, OpenTelemetry, SNMP)
Good understanding and experience with configuration management tools and CI/CD pipelines - Puppet, Ansible, Terraform, Artifactory
Excellent interpersonal and communication skills

Desired Skills

Understanding of Agile methods and tools (Jira).
Experience with WAF, Bot Managers, and Content Delivery Networks (Cloudflare, Akamai)
Experience working in and transitioning into multi-regional hybrid cloud architectures (GCP preferred, AWS)
Understanding of Apache Zookeeper and Hadoop.
Experience with large production Scala, Java, Node, PHP environments helpful.
Experience working with various message bus technologies (Kafka, RabbitMQ, MQS)
Experience working with relational and non-relational databases and search engines (MySQL, Postgres, HBase, Elasticsearch, Solr)
Experience with caching apps (Squid, Redis, Memcached)
Experience with service mesh technologies in a hybrid-cloud environment (Zookeeper, SmartStack)

BENEFITS

The Box benefits package includes pension, medical and dental coverage. We have a robust wellness program including 25 days of vacation (plus your birthday off) and subsidized gym membership. There is such a thing as a free lunch: our in-house chef prepares it daily, along with lots of snacks and drinks. Our EMEA HQ office is located in the impressive White Collar Factory on Old Street, with European offices in Paris and Munich.

EQUAL OPPORTUNITY

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, disability, or any other protected ground of discrimination under applicable human rights legislation. Box strives to respect the dignity and independence of people with disabilities and is committed to giving them the same opportunity to succeed as all other employees. Accommodations are available throughout the application process and an employee's employment at Box.

For details on how we protect your information when you apply, please see our Personnel Privacy Notice.


