Business Continuity for SMBs
Overview
For a small to medium-sized business (SMB), defining a program of processes and tests to mitigate the impact of a crisis takes time and money, and it is typically not a project that hand-to-mouth startups can afford. However, as a business expands and develops more comprehensive and integrated activities, it has more to lose and its risk appetite decreases. At that point, investing resources in crisis planning and management becomes desirable, if not necessary.
This article is meant as a primer for SMBs looking to understand Business Continuity and Business Continuity Planning. There is a nearly endless body of work including articles, case studies and Business Continuity Planning methodologies available and attainable through research and subscription or consulting services and no single article can provide the entire body of knowledge. This article is the result of research and 20+ years’ experience in business continuity and incident management.
World events that cause a business to have to react are often challenging and can increase the risk that outcomes will be less than desirable. While we cannot possibly plan for and create contingencies for everything (largely because we don’t know what we don’t know) there are proven methods to prepare for incidents. Businesses that may not have considered the ramifications of an inability to work in the office or allow customers into a store can be thrust into a sink or swim situation that requires decisions be made at a frenzied pace where enough wrong decisions may end the business. Even if your business was unprepared for a major incident (like a global pandemic) and it survived, there is no better time to acquaint yourself with the formal practice of:
- Understanding the potential impact on the business during an incident.
- Planning for the activities that will mitigate the impact enough to allow the business to sustain itself through the crisis.
- Taking the experience from the crisis as input to future planning and testing to continuously improve.
The ultimate goal is building a Business Continuity Program / Plan that can be used for defining and testing the effective execution of activities that enable a business to continue the delivery of services in whole or in part during a time of crisis or incident.
Critical success factors include:
- Defining Services – The ability to define services and the deliverables resulting from those services. For example, a doctor’s office wants to deliver urgent care services, healthy patient services like checkups and physicals, and medical advice in person or over the phone, to name a few. A fast food restaurant wants to serve meals to in-person customers and deliver catering services to local business clients. A software as a service (SaaS) business wants its online portals to function and produce the information expected by its clients.
- Understanding Risk – Assess known risks that may cause business processes to be disrupted. What are the threats and how can they be categorized and rated in terms of potential for damage? Are there existential threats, those that pose a risk of shutting down the business?
- Understanding Impact – Analyze defined services to understand the impact on service delivery due to incidents that may cause a slowdown or total disruption. Are lives at stake?
- Planning and Preparation – Create the plans that include people, process and technology that will allow for the operation to recover to the defined point where services are being delivered at an acceptable level.
- Testing and Maintaining – Test and continuously update plans to ensure that they are understood by employees, are actually achievable and are kept up to date as the business continues to change and grow.
Business Continuity
Business Continuity (BC) is the overarching term used to describe the program created and implemented to restore the flow of business operations and deliver critical services to customers in the event of an incident such as a fire, flood or malicious attack by cybercriminals. It can be thought of in two main domains:
- Service Delivery: The processes and human resources required to provide services to customers. In most cases this refers to the “customer facing” portion of service delivery whether it be process-to-customer or person-to-customer.
- Support Systems: The back-end or pre-delivery people, process and technology that enable the customer facing processes to execute properly.
A Business Continuity Plan is the documentation describing the people, process and technology that will be used to bring critical services back online during or after the incident that caused the business interruption.
Businesses vary widely in the services they deliver and the effort required to produce those services. Depending on the business model, Business Continuity Planning will require varying levels of effort and resources. Business Continuity Planning is a term used by information security and incident management professionals and at its essence means “to plan for the continued (uninterrupted) delivery of services to customers and clients during an incident disrupting normal business operations”. In other words, it means planning for “continuity” when an incident threatens to stop those services, or to slow them to a point where customers and clients are no longer served within their expectations, whether those expectations are defined formally or implied.
There are myriad examples of businesses implementing contingency measures for specific risks like supply chain interruption, loss of power, delivery obstruction and so on. Business Continuity Planning institutes a formal plan that is understood by employees, updated to reflect business changes and tested on a regular basis. While the terms contingency planning and Business Continuity Planning are sometimes used interchangeably, Business Continuity Planning has become the popular way to refer to an overall program designed to evaluate all critical business services and develop plans to keep services up and running.
There are a number of scenarios that may cause a business to decide to create a Business Continuity program:
- A regional power outage occurs shutting down the office building where employees work every day. What if employees are not trained on how to deliver services from another location, be that home or a remote office with backup power like a diesel generator?
- You have a network outage due to a recent change made to a router or firewall that is not remedied for a full day, and you recently switched your phones from a local PBX to VoIP, which requires the network. How will employees interact with clients?
- Your professional office (doctor, lawyer, etc.) has been infiltrated by malware that was not detected by your anti-virus software and has rendered most of the office PCs unusable. What is the plan to identify, protect, detect, respond and recover? How will patients be admitted? How will paralegals do their research?
Disaster Recovery
Business Continuity is sometimes confused with, or used interchangeably with, another term within the overall Business Continuity domain: Disaster Recovery (DR). While Business Continuity describes the entire domain of planning and testing to ensure that critical services can be delivered during an incident or crisis, DR is the part of Business Continuity that covers the plans and processes used to recover the support systems that allow products and services to be delivered to customers. In other words, a DR plan is focused on restoring information technology (IT) infrastructure and operations after a disaster has been declared.
An example of a situation requiring a DR plan is one where a company’s data center (or computer room, or LAN closet, etc.) becomes unusable due to a power outage or a disaster. The company has database servers that hold customer and other data needed to provide its services. Web and application servers read from and write to those database servers, manipulate the data and display it in a browser via the web server.
Without a DR plan, the company cannot deliver services until the data center or access to it is restored, which may take hours, days or weeks. A very basic plan would document in detail the steps necessary to refresh or recreate the components (servers) in another datacenter and then restore data to the databases from a backup to bring services back online. This could take between 2 and 5 days, which may seem acceptable if it is in line with your customers’ expectations or the service level agreements you have with them. One note of caution – in today’s climate of immediate response, even with an agreement with a customer that 2 days of downtime are acceptable, you will likely lose customers if services are down for that length of time. They may not be able to get a refund or sue, but they will take a look at your competitors. Better to have a plan that allows for recovery in minutes or hours instead of days.
A more advanced disaster recovery plan will have anticipated a potential incident taking servers and networks offline and implemented several safeguards to hasten the recovery of computing resources. For example, you may pre-configure web and application servers in a duplicate and remote datacenter, and continuously replicate database data to the remote location. If these compute resources are kept offline under normal circumstances, the datacenter service provider (AWS, Azure or a more local provider) will typically charge far less to keep them defined and ready to launch than it would for running resources.
An even less hands-on approach to DR is to take advantage of a datacenter service provider’s “DR as a Service” or DRaaS. Unlike manually defined recovery methods, DRaaS offerings from datacenter (or cloud) service providers typically use continuous data protection techniques, which can enable sub-second Recovery Point Objectives (RPOs). Highly automated machine conversion and orchestration can enable Recovery Time Objectives (RTOs) of minutes or even seconds, but at a significantly higher cost than manual or pre-defined methods.
Information Security’s Role
What do Business Continuity, DR and availability have to do with information security? At its core, information security is made up of the methods implemented to protect a business’s assets and data, providing confidentiality, integrity and availability (also known as the CIA triad) of systems and data to match the organization’s tolerance for risk or risk appetite. It is the availability part of the CIA triad that information security professionals are concerned with when looking at Business Continuity and Business Continuity Planning.
Availability protection is the process and action taken to allow authorized persons to consistently access data within the time periods that are agreed upon or expected. That could mean making sure that the most critical information contained in files in a file room, documents on a disk drive or data in a database remain accessible.
Availability then is where the information security organization is concerned with maintaining access to data and information. In most cases, the information security team must work closely with building facilities, business heads, compute and network infrastructure teams and application teams to facilitate and lead the properly sized Business Continuity program (including DR) for the business.
Risk Assessments
Assessment of risk is an important step taken when designing and building a Business Continuity program. When assessed properly and understood, risk can be mitigated and managed to an acceptable level. What is the risk to a business (or a particular business line within a business) should it discontinue delivering services at the level expected by its customers? Loss of revenue? Loss of reputation?
While there is no way of eliminating risk totally, by analyzing business processes and the support structures in place to enable them, a business can take the proper steps to reduce the overall risk level to one commensurate with what the business can tolerate. Operational risk mitigation and Business Continuity Planning will be a combination of what a business has deemed is necessary to protect, the level of risk it can accept or tolerate and the resources the business is willing to invest to protect it.
There are several good frameworks that can be used to conduct a risk assessment. Among them:
- Operationally Critical Threat, Asset and Vulnerability Evaluation (OCTAVE)
- Factor Analysis of Information Risk (FAIR)
- National Institute of Standards and Technology’s (NIST) Risk Management Framework (RMF)
- Threat Agent Risk Assessment (TARA)
However, an SMB should not get hung up on using a large framework. Lighter-weight approaches, such as the risk assessment checklists listed under Resources, may be better suited to smaller enterprises.
A risk assessment is necessary to expose threats that may equate to risks and then to treat those risks in order to mitigate them. Knowing the external and internal threats that may affect a business is a critical success factor in Business Continuity planning. For example, after conducting a risk assessment one may document that the threat of power loss to the office building, or just the inability to access the building, is a high-level risk because employees use the computing and network infrastructure in the building to service customers. But are all services equal, and should a BCP be written that ensures that 100% of all services are continued at the same levels and on the same timelines? Probably not. Risks must be rated and treated according to their assigned level, and understanding the impact on services is another key factor in creating a successful BC plan.
Risks can be categorized at a high level into preventable, strategic and external. Preventable risks include those that can be managed by imposing rules and guidelines, like ethical behavior and following procedure. Strategic risks are often desirable because the company is taking on the risk to generate more revenue. Strategic risks must be addressed by applying risk management techniques that reduce the probability that the risk occurs at all, or that limit its impact to an acceptable level if it does. Finally, external risks also require risk mitigation techniques, focusing on mitigating the impact of the incident occurring.[i]
Identified risks should be categorized and rated. One way to do this is to use a risk probability / impact matrix, which rates each risk based on the probability that it will happen and its potential impact on the business. Such a matrix is typically color coded: the more red the cell, the higher the risk, and the more green the cell, the lower the risk.
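As a rough illustration of how a probability / impact matrix turns two ratings into one risk level, here is a minimal Python sketch. The five-point scales, the multiplication rule, the color thresholds and the sample risks are all assumptions made for the example; a business would substitute its own scales and cut-offs.

```python
from dataclasses import dataclass

# Hypothetical 1-5 scales; the article does not prescribe specific values.
LEVELS = {"very low": 1, "low": 2, "medium": 3, "high": 4, "very high": 5}


@dataclass(frozen=True)
class Risk:
    name: str
    probability: str  # key into LEVELS
    impact: str       # key into LEVELS

    @property
    def score(self) -> int:
        # Simple product of the two ratings, giving a value from 1 to 25.
        return LEVELS[self.probability] * LEVELS[self.impact]

    @property
    def rating(self) -> str:
        # Assumed thresholds mapping a score onto the matrix colors.
        if self.score >= 15:
            return "red"
        if self.score >= 6:
            return "yellow"
        return "green"


# Illustrative risks loosely based on the scenarios described earlier.
risks = [
    Risk("Office building loses power", "medium", "very high"),
    Risk("Malware disables office PCs", "low", "high"),
    Risk("Router change causes network outage", "low", "low"),
]

# Highest score first, so the most pressing risks are treated first.
for risk in sorted(risks, key=lambda r: r.score, reverse=True):
    print(f"{risk.name}: score={risk.score} rating={risk.rating}")
```

Multiplying the two ratings is only one common convention; some organizations weight impact more heavily or use a lookup table instead.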
Business Impact Analysis (BIA)
There are a few key steps to take in determining which parts of the business must be recovered, how quickly recovery must be achieved, and in what order each segment should be addressed – if at all (meaning some parts of a business may remain in a down state perpetually or until the crisis is over). By conducting a BIA a business can determine, down to the level of granularity required (e.g. business line or individual service) the:
- Recovery Time Objective (RTO) – The maximum amount of time allowable to recover the defined service, after which deleterious effects on revenue and / or reputation will ensue. If an RTO of 4 hours is defined, then from the time services cease to meet service level expectations there are 4 hours available to bring services back.
- Recovery Point Objective (RPO) – The maximum targeted period in which records, data (transactions) or other information might be lost. In the case of a database, the RPO determines the backup interval. For example, if the RPO is 12 hours, the business is willing to tolerate losing up to 12 hours of changes made to the database. The backup interval is therefore set so that a backup occurs at least every 12 hours; when an incident occurs, the last backup is used to restore the database, and the restored data will be at most 12 hours old.
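The arithmetic behind these two objectives can be sketched in a few lines of Python. The function names and the sample timestamps are hypothetical; the only rules encoded are that the worst-case data loss equals the backup interval and that the outage duration must not exceed the RTO.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # An incident just before the next backup loses almost a full interval.
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    # The backup schedule satisfies the RPO if the worst case fits within it.
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(outage_start: datetime, service_restored: datetime,
              rto: timedelta) -> bool:
    # The recovery satisfies the RTO if the outage lasted no longer than it.
    return service_restored - outage_start <= rto


# Illustrative values: 12-hour backups against a 12-hour RPO,
# and a 3.5-hour outage against a 4-hour RTO.
print(meets_rpo(timedelta(hours=12), timedelta(hours=12)))
print(meets_rto(datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 12, 30),
                timedelta(hours=4)))
```

A check like this is most useful after a test or drill, when measured recovery times can be compared against the objectives set in the BIA.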
At a high level, conducting a BIA entails:
- Documenting products and services within departments, listing them in a business function inventory worksheet.
- Documenting the consequences of not delivering each product and service.
- Rating each product or service along as many factors as may be necessary to be able to determine the proper recovery times. For example, how much revenue per hour or day does the service bring in? How important to your business reputation is it that this particular service be up and running? How would you rate the overall negative impact of this service being down or of the product not being produced – is anyone in danger?
- Identifying the delivery mechanisms and sub-processes for identified products and services. Create a dependency map of business functions to ensure that you know all of the sub-processes and sub-functions that are required to fully or partially recover a product or service.
- Deciding on the recovery times (the RTO) for resuming each segment of the business and its products and services. For example, 1 hour? 4 hours? 1 week?
- Determining the minimum resources required to restart the functions and processes that allow products and services to resume either partially or fully.
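The dependency map mentioned in the steps above lends itself to a simple sketch: if each function lists what it depends on, a topological sort yields an order in which things can be recovered. The service names below are hypothetical, and the example relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each entry lists what must be running first.
dependencies = {
    "client portal": {"web server", "application server"},
    "web server": {"network"},
    "application server": {"database", "network"},
    "database": {"network"},
    "network": set(),
}

# static_order() yields every item after all of its dependencies.
recovery_order = list(TopologicalSorter(dependencies).static_order())

for step, item in enumerate(recovery_order, start=1):
    print(f"{step}. recover {item}")
```

If the map contains a circular dependency, `TopologicalSorter` raises `graphlib.CycleError`, which is itself a useful finding during a BIA.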
When conducting a BIA, the data should be collected in an application or database that allows for review and change. This can range from an Excel spreadsheet to an application specifically geared to BIA like BCMMetrics (https://bcmmetrics.com/bia-on-demand/).
Here are some of the most likely data points as columns on a BIA spreadsheet:
- Product or Service Name
- Required Critical Activities – What are the activities required to deliver this product or service?
- Seasonal Variations – peaks, etc.
- Service Level Agreements / Expectations – What commitments or expectations exist relating to delivery times?
- Inward Departmental Dependencies – What other business areas are required and provide services to be able to deliver this product or service?
- Outbound Departmental Dependencies – What other business areas depend on this product or service to be able to produce another?
- IT Support – Applications – What IT applications must be available in order to be able to deliver this product or service?
- IT Support – Services – What other IT services are required (e.g. help desk)?
- Vendors / Suppliers – Are there vendors that are required to deliver services? What are the services and what is the time schedule for the service delivery?
- Key contacts – Department heads, vendor contact points, IT administrators, etc.
Working backward from the end delivery point is often a good method to get to the full delivery chain including integration points and dependencies.
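For a business that starts with a spreadsheet, the worksheet can be sketched as a small record type that is written out as CSV. This is only an illustration: the field names are a hypothetical subset of the columns listed above, `rto_hours` is an added column assumed for sorting, and the two sample rows are invented.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class BiaRow:
    # Hypothetical subset of the worksheet columns described above.
    product_or_service: str
    critical_activities: str
    sla_expectation: str
    it_applications: str
    key_contacts: str
    rto_hours: int  # assumed extra column holding the agreed RTO


rows = [
    BiaRow("Billing", "Issue invoices", "Monthly invoicing",
           "Accounting system", "Finance lead", 72),
    BiaRow("Client portal", "Serve client reports", "Available in business hours",
           "Web app, database", "Manager of DevOps", 4),
]


def to_csv(bia_rows: list[BiaRow]) -> str:
    """Write the rows as CSV text, shortest RTO first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(BiaRow)])
    writer.writeheader()
    for row in sorted(bia_rows, key=lambda r: r.rto_hours):
        writer.writerow(asdict(row))
    return buffer.getvalue()


print(to_csv(rows))
```

Sorting by RTO puts the services that must come back first at the top of the sheet, which is usually the order in which the plan needs to address them.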
Building the Business Continuity Plan
Whether your business has survived a crisis by making the right decisions or by luck, you may now have decided to implement a Business Continuity program. While some investment is necessary to build, implement and maintain the program, it can begin small and become as comprehensive as your risk tolerance dictates. Business Continuity Plans describe the actions to be taken during an incident to ensure the continuity of those services that have been deemed critical. A Business Continuity program is broader: it includes business continuity plans, disaster recovery plans and test plans ranging from tabletop exercises to fully live drills, where employees work from remote locations and issues are tracked to improve the likelihood of success during an actual incident.
Depending on the size of the business, a Business Continuity program requires ongoing investment in resources to build the BC team and to train those employees deemed to be critical to contribute to the process of recovering services. It will also require labor cost to conduct tests and drills and document and deliver reports to executive management on the state of the BC program. In addition, the program may require technical tools, BC-specific applications and consultants.
Creation of the plan presumes that:
- A Risk Assessment has been conducted to determine what internal or external risks must be addressed and to what level.
- A BIA has been conducted and we know what critical services must be brought online during an incident, the supporting functions and processes for those services, the maximum time the service can be offline (RTO) and the point in time that any data required for services will be recovered to (RPO).
- Senior management has approved of the project to create a Business Continuity Plan or a full program which would include the plan itself, tabletop test plans, drill-type test plans, processes to keep BIAs fresh in order to update plans with changes to business processes, etc.
Step 1 – Define the Business Continuity Team
An example BC team might include:
- An Executive Sponsor – CEO, CFO, COO, board member or other executive level resource.
- The CISO or most senior Information Security Officer
- A BC Coordinator and a backup that will be responsible for writing plans, communication to the BC team and company as a whole, facilitating tests and documenting test results. (At a smaller sized company with employee count between 20 and 100 the BC coordinator may be the senior information security officer.)
- BC Declaration Team – comprised of those employees with the responsibility to declare an incident to start the BC recovery process during an actual crisis.
- BC Participating Managers – Responsible for ensuring that their teams are ready and able to provide services to customers during an incident that requires execution of the BCP. These may include, for example, the Director of Customer Services and the Manager of DevOps.
Step 2 – Create the Plan
This is where the BC coordinator, in communication with others on the BC team, writes the plan to address the business services and functions that will be recovered during an incident and the steps required to conduct recovery to the level required.
The plan defines the terms and documents the requirements, scope, assumptions, roles and responsibilities necessary to allow the company’s clients to continue to receive services within contractual expectations during an incident that causes the company home office to be inaccessible. Sections of a viable plan will likely include:
- Objective – A statement of the desired outcomes of executing the plan. Example – “The objective of the Company Business Continuity Plan (BCP or the plan) is to define the terms and document the requirements, scope, assumptions, roles and responsibilities necessary to allow Company Clients to continue to receive services within contractual expectations during an incident that causes the company home office to be inaccessible.”
- Scope – The circumstances or identified risks that this plan is meant to protect against. For example, one of the bulleted items would read “Access to company offices is prohibited, but the network is up.”
- Business Continuity Organization (roles on the BC team) – An up to date list of the roles, people assigned in those roles and a pointer to the contact list for all employees required to bring about successful achievement of the objectives.
- Preparation – Repeatable actions which must be taken to ensure that when an incident occurs and an event is declared, the plan will execute successfully. For example, “employees must test their internet connection from home and test access to the VPNs they need at least once per quarter”.
- Assumptions – A list of items that are assumed to be true because the preparation steps have been carried out. For example, “Employees have a working cell phone that can send and receive text and chat messages.”
- Procedure and Event Process – The actual step-by-step instructions from event declaration to successful resumption of the service.
- Required Resources – An internet connection, a company laptop, a VPN connection, etc.
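The quarterly check described under Preparation is easy to let slip, so some teams track it. The following Python sketch flags employees whose last successful home internet and VPN test is overdue. The 92-day approximation of a quarter, the employee names and the dates are all assumptions for illustration.

```python
from datetime import date, timedelta

# Assumption: "once per quarter" is approximated as 92 days.
MAX_AGE = timedelta(days=92)


def overdue_checks(last_tested: dict[str, date], today: date) -> list[str]:
    """Return the employees whose last successful test is older than MAX_AGE."""
    return sorted(
        name for name, tested_on in last_tested.items()
        if today - tested_on > MAX_AGE
    )


# Hypothetical record of each employee's last successful test.
last_tested = {
    "alice": date(2024, 1, 10),
    "bob": date(2024, 5, 2),
}

print(overdue_checks(last_tested, today=date(2024, 6, 1)))
```

The same pattern could be reused for any other recurring preparation step that the plan's assumptions depend on.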
Example:
A Software as a Service (SaaS) business that has over 1,000 clients worldwide with a Client Success Manager (CSM) assigned to each client loses the ability to work from within corporate offices. Power is on at the office and the network is running as normal, but people are not allowed into the building.
The plan was written and implemented to ensure that services to clients do not break the SLA. While it may be obvious that people can work from home with an internet connection, that is just one part of the solution. When conducting the BIA it may be discovered that underlying requirements exist for CSMs to adequately support certain clients. There may be many factors to consider and requirements to test. In this example here are the unique circumstances:
- Due to a legacy fat-client application which is a front-end query tool still in use by some customers, a number of Account Managers who service those clients will require VPN access into the office.
- Customer Support Engineers (CSE) and DevOps Engineers (DE) require VPN access directly into the datacenter service provider where the compute and storage infrastructure resides in order to resolve technical issues that arise under normal circumstances.
- Employees will revert to a manual process to log time they spend with clients for billable purposes. The manual time tracking documents will be collected at the end of each day.
- CSMs whose internet connection is not functioning, or who have other issues prohibiting them from providing services, are to escalate to the IT help desk for assistance. The IT help desk has instructions to immediately notify the manager of the CSM so their client account workload can be redistributed to other CSMs.
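The workload redistribution in the last item can be sketched as a simple round-robin reassignment. This is an illustrative approach rather than a prescribed one, and the CSM and client names are hypothetical.

```python
from itertools import cycle


def redistribute(assignments: dict[str, list[str]],
                 unavailable: str) -> dict[str, list[str]]:
    """Share the unavailable CSM's clients round-robin among the other CSMs."""
    remaining = sorted(csm for csm in assignments if csm != unavailable)
    if not remaining:
        raise ValueError("no CSMs available to take over clients")
    # Copy the existing lists so the original mapping is left untouched.
    result = {csm: list(assignments[csm]) for csm in remaining}
    for client, csm in zip(assignments.get(unavailable, []), cycle(remaining)):
        result[csm].append(client)
    return result


# Hypothetical assignments before the incident.
assignments = {
    "ana": ["client-1", "client-2"],
    "ben": ["client-3"],
    "cy": ["client-4", "client-5", "client-6"],
}

print(redistribute(assignments, unavailable="cy"))
```

Round-robin ignores how busy each remaining CSM already is; a manager might instead assign each client to whoever currently has the fewest accounts.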
Step 3 – Test and Conduct Drills
This step is as important as creating the plan itself. Without tabletop testing and / or running a drill (also called a live fire exercise) at least yearly the plan will have a much lower chance of being effective during an actual crisis. For a tabletop exercise, use an identified scenario that is relevant to the business (hint – this comes from the risk assessment and the BIA).
A tabletop test is a simulated threat exercise that must be scripted prior to execution. The scripting should be performed by a few select members of the BC team and should not be too prescriptive, leaving some room for flexibility. This type of test has both executive / management and technical components. It may be best to separate them into two tests depending on the length and depth of the test. If people’s time is wasted on areas they do not need to be involved in, you may not get participation in the future.
In a live fire exercise, the incident is simulated and employees conduct business and deliver services according to the plan. For example, on a pre-determined day employees will be instructed to follow the plan in place for an incident which renders the office inaccessible and without power. Employees would do whatever they need to do according to the plan, for example, working from home. Not all potential incidents can be live-fire tested because they may result in lower service levels. In these cases, asking a limited set of customers to participate in a test can help validate the plan and strengthen the bond with customers.
A disaster recovery test simulates that the data center is not usable and all IT “fails over” to the DR site where infrastructure is set up to take on the IT production load. This is an extremely intensive and complex test that requires safeguards to ensure that databases are not erroneously corrupted with invalid or ambiguous transactions during the test and requires a “fail back” plan to end the test. Preparation for this type of test can include network engineers blocking the real production site from being accessed by the DR site such that no one can log on to a server or other device in the main data center and “cheat” by bringing in an updated application from the production site during the test.
Step 4 – Update the Plan
The plan should be updated at least yearly, but probably more often based on any of the following contributing factors:
- Changes in the threat landscape and risk profile for the business.
- Changes in business process or any other major business difference that renders the last BIA obsolete or lacking.
- Anything learned from tabletop or live-fire exercises.
It may be wise to integrate updating BC plans with the change management process, ensuring that plans are updated as the business changes.
Summary
A Business Continuity program is a continuously improving set of dynamic processes aimed at providing Business Continuity services within a company. A Business Continuity program will often include:
- Risk Assessment
- Business Impact Analysis
- Disaster Recovery Plan
- Business Continuity Plan
- Business Continuity Communication (to customers and employees)
- Business Continuity Team
Business Continuity Planning is an important process that can help a business know what must be done when an incident occurs that poses a risk to delivering products or services. By planning for product or service outages focused on the products and services that are the most critical to your business, your customers have a higher probability of enjoying continued services when incidents occur which may build trust and loyalty.
Finding Help
An entire article can be written on how to approach finding the proper external help to create a Business Continuity Program. There are hundreds and likely thousands of information security consulting firms willing to provide services ranging from complete BC program creation and operation, to conducting risk assessments and business impact analyses, to running tabletop exercises.
Most companies would do well to evaluate InfoSec firms in terms of experience and tools. How much experience do the actual consultants that will work on your engagement have with the specific services you require? What pre-packaged services and tools come with the service to ensure that the firm is not charging you for re-inventing the wheel? For example, customizable templates, a methodology for collecting information, a tool to calculate risk levels, etc.
A brief (and certainly not exhaustive) list of services a business may seek includes:
- Initial business continuity maturity assessment
- Business impact analysis on a particular division or service, or the entire enterprise
- Conducting a full information security risk assessment or one that is specifically geared toward incident management
- Disaster recovery planning and testing
- Business continuity operations
- Creating a business continuity plan
- Creating a full business continuity program
- Facilitating a tabletop exercise
Resources
There are many resources available to delve deeper into the subject matter of Business Continuity as a whole, including whitepapers, templates, methodologies, training programs and consulting services.
Risk Assessment:
https://hbr.org/2012/06/managing-risks-a-new-framework – This article discusses risk management as more than just checking boxes and imposing rules that employees must follow. The point is that not all risks can be managed in a rules-based motif.
https://safetyculture.com/checklists/15-best-risk-assessment-checklists/ – This article provides links to risk assessment checklists that can help to jumpstart a risk assessment at an SMB and avoid getting stalled by the weight of a large risk assessment framework.
Risk Assessment: Tools, Techniques, and Their Applications 2nd Edition by Lee T. Ostrom and Cheryl A. Wilhelmsen
The Security Risk Assessment Handbook: A Complete Guide for Performing Security Risk Assessments, Second Edition by Douglas Landoll
Business Impact Analysis:
https://www.smartsheet.com/business-impact-analysis-template – This article provides links to BIA templates that can be used to create a BIA that works for the business model.
Practitioner’s Guide to Business Impact Analysis (Internal Audit and IT Audit) by Priti Sikdar
Business Continuity:
Journal of Business Continuity & Emergency Planning – A Henry Stewart publication that provides case studies of actual incidents and how they were handled.
Business Continuity Planning: A Project Management Approach by Ralph L. Kliem and Gregg D. Richie
Business Continuity Management: Global Best Practices 4th Edition by Andrew Hiles and Kristen Noakes-Fry
[i] Robert S. Kaplan and Anette Mikes, “Managing Risks: A New Framework,” Harvard Business Review.