
Disaster Recovery

General Information

Our goal is to deliver the best possible service and mitigate any potential disasters. However, it's important to acknowledge that achieving a fully disaster-proof design is not feasible.

Therefore, our focus is on creating a solution capable of withstanding various disaster scenarios and critical situations, along with implementing procedures for effectively recovering from such incidents.

We continuously monitor key service metrics such as server load, storage capacity, and service availability to proactively address any issues. Additionally, we keep track of hardware metrics including temperature and network status to assess the overall health of the environment and take necessary actions when required.
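As an illustration of the kind of checks involved, the minimal sketch below probes server load, storage capacity, and service availability. It assumes a Linux host and a hypothetical status endpoint; the thresholds, URL, and function names are illustrative placeholders, not our actual monitoring stack.

# Minimal sketch of the checks described above; thresholds and URL are assumptions.
import os
import shutil
import urllib.request

LOAD_THRESHOLD = 4.0            # assumed 1-minute load average limit
STORAGE_THRESHOLD = 0.85        # assumed maximum used fraction of the data volume
STATUS_URL = "https://status.example.internal/health"  # hypothetical endpoint

def check_server_load() -> bool:
    """Report healthy while the 1-minute load average stays under the threshold."""
    load_1m, _, _ = os.getloadavg()
    return load_1m < LOAD_THRESHOLD

def check_storage_capacity(path: str = "/") -> bool:
    """Report healthy while used space on the volume stays under the threshold."""
    usage = shutil.disk_usage(path)
    return (usage.used / usage.total) < STORAGE_THRESHOLD

def check_service_availability() -> bool:
    """Probe the status endpoint and treat an HTTP 200 response as healthy."""
    try:
        with urllib.request.urlopen(STATUS_URL, timeout=5) as response:
            return response.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    results = {
        "server_load": check_server_load(),
        "storage_capacity": check_storage_capacity(),
        "service_availability": check_service_availability(),
    }
    for metric, healthy in results.items():
        print(f"{metric}: {'OK' if healthy else 'ALERT'}")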


Disaster Recovery Site

In many cases, relying solely on the resiliency of the primary data center is not sufficient. You can learn more about data center resiliency in the Data Center Information section.

To address this, a solution called Twin Site is employed to create a redundant data center design. Each customer uses a single data center as their primary location. However, in the event of a major failure, a second data center serves as the disaster recovery site for that specific customer.

The DR Site is constructed with an identical architecture to that of the primary site. It encompasses customer application servers, configuration settings, database clones, and a replica of file storage.

It is feasible for us to reroute traffic from the primary data center to the DR site with little to no data loss. Any potential loss is attributed to the asynchronous nature of the DR site replication.

For performance reasons, we do not maintain tight synchronization between the sites; therefore, some non-replicated data may exist when transitioning to fallback mode.

If the primary data center becomes non-operational, our support engineers evaluate various conditions and decide about fallback on an incident-by-incident basis.
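The sketch below illustrates one way the replication gap to the DR site could be estimated as part of such a fallback decision. It is only a sketch: the timestamp inputs are assumed to come from whatever replication tooling is actually in use, and the 24-hour limit mirrors the maximum synchronization window described in the table further down.

# Minimal sketch of estimating the non-replicated window before a fallback decision.
from datetime import datetime, timedelta, timezone

MAX_ACCEPTED_GAP = timedelta(hours=24)  # matches the synchronization window described below

def replication_gap(primary_last_write: datetime, dr_last_applied: datetime) -> timedelta:
    """Gap between the newest write on the primary site and the newest change
    already applied on the DR site; this is the data at risk on fallback."""
    return primary_last_write - dr_last_applied

def fallback_report(primary_last_write: datetime, dr_last_applied: datetime) -> str:
    gap = replication_gap(primary_last_write, dr_last_applied)
    status = "within" if gap <= MAX_ACCEPTED_GAP else "outside"
    return f"Estimated non-replicated window: {gap} ({status} the accepted limit)"

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(fallback_report(now, now - timedelta(minutes=7)))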

Backups

For more information about the backup schedule and retention, please refer to Data Backups.



Failure and Disaster Recovery

This document provides an overview of possible failures and the measures taken to resolve them, together with a classification of potential service impact.

Each failure scenario below is described using the following fields: Incident, Description, Recovery Procedure, Potential Service Impact, Potential Data Loss, and Worst Impact Level.

Incident: Hard drive failure within acceptable limits; other hardware failure within acceptable resiliency limits

Description: The storage system is designed to withstand the loss of one or more drives, up to a certain number. This incident covers failures within the accepted number of simultaneously failed storage components (HDD, SSD, controller, etc.). All other hardware elements, such as servers, NICs, PSUs, and firewalls, are redundant (N+1) as well.

Recovery Procedure: Failed drives or hardware need to be replaced by service personnel as soon as possible.

Potential Service Impact: No impact on service. The service state is considered degraded or critical, but remains functional until the drives are replaced and the rebuilding process is finished. In some cases the storage system is able to heal itself by re-balancing stored data and return to a fully safe state; the affected components can then be replaced at the earliest convenience.

Potential Data Loss: No data loss. Customer data is not affected by a failure of the storage system or other hardware within acceptable limits. Storage nodes in the main data center are synchronized in real time; as long as at least one node survives, the data is fully safe.

Worst Impact Level: None

Incident: Multiple hard drive or storage component failures outside allowed resiliency limits

Description: In the unlikely event of multiple drive failures, the affected data center might become inoperable. User data is stored on at least two independent nodes.

Recovery Procedure: The data center breakdown must be analyzed. If the data center is operational, the failed equipment must be replaced by the provider and the most current data restored from offsite backups. When the main data center is back in an operational state, customer traffic can be redirected from the backup back to the primary data center.

Potential Service Impact: Service must fall back to the backup / secondary data center.

Potential Data Loss: Possible marginal loss of data. The secondary data center is not synchronized in real time, so some operations performed minutes before the catastrophic failure might not have been replicated to the backup location. The maximum allowed data loss time frame is considered to be 24 hours; in practice, data is synchronized throughout the day (see the sketch after this table).

Worst Impact Level: Low

Incident: Server failure (processing node)

Description: The service architecture allows for single-node failure. Each customer service resides on at least two processing nodes.

Recovery Procedure: The failed node / server must be replaced with new equipment.

Potential Service Impact: A slight reduction in service performance might be observed in some circumstances.

Potential Data Loss: No data loss. Customer data is not affected by the failure of a single node / server; the secondary processing node can access the fully synchronized storage system. Background tasks such as imports might be interrupted and should be restarted.

Worst Impact Level: Low

Incident: Network failure

Description: Network connections are fully redundant.

Recovery Procedure: Failed network equipment is replaced either by the data center provider or by the service provider.

Potential Service Impact: In case of a catastrophic failure, traffic must be redirected to the secondary backup (DR Site).

Potential Data Loss: No data loss. If a failure cannot be fixed within a reasonable time frame and traffic is redirected to the backup site, a decision must be made about master data (wait in read-only mode or use the backup as master).

Worst Impact Level: Low

Incident: Power failure

Description: Two independent power supply lines are used (A+B power). The data center is equipped with power rectifying facilities, battery backup, and a diesel generator for long-term power outages.

Recovery Procedure: Power is provided by the data center provider.

Potential Service Impact: There should be no downtime due to power loss. In case of a catastrophic failure in the power systems, each storage system is equipped with its own battery-backed cache that allows recovery of all data once power is restored.

Potential Data Loss: No data loss.

Worst Impact Level: Low

Incident: Human error, malicious user, or security breach

Description: Human error or malicious users are among the hardest incidents to protect against. Human errors include unintended deletion or modification of data.

Recovery Procedure: Data is stored both in the master data center and in an offsite backup location. The offsite backup location provides a 90-day rollback capability based on data snapshots, which protects against deleted data being synchronized across all data centers. Backup data must be restored in place of the damaged or deleted data.

Potential Data Loss: Possible data loss. Data loss should be contained within the last synchronization window, not greater than 24 hours. If a rollback to a specific date is ordered, changes made after that date will be lost, or can be made available as a recovery service.

Worst Impact Level: Medium

Incident: Main data center destruction or backup data center destruction

Description: This is a broad but unlikely case that includes explosion, terrorist attack, fire, flood, or another catastrophic event.

Recovery Procedure: Customer traffic is directed to the backup data center, and an assessment of the main data center's usability is performed. A new data center is selected or the existing data center is rebuilt.

Potential Service Impact: The customer application runs on the backup data center.

Potential Data Loss: Possible marginal loss of data. The secondary data center is not synchronized in real time, so some operations performed minutes before the catastrophic failure might not have been replicated to the backup location. In case of backup data center destruction, master data is not affected; backup restore points may be lost until new restore points are accumulated.

Worst Impact Level: Medium

Incident: Simultaneous destruction of multiple sites

Description: This scenario is theoretically possible, but would require the physical destruction of two independent data center locations - the main data center and a disaster recovery data center located almost 1000 km apart.

Recovery Procedure: A raw-data encrypted backup (the core database and file backup) is performed to a third location. The infrastructure and derived data must be rebuilt before the service can be restored.

Potential Service Impact: Service downtime is expected until a new data center is established.

Potential Data Loss: Possible data loss. Data loss should be contained within the last synchronization window, not greater than 24 hours.

Worst Impact Level: High
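Several scenarios above bound the acceptable data loss by the last synchronization window of at most 24 hours. The sketch below shows one way the age of the newest offsite restore point could be verified against that window; the backup listing format and function name are assumptions, not the actual backup tooling.

# Minimal sketch of checking the newest restore point against the 24-hour window.
from datetime import datetime, timedelta, timezone

RPO_WINDOW = timedelta(hours=24)

def newest_restore_point_age(restore_points, now=None):
    """Return the age of the most recent restore point, or None if there are none.
    restore_points is assumed to be a list of (name, completion time) pairs."""
    if not restore_points:
        return None
    now = now or datetime.now(timezone.utc)
    newest = max(completed_at for _, completed_at in restore_points)
    return now - newest

if __name__ == "__main__":
    sample = [
        ("db-snapshot-a", datetime.now(timezone.utc) - timedelta(hours=6)),
        ("db-snapshot-b", datetime.now(timezone.utc) - timedelta(hours=30)),
    ]
    age = newest_restore_point_age(sample)
    print("newest restore point age:", age)
    print("within 24h window:", age is not None and age <= RPO_WINDOW)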