Backup
In information technology, a backup, or data backup, is a copy of computer data taken and stored elsewhere so that it may be used to restore the original after a data loss event. The verb form, referring to the process of doing so, is "back up", whereas the noun and adjective form is "backup".[1] Backups can be used to recover data after its loss from deletion or corruption, or to recover data from an earlier time.[2] Backups provide a simple form of IT disaster recovery; however, not all backup systems are able to reconstitute a computer system or other complex configuration such as a computer cluster, Active Directory server, or database server.[3]
A backup system contains at least one copy of all data considered worth saving. The data storage requirements can be large. An information repository model may be used to provide structure to this storage. There are different types of data storage devices used for copying backups of data that is already in secondary storage onto archive files.[note 1][4] There are also different ways these devices can be arranged to provide geographic dispersion,[5] data security, and portability.
Data is selected, extracted, and manipulated for storage. The process can include methods for dealing with live data, including open files, as well as compression, encryption, and de-duplication. Additional techniques apply to enterprise client-server backup. Backup schemes may include dry runs that validate the reliability of the data being backed up. There are limitations[6] and human factors involved in any backup scheme.
Storage
A backup strategy requires an information repository, "a secondary storage space for data"[7] that aggregates backups of data "sources". The repository could be as simple as a list of all backup media (DVDs, etc.) and the dates produced, or could include a computerized index, catalog, or relational database.
3-2-1 Backup Rule
The backup data needs to be stored, requiring a backup rotation scheme,[4] a system of backing up data to computer media that limits the number of backups of different dates retained separately by appropriately re-using storage media, overwriting backups that are no longer needed. The scheme determines how and when each piece of removable storage is used for a backup operation and how long it is retained once it has backup data stored on it. The 3-2-1 rule can aid in the backup process. It states that there should be at least 3 copies of the data, stored on 2 different types of storage media, and one copy should be kept offsite, in a remote location (this can include cloud storage). Using 2 or more different media avoids losing all copies to the same cause of failure (for example, optical discs may tolerate being underwater while LTO tapes may not, and SSDs cannot fail due to head crashes or damaged spindle motors since they have no moving parts, unlike hard drives). An offsite copy protects against fire, theft of physical media (such as tapes or discs) and natural disasters like floods and earthquakes. Physically protected hard drives are an alternative to an offsite copy, but they have limitations, such as only being able to resist fire for a limited period of time, so an offsite copy remains the ideal choice.
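The 3-2-1 rule lends itself to a simple automated check. The following minimal Python sketch (the copy names, media types, and locations are hypothetical illustrations, not part of any standard) verifies the three conditions for a described set of copies:

```python
from dataclasses import dataclass

@dataclass
class Copy:
    """One copy of the data: where it lives and on what kind of media."""
    name: str
    media: str      # e.g. "hdd", "lto_tape", "cloud_object_storage"
    offsite: bool   # stored away from the primary site?

def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 rule described above."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite

# Hypothetical example: live data, a local external drive, and a cloud copy.
copies = [
    Copy("production volume", media="hdd", offsite=False),
    Copy("external drive", media="hdd", offsite=False),
    Copy("cloud backup", media="cloud_object_storage", offsite=True),
]
print(satisfies_3_2_1(copies))  # True: 3 copies, 2 media types, 1 offsite
```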
Because there is no perfect storage, many backup experts recommend maintaining a second copy on a local physical device, even if the data is also backed up offsite.[8][9][10][11]
Backup methods
Unstructured
An unstructured repository may simply be a stack of tapes, DVD-Rs or external HDDs with minimal information about what was backed up and when. This method is the easiest to implement, but unlikely to achieve a high level of recoverability as it lacks automation.
Full only/System imaging
A repository using this backup method contains complete source data copies taken at one or more specific points in time. Because it copies complete system images, this method is frequently used by computer technicians to record known good configurations. However, imaging[12] is generally more useful as a way of deploying a standard configuration to many systems rather than as a tool for making ongoing backups of diverse systems.
Incremental
An incremental backup stores data changed since a reference point in time. Duplicate copies of unchanged data are not copied. Typically a full backup of all files is made once or at infrequent intervals, serving as the reference point for an incremental repository. Subsequently, a number of incremental backups are made after successive time periods. Restores begin with the last full backup and then apply the incrementals.[13] Some backup systems[14] can create a synthetic full backup from a series of incrementals, thus providing the equivalent of frequently doing a full backup. When done to modify a single archive file, this speeds restores of recent versions of files.
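As a rough illustration of the restore chain described above, here is a toy Python model in which each backup is a mapping from file names to contents (deletions and metadata are ignored):

```python
def make_incremental(previous_state, current_state):
    """Record only files that are new or changed since the reference state."""
    return {path: data
            for path, data in current_state.items()
            if previous_state.get(path) != data}

def restore(full_backup, incrementals):
    """Start from the last full backup, then apply incrementals in order."""
    state = dict(full_backup)
    for inc in incrementals:
        state.update(inc)
    return state

# Hypothetical example with file contents as strings.
full = {"a.txt": "v1", "b.txt": "v1"}
monday = make_incremental(full, {"a.txt": "v2", "b.txt": "v1"})
tuesday = make_incremental({**full, **monday}, {"a.txt": "v2", "b.txt": "v3"})
print(restore(full, [monday, tuesday]))  # {'a.txt': 'v2', 'b.txt': 'v3'}
```

Merging the full backup with all later incrementals, as restore() does, is essentially what producing a synthetic full backup amounts to.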
Near-CDP
Continuous Data Protection (CDP) refers to a backup that instantly saves a copy of every change made to the data. This allows restoration of data to any point in time and is the most comprehensive and advanced data protection.[15] Near-CDP backup applications—often marketed as "CDP"—automatically take incremental backups at a specific interval, for example every 15 minutes, one hour, or 24 hours. They can therefore only allow restores to an interval boundary.[15] Near-CDP backup applications use journaling and are typically based on periodic "snapshots",[16] read-only copies of the data frozen at a particular point in time.
Near-CDP (except for Apple Time Machine)[17] intent-logs every change on the host system,[18] often by saving byte or block-level differences rather than file-level differences. This backup method differs from simple disk mirroring in that it enables a roll-back of the log and thus a restoration of old images of data. Intent-logging allows precautions for the consistency of live data, protecting self-consistent files but requiring that applications "be quiesced and made ready for backup."
Near-CDP is more practicable for ordinary personal backup applications, as opposed to true CDP, which must be run in conjunction with a virtual machine[19][20] or equivalent[21] and is therefore generally used in enterprise client-server backups.
Software may create copies of individual files, such as written documents, multimedia projects, or user preferences, so that failed write events caused by power outages, operating system crashes, or exhausted disk space do not cause data loss. A common implementation is an appended ".bak" extension to the file name.
Reverse incremental
A reverse incremental backup method stores a recent archive file "mirror" of the source data and a series of differences between the "mirror" in its current state and its previous states. A reverse incremental backup method starts with a non-image full backup. After the full backup is performed, the system periodically synchronizes the full backup with the live copy, while storing the data necessary to reconstruct older versions. This can be done either using hard links, as Apple Time Machine does, or using binary diffs.
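The hard-link variant can be sketched as follows (a minimal Python illustration, not the actual mechanism of Apple Time Machine or any specific product; it assumes a previous snapshot directory already exists):

```python
import filecmp
import os
import shutil

def snapshot(source_dir, prev_snap, new_snap):
    """Create a snapshot of source_dir in new_snap.

    Unchanged files are hard-linked to the copy already present in
    prev_snap (so they consume no extra space); changed or new files
    are copied in full.
    """
    for root, _dirs, files in os.walk(source_dir):
        rel = os.path.relpath(root, source_dir)
        os.makedirs(os.path.join(new_snap, rel), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            prev = os.path.join(prev_snap, rel, name)
            dst = os.path.join(new_snap, rel, name)
            if os.path.exists(prev) and filecmp.cmp(src, prev, shallow=False):
                os.link(prev, dst)       # unchanged: reuse the existing copy
            else:
                shutil.copy2(src, dst)   # new or modified: store a fresh copy
```

Each snapshot directory then appears as a complete mirror, while on disk only changed files occupy new space.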
Differential
A differential backup saves only the data that has changed since the last full backup. This means a maximum of two backups from the repository are used to restore the data. However, as time from the last full backup (and thus the accumulated changes in data) increases, so does the time to perform the differential backup. Restoring an entire system requires starting from the most recent full backup and then applying just the last differential backup.
A differential backup copies files that have been created or changed since the last full backup, regardless of whether any other differential backups have been made since, whereas an incremental backup copies files that have been created or changed since the most recent backup of any type (full or incremental). Changes in files may be detected through a more recent date/time of last modification file attribute, and/or changes in file size. Other variations of incremental backup include multi-level incrementals and block-level incrementals that compare parts of files instead of just entire files.
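Change detection by modification time, as described above, can be sketched as follows (illustrative Python only; real backup software typically also consults archive bits, file sizes, checksums, or filesystem journals):

```python
import os

def changed_since(directory, reference_time):
    """List files modified after reference_time (a POSIX timestamp).

    With reference_time set to the last full backup, the result is a
    differential candidate set; set to the most recent backup of any
    type, it is an incremental candidate set.
    """
    changed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.getmtime(path) > reference_time:
                    changed.append(path)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return changed
```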
Storage media
Regardless of the repository model that is used, the data has to be copied onto an archive file data storage medium. The medium used is also referred to as the type of backup destination.
Magnetic tape
Magnetic tape was for a long time the most commonly used medium for bulk data storage, backup, archiving, and interchange. It was previously a less expensive option, but this is no longer the case for smaller amounts of data.[22] Tape is a sequential access medium, so the rate of continuously writing or reading data can be very fast. While tape media itself has a low cost per unit of capacity, tape drives are typically dozens of times as expensive as hard disk drives and optical drives.
Many tape formats have been proprietary or specific to certain markets like mainframes or a particular brand of personal computer. By 2014 LTO had become the primary tape technology.[23] The other remaining viable "super" format is the IBM 3592 (also referred to as the TS11xx series). The Oracle StorageTek T10000 was discontinued in 2016.[24]
Hard disk
The use of hard disk storage has increased over time as it has become progressively cheaper. Hard disks are usually easy to use, widely available, and can be accessed quickly.[23] However, hard disks are close-tolerance mechanical devices and may be more easily damaged than tapes, especially while being transported.[25] In the mid-2000s, several drive manufacturers began to produce portable drives employing ramp loading and accelerometer technology (sometimes termed a "shock sensor"),[26][27] and by 2010 the industry average in drop tests for drives with that technology showed drives remaining intact and working after a 36-inch non-operating drop onto industrial carpeting.[28] Some manufacturers also offer 'ruggedized' portable hard drives, which include a shock-absorbing case around the hard disk, and claim a range of higher drop specifications.[28][29][30] Over a period of years, hard disk backups are less stable than tape backups.[24][31][25]
External hard disks can be connected via local interfaces like SCSI, USB, FireWire, or eSATA, or via longer-distance technologies like Ethernet, iSCSI, or Fibre Channel. Some disk-based backup systems, via Virtual Tape Libraries or otherwise, support data deduplication, which can reduce the amount of disk storage capacity consumed by daily and weekly backup data.[32][33][34]
Optical storage
Optical storage uses lasers to store and retrieve data. Recordable CDs, DVDs, and Blu-ray Discs are commonly used with personal computers and are generally cheap. The capacities and speeds of these discs have typically been lower than hard disks or tapes. Advances in optical media may shrink that gap in the future.[35][36]
Potential future data losses caused by gradual media degradation can be predicted by measuring the rate of correctable minor data errors; too many consecutive correctable errors increase the risk of uncorrectable sectors. Support for error scanning varies among optical drive vendors.[37]
Many optical disc formats are WORM type, which makes them useful for archival purposes since the data cannot be changed in any way, including by user error and by malware such as ransomware. Moreover, optical discs are not vulnerable to head crashes, magnetism, water ingress, or power surges, and a fault of the drive typically just halts the spinning.
Optical media are modular: the storage controller is not tied to the media themselves, as it is with hard drives or flash storage (see flash memory controller), so a disc can be removed and accessed through a different drive. However, recordable media may degrade earlier under long-term exposure to light.[38]
Some optical storage systems allow cataloged data backups without human contact with the discs, allowing for longer data integrity. A French study in 2008 indicated that the lifespan of typically-sold CD-Rs was 2–10 years,[39] but one manufacturer later estimated the longevity of its CD-Rs with a gold-sputtered layer to be as high as 100 years.[40] As of 2016, Sony's proprietary Optical Disc Archive[23] could reach a read rate of 250 MB/s.[41]
Solid-state drive
Solid-state drives (SSDs) use integrated circuit assemblies to store data. Flash memory devices—thumb drives, USB flash drives, CompactFlash, SmartMedia, Memory Stick, and Secure Digital cards—are relatively expensive for their low capacity, but convenient for backing up relatively low data volumes. A solid-state drive does not contain any moving parts, making it less susceptible to physical damage, and can have huge throughput of around 500 Mbit/s up to 6 Gbit/s. Available SSDs have become more capacious and cheaper.[42][29] Flash memory backups are stable for fewer years than hard disk backups.[24]
Remote backup service
Remote backup services or cloud backups involve service providers storing data offsite. This has been used to protect against events such as fires, floods, or earthquakes which could destroy locally stored backups.[43] Cloud-based backup (through services such as Google Drive and Microsoft OneDrive) provides a layer of data protection.[25] However, the users must trust the provider to maintain the privacy and integrity of their data, with confidentiality enhanced by the use of encryption. Because speed and availability are limited by a user's online connection,[25] users with large amounts of data may need to use cloud seeding and large-scale recovery.
Management
Various methods can be used to manage backup media, striking a balance between accessibility, security and cost. These media management methods are not mutually exclusive and are frequently combined to meet the user's needs. Using on-line disks for staging data before it is sent to a near-line tape library is a common example.[44][45]
Online
Online backup storage is typically the most accessible type of data storage, and can begin a restore in milliseconds. An internal hard disk or a disk array (possibly connected to a SAN) is an example of an online backup. This type of storage is convenient and speedy, but is vulnerable to being deleted or overwritten, either by accident, by malevolent action, or in the wake of a data-deleting virus payload.
Near-line
Nearline storage is typically less accessible and less expensive than online storage, but still useful for backup data storage. A mechanical device is usually used to move media units from storage into a drive where the data can be read or written. Generally it has safety properties similar to on-line storage. An example is a tape library with restore times ranging from seconds to a few minutes.
Off-line
Off-line storage requires some direct action to provide access to the storage media: for example, inserting a tape into a tape drive or plugging in a cable. Because the data is not accessible via any computer except during limited periods in which it is written or read back, it is largely immune to on-line backup failure modes. Access time varies depending on whether the media are on-site or off-site.
Off-site data protection
Backup media may be sent to an off-site vault to protect against a disaster or other site-specific problem. The vault can be as simple as a system administrator's home office or as sophisticated as a disaster-hardened, temperature-controlled, high-security bunker with facilities for backup media storage. A data replica can be off-site but also on-line (e.g., an off-site RAID mirror).
Backup site
A backup site or disaster recovery center is used to store data that can enable computer systems and networks to be restored and properly configured in the event of a disaster. Some organisations have their own data recovery centres, while others contract this out to a third party. Due to high costs, backing up is rarely considered the preferred method of moving data to a DR site. A more typical way would be remote disk mirroring, which keeps the DR data as up to date as possible.
Selection and extraction of data
A backup operation starts with selecting and extracting coherent units of data. Most data on modern computer systems is stored in discrete units, known as files. These files are organized into filesystems. Deciding what to back up at any given time involves tradeoffs. By backing up too much redundant data, the information repository will fill up too quickly. Backing up an insufficient amount of data can eventually lead to the loss of critical information.[46]
Files
- Copying files: Making copies of files is the simplest and most common way to perform a backup. A means to perform this basic function is included in all backup software and all operating systems.
- Partial file copying: A backup may include only the blocks or bytes within a file that have changed in a given period of time. This can substantially reduce needed storage space, but requires higher sophistication to reconstruct files in a restore situation. Some implementations require integration with the source file system.
- Deleted files: To prevent the unintentional restoration of files that have been intentionally deleted, a record of the deletion must be kept.
- Versioning of files: Most backup applications, other than those that do only full only/System imaging, also back up files that have been modified since the last backup. "That way, you can retrieve many different versions of a given file, and if you delete it on your hard disk, you can still find it in your [information repository] archive."[4]
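As a minimal illustration of the versioning idea in the last item, here is a Python sketch using timestamped file names (real backup applications track versions in their catalog or index rather than in file names):

```python
import os
import shutil
import time

def back_up_version(path, repository):
    """Copy path into repository under a timestamped name, keeping old versions."""
    os.makedirs(repository, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    versioned = f"{os.path.basename(path)}.{stamp}.bak"
    shutil.copy2(path, os.path.join(repository, versioned))

def versions_of(filename, repository):
    """List stored versions of filename, oldest first."""
    return sorted(v for v in os.listdir(repository)
                  if v.startswith(filename + "."))
```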
Filesystems
- Filesystem dump: A copy of the whole filesystem at block level can be made. This is also known as a "raw partition backup" and is related to disk imaging. The process usually involves unmounting the filesystem and running a program like dd (Unix).[47] (A minimal sketch of such a chunked raw copy follows this list.) Because the disk is read sequentially and with large buffers, this type of backup can be faster than reading every file normally, especially when the filesystem contains many small files, is highly fragmented, or is nearly full. But because this method also reads the free disk blocks that contain no useful data, this method can also be slower than conventional reading, especially when the filesystem is nearly empty. Some filesystems, such as XFS, provide a "dump" utility that reads the disk sequentially for high performance while skipping unused sections. The corresponding restore utility can selectively restore individual files or the entire volume at the operator's choice.[48]
- Identification of changes: Some filesystems have an archive bit for each file that says it was recently changed. Some backup software looks at the date of the file and compares it with the last backup to determine whether the file was changed.
- Versioning file system: A versioning filesystem tracks all changes to a file. The NILFS versioning filesystem for Linux is an example.[49]
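A chunked raw copy in the spirit of dd, as mentioned in the filesystem-dump item above, might look like the following Python sketch (the device path is a hypothetical example; reading a block device normally requires administrative rights, and the filesystem should be unmounted or snapshotted first so the image is consistent):

```python
def raw_image_copy(device_path, image_path, chunk_size=4 * 1024 * 1024):
    """Copy a device or partition to an image file, chunk by chunk."""
    copied = 0
    with open(device_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)   # large sequential reads, as with dd
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
    return copied  # total bytes written, including unused blocks

# Hypothetical usage (device names vary by system):
# raw_image_copy("/dev/sdb1", "/backups/sdb1.img")
```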
Live data
Files that are actively being updated present a challenge to back up. One way to back up live data is to temporarily quiesce them (e.g., close all files), take a "snapshot", and then resume live operations. At this point the snapshot can be backed up through normal methods.[50] A snapshot is an instantaneous function of some filesystems that presents a copy of the filesystem as if it were frozen at a specific point in time, often by a copy-on-write mechanism. Copying a file while it is being changed, without such precautions, results in a corrupted file that is unusable. This is also the case across interrelated files, as may be found in a conventional database or in applications such as Microsoft Exchange Server.[16] The term fuzzy backup can be used to describe a backup of live data that looks like it ran correctly, but does not represent the state of the data at a single point in time.[51]
Backup options for data files that cannot be or are not quiesced include:[52]
- Open file backup: Many backup software applications undertake to back up open files in an internally consistent state.[53] Some applications simply check whether open files are in use and try again later.[50] Other applications exclude open files that are updated very frequently.[54] Some low-availability interactive applications can be backed up via natural/induced pausing.
- Interrelated database files backup: Some interrelated database file systems offer a means to generate a "hot backup"[55] of the database while it is online and usable. This may include a snapshot of the data files plus a snapshotted log of changes made while the backup is running. Upon a restore, the changes in the log files are applied to bring the copy of the database up to the point in time at which the initial backup ended.[56] Other low-availability interactive applications can be backed up via coordinated snapshots. However, genuinely high-availability interactive applications can only be backed up via Continuous Data Protection.
Metadata
Not all information stored on the computer is stored in files. Accurately recovering a complete system from scratch requires keeping track of this non-file data too.[57]
- System description: System specifications are needed to procure an exact replacement after a disaster.
- Boot sector: The boot sector can sometimes be recreated more easily than saving it. It usually isn't a normal file and the system won't boot without it.
- Partition layout: The layout of the original disk, as well as partition tables and filesystem settings, is needed to properly recreate the original system.
- File metadata: Each file's permissions, owner, group, ACLs, and any other metadata need to be backed up for a restore to properly recreate the original environment (a minimal sketch of capturing such metadata follows this list).
- System metadata: Different operating systems have different ways of storing configuration information. Microsoft Windows keeps a registry of system information that is more difficult to restore than a typical file.
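The file-metadata item above can be illustrated with a minimal Python sketch that records basic ownership, permission, and timestamp information via os.stat; ACLs and other platform-specific attributes require separate, platform-dependent APIs and are omitted here:

```python
import os
import stat

def file_metadata(path):
    """Capture basic metadata needed to recreate a file's environment."""
    st = os.stat(path, follow_symlinks=False)
    return {
        "path": path,
        "mode": stat.filemode(st.st_mode),  # e.g. '-rw-r--r--'
        "uid": st.st_uid,                   # owner (numeric id)
        "gid": st.st_gid,                   # group (numeric id)
        "size": st.st_size,
        "mtime": st.st_mtime,               # last modification time
    }

def tree_metadata(directory):
    """Collect metadata for every file under directory."""
    records = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            records.append(file_metadata(os.path.join(root, name)))
    return records
```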
Manipulation of data and dataset optimization
It is frequently useful or required to manipulate the data being backed up to optimize the backup process. These manipulations can improve backup speed, restore speed, data security, and media usage, and/or reduce bandwidth requirements.
Automated data grooming
Out-of-date data can be automatically deleted, but for personal backup applications—as opposed to enterprise client-server backup applications where automated data "grooming" can be customized—the deletion[note 2][58][59] can at most[60] be globally delayed or be disabled.[61]
Compression
Various schemes can be employed to shrink the size of the source data to be stored so that it uses less storage space. Compression is frequently a built-in feature of tape drive hardware.[62]
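Compression can also be applied in software when writing an archive file. A minimal Python sketch using the standard tarfile module with gzip compression (the paths shown are hypothetical):

```python
import tarfile

def compressed_archive(source_dir, archive_path):
    """Write source_dir into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=".")  # store paths relative to the source

# Hypothetical usage:
# compressed_archive("/home/user/documents", "/backups/documents.tar.gz")
```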
Deduplication
When multiple similarly configured workstations are backed up, redundant data can be detected so that only one copy is stored. This technique can be applied at the file or raw block level. This potentially large reduction[62] is called deduplication. It can occur on a server before any data moves to backup media, sometimes referred to as source/client side deduplication. This approach also reduces bandwidth required to send backup data to its target media. The process can also occur at the target storage device, sometimes referred to as inline or back-end deduplication.
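A toy illustration of block-level deduplication in Python (fixed-size blocks and an in-memory index; production systems typically use variable-size chunking and persistent indexes):

```python
import hashlib

def dedup_store(data, block_size=4096, store=None):
    """Split data into fixed-size blocks and keep only one copy of each.

    Returns (recipe, store): the recipe lists block hashes in order, and
    the store maps each hash to the block contents it was seen with.
    """
    store = {} if store is None else store
    recipe = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # duplicate blocks are stored once
        recipe.append(digest)
    return recipe, store

def rebuild(recipe, store):
    """Reassemble the original data from its recipe."""
    return b"".join(store[d] for d in recipe)

# Two similar "workstation images" share most blocks in the store.
image_a = b"common config " * 1000 + b"host A"
image_b = b"common config " * 1000 + b"host B"
recipe_a, store = dedup_store(image_a)
recipe_b, store = dedup_store(image_b, store=store)
assert rebuild(recipe_a, store) == image_a
assert rebuild(recipe_b, store) == image_b
```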
Duplication
Sometimes backups are duplicated to a second set of storage media. This can be done to rearrange the archive files to optimize restore speed, or to have a second copy at a different location or on a different storage medium—as in the disk-to-disk-to-tape capability of Enterprise client-server backup.
Encryption
High-capacity removable storage media such as backup tapes present a data security risk if they are lost or stolen.[63] Encrypting the data on these media can mitigate this problem; however, encryption is a CPU-intensive process that can slow down backup speeds, and the security of the encrypted backups is only as effective as the security of the key management policy.[62]
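A minimal sketch of encrypting an archive before it leaves the site, assuming the third-party cryptography package is installed; as noted above, the scheme is only as secure as the handling of the key:

```python
from cryptography.fernet import Fernet  # assumes: pip install cryptography

def encrypt_file(plain_path, encrypted_path, key):
    """Encrypt a backup archive with a symmetric key."""
    with open(plain_path, "rb") as f:
        token = Fernet(key).encrypt(f.read())  # CPU-intensive for large files
    with open(encrypted_path, "wb") as f:
        f.write(token)

def decrypt_file(encrypted_path, plain_path, key):
    """Recover the archive; fails if the key or the ciphertext is wrong."""
    with open(encrypted_path, "rb") as f:
        data = Fernet(key).decrypt(f.read())
    with open(plain_path, "wb") as f:
        f.write(data)

# The key itself must be stored and backed up securely, e.g. in a key manager.
key = Fernet.generate_key()
```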
Multiplexing
When there are many more computers to be backed up than there are destination storage devices, the ability to use a single storage device with several simultaneous backups can be useful.[64] However, such "multiplexed backup", which crams several backup streams into the scheduled backup window, is only used for tape destinations.[64]
Refactoring
The process of rearranging the sets of backups in an archive file is known as refactoring. For example, if a backup system uses a single tape each day to store the incremental backups for all the protected computers, restoring one of the computers could require many tapes. Refactoring could be used to consolidate all the backups for a single computer onto a single tape, creating a "synthetic full backup". This is especially useful for backup systems that do "incrementals forever" style backups.
Staging
Sometimes backups are copied to a staging disk before being copied to tape.[64] This process is sometimes referred to as D2D2T, an acronym for Disk-to-disk-to-tape. It can be useful if there is a problem matching the speed of the final destination device with the source device, as is frequently faced in network-based backup systems. It can also serve as a centralized location for applying other data manipulation techniques.
Objectives
- Recovery point objective (RPO): The point in time that the restarted infrastructure will reflect, expressed as "the maximum targeted period in which data (transactions) might be lost from an IT service due to a major incident". Essentially, this is the roll-back that will be experienced as a result of the recovery. The most desirable RPO would be the point just prior to the data loss event. Making a more recent recovery point achievable requires increasing the frequency of synchronization between the source data and the backup repository.[65]
- Recovery time objective (RTO): The amount of time elapsed between disaster and restoration of business functions.[66]
- Data security: In addition to preserving access to data for its owners, data must be restricted from unauthorized access. Backups must be performed in a manner that does not compromise the original owner's undertaking. This can be achieved with data encryption and proper media handling policies.[67]
- Data retention period: Regulations and policy can lead to situations where backups are expected to be retained for a particular period, but not any further. Retaining backups after this period can lead to unwanted liability and sub-optimal use of storage media.[67]
- Checksum or hash function validation: Applications that back up to tape archive files need this option to verify that the data was accurately copied.[68]
- Backup process monitoring: Enterprise client-server backup applications need a user interface that allows administrators to monitor the backup process, and proves compliance to regulatory bodies outside the organization; for example, an insurance company in the USA might be required under HIPAA to demonstrate that its client data meet records retention requirements.[69]
- User-initiated backups and restores: To avoid or recover from minor disasters, such as inadvertently deleting or overwriting the "good" versions of one or more files, the computer user—rather than an administrator—may initiate backups and restores (from not necessarily the most-recent backup) of files or folders.
See also
About backup
- Backup software and services
- Glossary of backup terms
- Virtual backup appliance
Related topics
Notes
References
- ^ "back•up". The American Heritage Dictionary of the English Language. Houghton Mifflin Harcourt. 2018. Retrieved 9 May 2018.
- ^ S. Nelson (2011). "Chapter 1: Introduction to Backup and Recovery". Pro Data Backup and Recovery. Apress. pp. 1–16. ISBN 978-1-4302-2663-5. Retrieved 8 May 2018.
- ^ Cougias, D.J.; Heiberger, E.L.; Koop, K. (2003). "Chapter 1: What's a Disaster Without a Recovery?". The Backup Book: Disaster Recovery from Desktop to Data Center. Network Frontiers. pp. 1–14. ISBN 0-9729039-0-9.
- ^ a b c Joe Kissell (2007). Take Control of Mac OS X Backups (PDF) (Version 2.0 ed.). Ithaca, NY: TidBITS Electronic Publishing. pp. 18–20 ("The Archive", meaning information repository, including versioning), 24 (client-server), 82–83 (archive file), 112–114 (Off-site storage backup rotation scheme), 126–141 (old Retrospect terminology and GUI—still used in Windows variant), 165 (client-server), 128 (subvolume—later renamed Favorite Folder in Macintosh variant). ISBN 978-0-9759503-0-2. Archived from the original (PDF) on 1 December 2020. Retrieved 17 May 2019.
- ^ ".ORG Sponsorship Proposal - Technical Plan - Physical Security". ICANN.
Ensuring the complete destruction of the primary site will not result in the loss of the recovery site by locating them physically far away.
- ^ Terry Sullivan (11 January 2018). "A Beginner's Guide to Backing Up Photos". The New York Times.
a hard drive ... an established company ... declared bankruptcy ... where many ... had ...
- ^ McMahon, Mary (1 April 2019). "What Is an Information Repository?". wiseGEEK. Conjecture Corporation. Retrieved 8 May 2019.
In the sense of an approach to data management, an information repository is a secondary storage space for data.
- ^ Jeph Preece. "Online Data Backup Reviews: Why Use an Online Data Backup Service?". Top Ten Reviews. 2016.
- ^ Kyle Chin. "How to Back Up Your Data: 6 Effective Strategies to Prevent Data Loss". 2024.
- ^ "How do you backup your computer? Windows and Mac drive backup". 2023.
- ^ Scott Gilbertson. "How to Back Up Your Digital Life". 2024.
- ^ "Five key questions to ask about your backup solution". sysgen.ca. 23 March 2014. Does your company have a low tolerance to longer "data access outages" and/or would you like to minimize the time your company may be without its data?. Archived from the original on 4 March 2016. Retrieved 23 September 2015.
- ^ "Incremental Backup". Tech-FAQ. Independent Media. 13 June 2005. Archived from the original on 21 June 2016. Retrieved 10 March 2006.
- ^ Pond, James (31 August 2013). "How Time Machine Works its Magic". Apple OSX and Time Machine Tips. baligu.com. File System Event Store, Hard Links. Archived from the original on 21 June 2019. Retrieved 19 May 2019.
- ^ a b Behzad Behtash (6 May 2010). "Why Continuous Data Protection's Getting More Practical". Disaster recovery/business continuity. InformationWeek. Retrieved 12 November 2011.
A true CDP approach should capture all data writes, thus continuously backing up data and eliminating backup windows.... CDP is the gold standard—the most comprehensive and advanced data protection. But "near CDP" technologies can deliver enough protection for many companies with less complexity and cost. For example, snapshots can provide a reasonable near-CDP-level of protection for file shares, letting users directly access data on the file share at regular intervals--say, every half-hour or 15 minutes. That's certainly a higher level of protection than tape-based or disk-based nightly backups and may be all you need.
- ^ a b "Continuous data protection (CDP) explained: True CDP vs near-CDP". ComputerWeekly.com. TechTarget. July 2010. Retrieved 22 June 2019.
... copies data from a source to a target. True CDP does this every time a change is made, while so-called near-CDP does this at pre-set time intervals. Near-CDP is effectively the same as snapshotting....True CDP systems record every write and copy them to the target where all changes are stored in a log. [new paragraph] By contrast, near-CDP/snapshot systems copy files in a straightforward manner but require applications to be quiesced and made ready for backup, either via the application's backup mode or using, for example, Microsoft's Volume Shadow Copy Services (VSS).
- ^ Pond, James (31 August 2013). "How Time Machine Works its Magic". Apple OSX and Time Machine Tips. Baligu.com (as mirrored after James Pond died in 2013). Archived from the original on 21 June 2019. Retrieved 10 July 2019.
The File System Event Store is a hidden log that OSX keeps on each HFS+ formatted disk/partition of changes made to the data on it. It doesn't list every file that's changed, but each directory (folder) that's had anything changed inside it.
- ^ de Guise, P. (2009). Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. CRC Press. pp. 285–287. ISBN 978-1-4200-7639-4.
- ^ Wu, Victor (4 March 2017). "EMC RecoverPoint for Virtual Machine Overview". Victor Virtual. WuChiKin. Retrieved 22 June 2019.
The splitter splits out the Write IOs to the VMDK/RDM of a VM and sends a copy to the production VMDK and also to the RecoverPoint for VMs cluster.
- ^ "Zerto or Veeam?". RES-Q Services. March 2017. Retrieved 7 July 2019.
Zerto doesn't use snapshot technology like Veeam. Instead, Zerto deploys small virtual machines on its physical hosts. These Zerto VMs capture the data as it is written to the host and then send a copy of that data to the replication site.....However, Veeam has the advantage of being able to more efficiently capture and store data for long-term retention needs. There is also a significant pricing difference, with Veeam being cheaper than Zerto.
- ^ "Agent Related". CloudEndure.com. 2019. What does the CloudEndure Agent do?. Retrieved 3 July 2019.
The CloudEndure Agent performs an initial block-level read of the content of any volume attached to the server and replicates it to the Replication Server. The Agent then acts as an OS-level read filter to capture writes and synchronizes any block level modifications to the CloudEndure Replication Server, ensuring near-zero RPO.
- ^ Gardner, Steve (9 December 2004). "Disk to Disk Backup versus Tape – War or Truce?". Engenio. Peaceful coexistence. Archived from the original on 7 February 2005. Retrieved 26 May 2019.
- ^ a b c "Digital Data Storage Outlook 2017" (PDF). Spectra. Spectra Logic. 2017. p. 7(Solid-State), 10(Magnetic Disk), 14(Tape), 17(Optical). Archived from the original (PDF) on 7 May 2018. Retrieved 11 July 2018.
- ^ a b c Tom Coughlin (29 June 2014). "Keeping Data for a Long Time". Forbes. para. Magnetic Tapes(popular formats, storage life), para. Hard Disk Drives(active archive), para. First consider flash memory in archiving(... may not have good media archive life). Retrieved 19 April 2018.
- ^ a b c d Jacobi, John L. (29 February 2016). "Hard-core data preservation: The best media and methods for archiving your data". PC World. sec. External Hard Drives(on the shelf, magnetic properties, mechanical stresses, vulnerable to shocks), Tape, Online storage. Retrieved 19 April 2018.
- ^ "Ramp Load/Unload Technology in Hard Disk Drives" (PDF). HGST. Western Digital. November 2007. p. 3(sec. Enhanced Shock Tolerance). Retrieved 29 June 2018.
- ^ "Toshiba Portable Hard Drive (Canvio® 3.0)". Toshiba Data Dynamics Singapore. Toshiba Data Dynamics Pte Ltd. 2018. sec. Overview(Internal shock sensor and ramp loading technology). Archived from the original on 16 June 2018. Retrieved 16 June 2018.
- ^ a b "Iomega Drop Guard ™ Technology" (PDF). Hard Drive Storage Solutions. Iomega Corp. 20 September 2010. pp. 2(What is Drop Shock Technology?, What is Drop Guard Technology? (... features special internal cushioning .... 40% above the industry average)), 3(*NOTE). Retrieved 12 July 2018.
- ^ a b John Burek (15 May 2018). "The Best Rugged Hard Drives and SSDs". PC Magazine. Ziff Davis. What Exactly Makes a Drive Rugged?(When a drive is encased ... you're mostly at the mercy of the drive vendor to tell you the rated maximum drop distance for the drive). Retrieved 4 August 2018.
- ^ Justin Krajeski; Kimber Streams (20 March 2017). "The Best Portable Hard Drive". The New York Times. Archived from the original on 31 March 2017. Retrieved 4 August 2018.
- ^ "Best Long-Term Data Archive Solutions". Iron Mountain. Iron Mountain Inc. 2018. sec. More Reliable(average mean time between failure ... rates, best practice for migrating data). Retrieved 19 April 2018.
- ^ Kissell, Joe (2011). Take Control of Backing Up Your Mac. Ithaca NY: TidBITS Publishing Inc. p. 41(Deduplication). ISBN 978-1-61542-394-1. Retrieved 17 September 2019.
- ^ "Symantec Shows Backup Exec a Little Dedupe Love; Lays out Source Side Deduplication Roadmap – DCIG". DCIG. 7 July 2009. Archived from the original on 4 March 2016. Retrieved 26 February 2016.
- ^ "Veritas NetBackup™ Deduplication Guide". Veritas. Veritas Technologies LLC. 2016. Retrieved 26 July 2018.
- ^ S. Wan; Q. Cao; C. Xie (2014). "Optical storage: An emerging option in long-term digital preservation". Frontiers of Optoelectronics. 7 (4): 486–492. doi:10.1007/s12200-014-0442-2. S2CID 60816607.
- ^ Q. Zhang; Z. Xia; Y.-B. Cheng; M. Gu (2018). "High-capacity optical long data memory based on enhanced Young's modulus in nanoplasmonic hybrid glass composites". Nature Communications. 9 (1): 1183. Bibcode:2018NatCo...9.1183Z. doi:10.1038/s41467-018-03589-y. PMC 5864957. PMID 29568055.
- ^ Bärwaldt, Erik (2014). "Full Control » Linux Magazine". Linux Magazine.
- ^ "5. Conditions That Affect CDs and DVDs • CLIR". CLIR.
- ^ Gérard Poirier; Foued Berahou (3 March 2008). "Journal de 20 Heures". Institut national de l'audiovisuel. approximately minute 30 of the TV news broadcast. Retrieved 3 March 2008.
- ^ "Archival Gold CD-R "300 Year Disc" Binder of 10 Discs with Scratch Armor Surface". Delkin Devices. Delkin Devices Inc. Archived from the original on 27 September 2013.
- ^ "Optical Disc Archive Generation 2" (PDF). Optical Disc Archive. Sony. April 2016. p. 12(World’s First 8-Channel Optical Drive Unit). Retrieved 15 August 2019.
- ^ R. Micheloni; P. Olivo (2017). "Solid-State Drives (SSDs)". Proceedings of the IEEE. 105 (9): 1586–88. doi:10.1109/JPROC.2017.2727228.
- ^ "Remote Backup". EMC Glossary. Dell, Inc. Retrieved 8 May 2018.
Effective remote backup requires that production data be regularly backed up to a location far enough away from the primary location so that both locations would not be affected by the same disruptive event.
- ^ Stackpole, B.; Hanrion, P. (2007). Software Deployment, Updating, and Patching. CRC Press. pp. 164–165. ISBN 978-1-4200-1329-0. Retrieved 8 May 2018.
- ^ Gnanasundaram, S.; Shrivastava, A., eds. (2012). Information Storage and Management: Storing, Managing, and Protecting Digital Information in Classic, Virtualized, and Cloud Environments. John Wiley and Sons. p. 255. ISBN 978-1-118-23696-3. Retrieved 8 May 2018.
- ^ Lee (25 January 2017). "What to backup – a critical look at your data". Irontree Blog. Irontree Internet Services CC. Retrieved 8 May 2018.
- ^ Preston, W.C. (2007). Backup & Recovery: Inexpensive Backup Solutions for Open Systems. O'Reilly Media, Inc. pp. 111–114. ISBN 978-0-596-55504-7. Retrieved 8 May 2018.
- ^ Preston, W.C. (1999). Unix Backup & Recovery. O'Reilly Media, Inc. pp. 73–91. ISBN 978-1-56592-642-4. Retrieved 8 May 2018.
- ^ "NILFS Home". NILFS Continuous Snapshotting System. NILFS Community. 2019. Retrieved 22 August 2019.
- ^ a b Cougias, D.J.; Heiberger, E.L.; Koop, K. (2003). "Chapter 11: Open file backup for databases". The Backup Book: Disaster Recovery from Desktop to Data Center. Network Frontiers. pp. 356–360. ISBN 0-9729039-0-9.
- ^ Liotine, M. (2003). Mission-critical Network Planning. Artech House. p. 244. ISBN 978-1-58053-559-5. Retrieved 8 May 2018.
- ^ de Guise, P. (2009). Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. CRC Press. pp. 50–54. ISBN 978-1-4200-7639-4.
- ^ "Open File Backup Software for Windows". Handy Backup. Novosoft LLC. 8 November 2018. Retrieved 29 November 2018.
- ^ Reitshamer, Stefan (5 July 2017). "Troubleshooting backing up open/locked files on Windows". Arq Blog. Haystack Software. Stefan Reitshamer is the principal developer of Arq. Retrieved 29 November 2018.
- ^ Boss, Nina (10 December 1997). "Oracle Tips Session #3: Oracle Backups". www.wisc.edu. University of Wisconsin. Archived from the original on 2 March 2007. Retrieved 1 December 2018.
- ^ "What is ARCHIVE-LOG and NO-ARCHIVE-LOG mode in Oracle and the advantages & disadvantages of these modes?". Arcserve Backup. Arcserve. 27 September 2018. Retrieved 29 November 2018.
- ^ Grešovnik, Igor (April 2016). "Preparation of Bootable Media and Images". Archived from the original on 25 April 2016. Retrieved 21 April 2016.
- ^ Tridgell, Andrew; Mackerras, Paul; Davison, Wayne. "rsync(1) - Linux man page". linux.die.net.
- ^ "Archive maintenance". Code42 Support. 2023.
- ^ Pond, James (2 June 2012). "12. Should I delete old backups? If so, How?". Time Machine. baligu.com. Green box, Gray box. Archived from the original on 27 October 2019. Retrieved 21 June 2019.
- ^ Kissell, Joe (12 March 2019). "The Best Online Cloud Backup Service". wirecutter. The New York Times. Next, there’s file retention. Retrieved 21 June 2019.
- ^ a b c D. Cherry (2015). Securing SQL Server: Protecting Your Database from Attackers. Syngress. pp. 306–308. ISBN 978-0-12-801375-5. Retrieved 8 May 2018.
- ^ Backups tapes a backdoor for identity thieves Archived 5 April 2016 at the Wayback Machine (28 April 2004). Retrieved 10 March 2007
- ^ a b c Preston, W.C. (2007). Backup & Recovery: Inexpensive Backup Solutions for Open Systems. O'Reilly Media, Inc. pp. 219–220. ISBN 978-0-596-55504-7. Retrieved 8 May 2018.
- ^ "Recovery Point Objective (Definition)". ARL Risky Thinking. Albion Research Ltd. 2007. Retrieved 4 August 2019.
- ^ "Recovery Time Objective (Definition)". ARL Risky Thinking. Albion Research Ltd. 2007. Retrieved 4 August 2019.
- ^ a b Little, D.B. (2003). "Chapter 2: Business Requirements of Backup Systems". Implementing Backup and Recovery: The Readiness Guide for the Enterprise. John Wiley and Sons. pp. 17–30. ISBN 978-0-471-48081-5. Retrieved 8 May 2018.
- ^ "How do the "verify" and "write checksums to media" processes work and why are they necessary?". Veritas Support. Veritas.com. 15 October 2015. Write checksums to media. Retrieved 16 September 2019.
- ^ HIPAA Advisory Archived 11 April 2007 at the Wayback Machine. Retrieved 10 March 2007
Fundamentals
Definition and Purpose
Backup refers to the process of creating copies of computer data stored in a separate location from the originals, enabling restoration in the event of data loss, corruption, or disaster.[2][6] This practice ensures that critical information remains accessible and recoverable, forming a foundational element of data protection strategies. Key concepts include redundancy, which involves maintaining multiple identical copies of data to mitigate single points of failure, and point-in-time recovery, allowing restoration to a specific moment before an incident occurred.[7][8] Backups integrate into the broader data lifecycle—encompassing creation, usage, archival, and deletion—by preserving data integrity and availability throughout these phases.[9] The primary purposes of backups are to support disaster recovery, ensuring systems and data can be restored after events like hardware failures or natural disasters; to facilitate business continuity by minimizing operational downtime; and to meet regulatory compliance requirements for data retention and auditability.[10][11][12] They also protect against human errors, such as accidental deletions, and cyber threats including ransomware and cyberattacks, which can encrypt or destroy data.[13][14] Historically, data backups emerged in the 1950s with the advent of mainframe computers, initially relying on punch cards for data storage and processing before transitioning to magnetic tape systems like the IBM 726 introduced in 1952, which offered higher capacity and reliability.[15][16] In 2025, amid explosive data growth driven by artificial intelligence, Internet of Things devices, and cloud computing, global data volume is estimated at 181 zettabytes, heightening the need for robust backup mechanisms to manage this scale and prevent irrecoverable losses.[17]
Historical Development
The earliest forms of data backup in computing emerged in the 1940s and 1950s alongside vacuum tube-based systems, where punch cards and paper tape served as primary storage and archival media.[18] By the 1930s, IBM was already processing up to 10 million punch cards daily for data handling, a practice that persisted into the 1960s and 1970s for batch processing and rudimentary backups in mainframe environments.[19] Magnetic tape, patented in 1928 but widely adopted by IBM in the 1950s, revolutionized backup by enabling faster sequential data access and greater capacity compared to paper-based methods, often inspired by adaptations from audio recording technologies like those in vacuum cleaners.[20] These tapes became standard for archiving in the 1960s and 1970s, supporting the growing needs of early enterprise computing. In the 1970s and 1980s, backup practices advanced with the proliferation of minicomputers and the introduction of cartridge-based magnetic tape systems, such as IBM's 3480 format launched in 1984, which offered compact, high-density storage for mainframes and improved reliability over reel-to-reel tapes.[16] The rise of personal computers and Unix systems in the late 1970s spurred software innovations; for instance, the Unix 'dump' utility appeared in Version 6 Unix around 1975 for filesystem-level backups, while 'tar' (tape archive) was introduced in Seventh Edition Unix in 1979 to bundle files for tape storage.[21] By the 1980s and 1990s, hard disk drives became affordable for backups, shifting from tape-only workflows, and RAID (Redundant Array of Independent Disks) was conceptualized in 1987 by researchers at the University of California, Berkeley, providing fault-tolerant disk arrays that enhanced data protection through redundancy.[22] Incremental backups, which capture only changes since the prior backup to reduce storage and time, gained traction during this era, with early implementations in Unix tools and a key patent for optimized incremental techniques filed in 1989.[23] The 2000s marked a transition to disk-to-disk backups, driven by falling hard drive costs and the need for faster recovery; by the early decade, disk replaced tape as the preferred primary backup medium for many enterprises, enabling near-line storage for quicker access.[24] Virtualization further transformed backups, with VMware's ESX Server released in 2001 introducing bare-metal hypervisors that supported VM snapshots for point-in-time recovery without full system shutdowns.[25] Cloud storage emerged as a milestone with Amazon S3's launch in 2006, offering scalable, offsite object storage that began integrating with backup workflows for remote replication.[26] Data deduplication, which eliminates redundant data blocks to optimize storage, saw significant adoption starting around 2005, with Permabit Technology Corporation pioneering inline deduplication solutions for virtual tape libraries to address exploding data volumes.[27] From the 2010s onward, backups evolved to handle big data and hybrid cloud environments, incorporating features like automated orchestration across on-premises and cloud tiers for resilience against outages.[15] The 2017 WannaCry ransomware attack, which encrypted data on over 200,000 systems worldwide, underscored vulnerabilities in traditional backups, prompting a surge in cyber-resilient strategies such as air-gapped and immutable storage to prevent tampering.[28] In the 2020s, ransomware incidents escalated, with disclosed attacks rising 34% from 2020 to 2022, 
continuing through 2024 when 59% of organizations were affected, and into 2025.[29][30] This has driven adoption of immutable backups that lock data versions against modification for a defined period. Trends now emphasize AI-optimized backups for predictive anomaly detection and zero-trust models integrated into storage, as highlighted in Gartner's 2025 Hype Cycle for Storage Technologies, which positions cyberstorage and AI-driven data management as maturing innovations for enhanced security and efficiency.[31][32]
Backup Strategies and Rules
The 3-2-1 Backup Rule
The 3-2-1 backup rule serves as a foundational best practice for data redundancy and recoverability, recommending the maintenance of three total copies of critical data: the original production copy plus two backups. These copies must reside on two distinct types of storage media to guard against media-specific failures, such as disk crashes or tape degradation, while ensuring at least one copy is stored offsite or disconnected from the primary network to mitigate risks from physical disasters, theft, or localized cyberattacks.[33][34][35] In light of escalating cyber threats, particularly ransomware that targets mutable backups, the rule has evolved by 2025 into the 3-2-1-1-0 framework. This extension incorporates an additional immutable or air-gapped copy—isolated via physical disconnection or unalterable storage policies—to prevent encryption or deletion by malware, alongside a mandate for zero recovery errors achieved through routine verification testing. Air-gapped solutions, such as offline tapes, or cloud-based isolated repositories enhance resilience by breaking the attack chain, ensuring clean restores even in sophisticated breach scenarios.[33][36][37] This strategy offers a balanced approach to data protection, optimizing costs through minimal redundancy while preserving accessibility for rapid recovery and providing robust safeguards against diverse failure modes. For instance, a typical implementation might involve the original data on a local server disk, a backup on external hard drives or NAS, and an offsite copy in cloud storage, thereby distributing risk across hardware types and locations without requiring excessive resources.[38][39] Implementing the 3-2-1 rule begins with evaluating data criticality to focus efforts on high-value assets, such as business records or application databases, using tools like risk assessments to classify information. Next, choose media diversity based on factors like capacity, speed, and compatibility—ensuring no single failure mode affects all copies—while automating backups via software that supports multiple destinations. Finally, establish offsite storage through geographic separation, such as remote data centers or compliant cloud providers, to confirm isolation from primary site vulnerabilities.[37][39][35] According to the 2025 State of Backup and Recovery Report, variants of the 3-2-1 rule are increasingly adopted amid rising threats, with only 50% of organizations currently aligning actual recovery times with their RTO targets, underscoring the rule's role in enhancing overall resilience.[40]
Rotation and Retention Policies
Rotation schemes define the systematic cycling of backup media or storage to ensure regular data protection while minimizing resource use. One widely adopted approach is the Grandfather-Father-Son (GFS) model, which organizes backups into hierarchical cycles: daily incremental backups (sons) capture changes from the previous day, weekly full backups (fathers) provide a comprehensive snapshot at the end of each week, and monthly full backups (grandfathers) serve as long-term anchors retained for extended periods, such as 12 months.[41][42] This scheme balances short-term recovery needs with archival efficiency by rotating media sets, typically using separate tapes or disks for each level to avoid overwrites.[43] Another rotation strategy is the Tower of Hanoi scheme, inspired by the mathematical puzzle, which optimizes incremental chaining for extended retention with limited media. In this method, backups occur on a recursive schedule—every other day on the first media set, every fourth day on the second, every eighth on the third, and so on—allowing up to 2^n - 1 days of coverage with n media sets while ensuring each backup depends only on the prior full or relevant incremental for restoration.[44][45] This approach reduces media wear on frequently used sets and supports efficient space utilization in environments with high daily change rates.[46] Retention policies govern how long backups are kept before deletion or archiving, primarily driven by regulatory compliance to prevent data loss and support audits. For instance, under the General Data Protection Regulation (GDPR) in the European Union, organizations must retain personal data only as long as necessary for the specified purpose, with retention periods determined by the data's purpose and applicable sector-specific or national laws (e.g., 5-10 years for certain financial records under related regulations).[47][48] Similarly, the Health Insurance Portability and Accountability Act (HIPAA) in the United States mandates retention of protected health information documentation for at least six years from creation or the last effective date.[49] To enforce immutability during these periods, Write Once Read Many (WORM) storage is employed, where data can be written once but not altered or deleted until the retention term expires, safeguarding against ransomware or accidental overwrites.[50][51] Several factors influence the design of rotation and retention policies, including the assessed value of the data, potential legal holds that extend retention beyond standard periods, and the ongoing costs of storage infrastructure. 
High-value data, such as intellectual property, may warrant longer retention to mitigate recovery risks, while legal holds—triggered by litigation or investigations—can indefinitely pause deletions.[52] Storage costs further constrain policies, as prolonged retention increases expenses for cloud or on-premises media, prompting tiered approaches like moving older backups to cheaper archival tiers.[53] In 2025, emerging trends leverage AI-driven dynamic retention, where machine learning algorithms automatically adjust policies based on real-time threat detection and data usage patterns to optimize protection without excessive storage bloat.[54][55] A common example of rotation implementation is a weekly full backup combined with daily incrementals, where full backups occur every Friday to reset the chain, and incrementals run Monday through Thursday, retaining the prior week's full for quick point-in-time recovery.[56] To estimate storage needs under such a policy, organizations use formulas like Total space = (Full backup size × Number of full backups retained) + (Average incremental size × Number of days retained), accounting for deduplication ratios that can reduce effective usage by 50-90% depending on data redundancy.[57][58] Challenges in these policies arise from balancing extended retention with deduplication technologies, as long-term archives often cannot share metadata across active and retention tiers, potentially doubling storage demands and complicating space reclamation when deleting expired backups.[59] This tension requires careful configuration to avoid compliance failures or unexpected cost overruns, especially in deduplicated environments where inter-backup dependencies limit aggressive pruning.[60]
Data Selection and Extraction
Targeting Files and Applications
Selecting files and applications for backup involves evaluating their criticality to business operations or personal use, such as user-generated documents, configuration files, and databases that cannot be easily recreated, while excluding transient data like temporary files to optimize storage and performance.[61] Critical items are prioritized based on potential impact from loss, with user files in home directories often targeted first due to their unique value, whereas system and application binaries are typically omitted as they can be reinstalled from original sources.[61] Exclusion patterns, such as *.tmp or *.log, are applied to skip junk or ephemeral files, reducing backup size without compromising recoverability.[62]
At the file level, backups offer granularity by targeting individual files, specific directories, or patterns, allowing for efficient synchronization of only changed or selected items. Tools like rsync enable this selective approach through options such as --include for specific paths (e.g., --include='docs/*.pdf') and --exclude for unwanted elements (e.g., --exclude='temp/'), facilitating incremental transfers over local or remote destinations while preserving permissions and timestamps.[62] This method supports directories as units for broader coverage, such as syncing an entire /home/user/projects/ folder, but allows fine-tuning to avoid unnecessary data.[63]
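A pure-Python sketch of pattern-based selection (illustrative only; tools such as rsync implement the same idea with their own include/exclude rules and add incremental transfer):

```python
import fnmatch
import os
import shutil

EXCLUDE = ["*.tmp", "*.log", "temp/*"]   # ephemeral data to skip

def selective_backup(source_dir, dest_dir, exclude=EXCLUDE):
    """Copy source_dir to dest_dir, skipping files that match exclude patterns."""
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, source_dir)
            if any(fnmatch.fnmatch(rel, pat) for pat in exclude):
                continue  # junk or transient file: not worth backing up
            dst = os.path.join(dest_dir, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)  # preserves timestamps and permissions
```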
For applications, backups are tailored to their architecture: databases like MySQL are often handled via logical dumps using mysqldump, which generates SQL scripts to recreate tables, views, and data (e.g., mysqldump --all-databases > backup.sql), ensuring consistency without halting operations when combined with transaction options like --single-transaction.[64] Email servers employing IMAP protocols can be backed up by exporting mailbox contents to standard formats like MBOX or EML using tools that connect via IMAP, preserving folder structures and attachments for archival.[65] Virtual machines (VMs) are commonly treated as single image files, capturing the entire disk state (e.g., VMDK or VHD) through host-level snapshots to enable quick restoration of the full environment.[66]
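A hot logical dump of this kind can be run as a one-line job; the output location is an assumption, and --single-transaction yields a consistent snapshot only for transactional engines such as InnoDB.

    # Consistent logical dump of all databases without locking writers (InnoDB).
    mysqldump --single-transaction --all-databases \
        | gzip > /backup/mysql-$(date +%F).sql.gz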
Challenges arise with large files exceeding 1TB, such as high-definition videos, where bandwidth constraints and incompressible data types prolong initial uploads and recovery times, often necessitating hybrid strategies like disk-to-disk seeding before cloud transfer.[67] In distributed systems, data sprawl across hybrid environments complicates visibility and consistency, as exponential growth in volume—projected to reach 181 zettabytes globally by 2025—strains backup processes and increases the risk of incomplete captures.[17] By 2025, backing up SaaS applications like Office 365 requires API-based connectors for automated extraction of Exchange, OneDrive, and Teams data, with tools configuring OAuth authentication to pull items without on-premises agents.[68]
Best practices emphasize prioritizing via Recovery Point Objective (RPO), the maximum tolerable data loss interval, targeting under 1 hour for critical applications like databases and email to minimize business disruption through frequent incremental or continuous backups.[69] This approach integrates with broader filesystem backups for comprehensive coverage, ensuring selected files and apps align with overall data protection goals.[61]
Filesystem and Volume Backups
Filesystem backups involve creating copies of entire filesystem structures, preserving the hierarchical organization of directories and files as defined by the underlying filesystem format. Common filesystems such as NTFS, used in Windows environments, employ a Master File Table (MFT) to manage metadata in a hierarchical tree, while ext4, prevalent in Linux systems, utilizes inodes and block groups to organize data within a root directory structure. These hierarchical setups enable efficient navigation and access, but backups must account for the filesystem's integrity mechanisms, including journaling, which logs pending changes to prevent corruption during power failures or crashes. Journaling in both NTFS and ext4 ensures transactional consistency by allowing recovery to a known state without full rescans.[70] Backups of filesystems can occur at the file level, which copies individual files and directories while traversing the hierarchy, or at the block level, which images raw data blocks on the storage device regardless of filesystem boundaries. File-level backups are suitable for selective preservation but may miss filesystem-specific attributes, whereas block-level approaches capture the entire structure atomically, ideal for restoring to the exact original state. Tools like rsync for file-level operations or dd for block-level raw imaging facilitate these processes on Unix-like systems. Volume backups extend filesystem backups to logical volumes, such as those managed by Logical Volume Manager (LVM) in Linux, which abstract physical storage into resizable, snapshot-capable units. LVM snapshots create point-in-time copies by redirecting writes to a separate area, allowing backups without interrupting live operations; only changed blocks are stored post-snapshot, minimizing space usage to typically 3-5% of the original volume for low-change scenarios. The dd command is commonly used for raw imaging of volumes, producing bit-for-bit replicas suitable for disaster recovery. In virtualization environments, integration with tools like Hyper-V exports enables volume-level backups of virtual machines by capturing configuration files (.VMCX), state (.VMRS), and data volumes using Volume Shadow Copy Service (VSS) or WMI-based methods for scalable, host-level operations without guest agent installation.[71][72] To ensure integrity, backups incorporate checksum verification using algorithms like MD5 or SHA-256, which generate fixed-length hashes of data blocks or files to detect alterations during transfer or storage. During the backup process, the source hash is compared against the backup's hash; mismatches indicate corruption, prompting re-backup or alerts. This method verifies completeness and unaltered state, particularly crucial for large-scale operations where bit errors can occur.[73] Challenges in filesystem and volume backups include managing mounted versus unmounted states: mounted systems risk inconsistency from concurrent writes, necessitating quiescing or snapshots, while unmounted volumes ensure atomicity but require downtime. Enterprise-scale volumes, reaching petabyte sizes, amplify issues like prolonged backup windows, bandwidth limitations, and storage scalability, often addressed through incremental block tracking or distributed systems. Virtualization adds complexity, as Hyper-V exports must handle shared virtual disks and cluster integrations without performance degradation. 
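As a minimal sketch of the snapshot-then-image approach described above, assuming a hypothetical LVM volume vg0/data with enough free space in the volume group to hold the snapshot:

    #!/bin/sh
    # Block-level image of a live LVM volume via a short-lived snapshot.
    STAMP=$(date +%F)
    lvcreate --snapshot --size 10G --name data_snap /dev/vg0/data
    dd if=/dev/vg0/data_snap bs=4M status=progress | gzip > "/backup/data-$STAMP.img.gz"
    lvremove -y /dev/vg0/data_snap
    # Record a checksum so the image can be verified later.
    sha256sum "/backup/data-$STAMP.img.gz" > "/backup/data-$STAMP.img.gz.sha256"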
Unlike selective file backups, which target specific content and may omit structural elements, filesystem and volume backups capture comprehensive attributes including file permissions, ownership (UID/GID), and empty directories to maintain the exact hierarchy and access controls upon restoration. This holistic approach ensures reproducibility of the environment, such as preserving ACLs in NTFS or POSIX permissions in ext4. Backup size estimation accounts for compression, approximated by the formula Backup size = Original size × (1 − compression ratio), where the ratio (typically 0.2-0.5 for mixed data) reflects the fractional reduction achieved for a given data pattern; for instance, text-heavy volumes achieve higher ratios than already-compressed media.[74][75]
Handling Live Data and Metadata
Backing up live data, which involves active systems with open files and dynamically changing databases, poses significant challenges due to the risk of capturing inconsistent states during the process. Open files locked by running applications may prevent complete reads, while databases like SQL Server can experience mid-transaction modifications, leading to partial or corrupted data in the backup if not addressed.[76] To mitigate these issues, operating systems provide specialized mechanisms: in Windows environments, the Volume Shadow Copy Service (VSS) enables the creation of point-in-time shadow copies by coordinating with application writers to flush buffers and ensure consistency without interrupting operations.[77] Similarly, in Linux systems, the Logical Volume Manager (LVM) supports snapshot creation, allowing a frozen view of the volume to be backed up while the original continues to serve live workloads, as commonly used for databases like SQL Server on Red Hat Enterprise Linux.[78][79]
Handling metadata alongside live data is essential for maintaining restoration fidelity, as it includes critical attributes such as timestamps, access control lists (ACLs), and extended attributes that govern file permissions, ownership, and security contexts. Failure to preserve these elements can result in restored files lacking proper access rights or audit trails, complicating recovery and potentially exposing systems to security vulnerabilities.[80] Tools designed for filesystems like XFS emphasize capturing these metadata components to ensure accurate reconstruction, particularly in environments requiring forensic recovery.[81]
Techniques for live backups prioritize minimal disruption through hot backups, which operate online by temporarily switching databases to a consistent mode without downtime, and quiescing, which pauses application I/O to synchronize data on disk.[82] In virtualized setups like VMware, quiescing leverages guest tools to freeze file systems and application states, enhancing consistency for running workloads.[83] Recent advancements in container orchestration, such as Kubernetes persistent volume snapshots, enable zero-downtime backups by leveraging CSI drivers for atomic captures, a practice increasingly adopted in 2025 for scalable cloud-native applications.[84] However, risks remain if these methods are misapplied, including data inconsistency from uncommitted SQL transactions that could crash during backup, leading to irrecoverable corruption upon restore.[76]
Best practices recommend application-aware tools to address these complexities, such as Oracle Recovery Manager (RMAN), which performs hot backups by integrating with the database to handle redo logs and ensure transactional integrity while including metadata for full fidelity.[85][86] Organizations should always verify metadata inclusion in backup configurations to support not only operational recovery but also forensic analysis, testing restores periodically to confirm consistency.[81]
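A simple way to carry ACLs and extended attributes along with the data is an archive-mode copy such as the following sketch; the paths are illustrative, and the -A and -X options only help where both filesystems actually support ACLs and xattrs.

    # Copy data while preserving ownership, ACLs (-A), extended attributes (-X),
    # and hard links (-H); --numeric-ids avoids remapping UIDs/GIDs on restore.
    rsync -aHAX --numeric-ids /srv/app/ /mnt/backup/app/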
Backup Methods
Full and System Imaging Backups
A full backup creates a complete, independent copy of all selected data, including files, folders, and system components, without relying on previous backups.[87] This approach ensures straightforward restoration, as the entire dataset can be recovered independently, eliminating dependencies on other backup sets. However, full backups are resource-intensive, requiring significant time and storage space due to the duplication of all data each time.[14] System imaging extends full backups by capturing an exact replica of entire disks or partitions, enabling bootable operating system restores and bare-metal recovery on dissimilar hardware.[89] Tools such as Clonezilla provide open-source disk cloning capabilities for this purpose, while commercial solutions like Acronis True Image support user-friendly imaging for complete system migration and recovery.[90][91]
Full backups and system imaging are commonly used to establish initial baselines for data protection and facilitate disaster recovery, where rapid restoration of an entire environment is critical.[14] In backup rotations, they are typically performed weekly to balance completeness with efficiency.[14] Technically, system imaging can operate at the block level, copying raw disk sectors for precise replication including unused space, or at the file level, which targets only allocated files but may overlook low-level structures.[92] Block-level imaging is particularly effective for handling partitions and bootloaders like GRUB, ensuring the master boot record and partition tables are preserved for bootable restores.[89] In 2025, advancements in full backups and system imaging emphasize seamless integration with hypervisors such as VMware and Hyper-V, allowing automated VM imaging for hybrid environments.[93] For a 1TB system using SSD storage, a full backup typically takes 2-4 hours, depending on hardware and network speeds.[94] Full backups often serve as the foundational baseline in incremental chains for ongoing protection.[95]
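A bare-bones block-level image of a whole disk, including its partition table and bootloader, might look like this sketch; the device name and target path are assumptions, and it should be run from a rescue environment so the source disk is not being modified during the copy.

    # Raw image of an entire disk for bare-metal recovery.
    dd if=/dev/sda bs=4M status=progress conv=sync,noerror \
        | gzip > /mnt/usb/sda-$(date +%F).img.gz
    # Restore later by reversing the pipeline:
    # gunzip -c /mnt/usb/sda-DATE.img.gz | dd of=/dev/sda bs=4M status=progress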
Incremental and Differential Backups
Incremental backups capture only the data that has changed since the most recent previous backup, whether that was a full backup or another incremental one.[96] This approach minimizes backup time and storage usage by avoiding redundant copying of unchanged data. However, it creates a dependency chain where restoring to a specific point requires the initial full backup followed by all subsequent incremental backups in sequence, potentially complicating and prolonging the recovery process.[97] The total size of such a chain is calculated as the size of the full backup plus the sum of the sizes of all changes captured in each incremental backup, expressed as Total size = Full backup size + Δ1 + Δ2 + … + Δn, where Δi represents the changed data volume in the i-th incremental backup.[98]
Differential backups, in contrast, record all changes that have occurred since the last full backup, making them cumulative rather than dependent on prior differentials.[99] This method simplifies restoration, as only the most recent full backup and the latest differential are needed to recover data to the desired point. However, differential backups grow larger over time without a new full backup, as they accumulate all modifications since the baseline, leading to increased storage demands compared to incremental methods.[100] Incremental backups generally require less storage space than differentials, achieving significant savings due to their narrower scope of changes.[101]
Implementation of these backups relies on technologies that efficiently track modifications. For instance, VMware's Changed Block Tracking (CBT) feature identifies altered data blocks on virtual machine disks since the last backup, enabling faster incremental operations by processing only those blocks.[102] Open-source tools like Duplicati support incremental backups by scanning for new or modified files and blocks, using deduplication to further optimize storage across runs.[103]
The primary advantages of incremental backups include reduced backup duration and storage footprint, making them ideal for frequent operations in high-change environments, though their chain dependency can extend restore times. Differential backups offer quicker recoveries at the cost of progressively larger backup sizes and longer creation times after extended periods. In 2025, AI-driven optimizations are enhancing these methods by predicting change patterns—such as data modification rates in databases or filesystems—to dynamically adjust backup scopes and schedules.[104] An advanced variant, incremental-forever backups, eliminates the need for periodic full backups after the initial one by using reverse incrementals or synthetic methods to create point-in-time restores efficiently, reducing storage and bandwidth while maintaining recoverability. This approach is gaining traction in 2025 for cyber-resilient environments.[104] A common strategy involves performing a weekly full backup followed by daily incrementals, which can significantly lower overall storage needs compared to full-only schedules.[105]
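One common way to get incremental-style daily snapshots without a chain of dependent archive files is rsync's --link-dest option, which hard-links unchanged files to the previous snapshot so each run stores only changed data; the directory layout below is an assumption.

    # Daily snapshot that stores only changed files; unchanged files are
    # hard-linked against the previous day's copy.
    TODAY=/backups/$(date +%F)
    rsync -a --delete --link-dest=/backups/latest /srv/data/ "$TODAY"/
    ln -sfn "$TODAY" /backups/latest   # repoint "latest" for the next run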
Continuous Data Protection
Continuous Data Protection (CDP) is a backup methodology that captures and records every data change in real-time or near-real-time, enabling recovery to virtually any point in time without significant data loss.[106] This approach maintains a continuous journal of modifications, allowing users to roll back to a precise moment, such as before a specific transaction or error, which is essential for environments where even seconds of data loss can be costly.[107] Unlike near-continuous data protection, which performs backups at fixed intervals like every 15 minutes, true CDP ensures all changes are immediately replicated, achieving a recovery point objective (RPO) approaching zero seconds.[108]
Key techniques include journaling, where every write operation is logged for granular rollback; log shipping, which periodically or continuously transfers transaction logs to a secondary system for replay; database replication using mechanisms like MySQL binary logs (binlogs) to mirror changes in real-time; and frequent snapshots that capture incremental states without interrupting operations.[109][110] These methods collectively minimize data gaps by treating backups as an ongoing process rather than periodic events.[111] CDP is particularly suited for high-availability applications in sectors like finance, where it protects transaction records and ensures regulatory compliance by preventing loss of sensitive client data during outages or cyberattacks.[112] As of 2025, emerging trends in data protection include AI-enhanced systems with anomaly detection for real-time safeguarding, applicable to Internet of Things (IoT) deployments handling vast sensor data.[113][114]
Implementation often relies on specialized tools such as Zerto, which provides journal-based CDP for virtualized environments with continuous replication, or Dell PowerProtect, which supports real-time data protection across hybrid infrastructures.[115][116] However, challenges include substantial bandwidth demands for sustaining continuous synchronization, particularly in distributed setups, necessitating dedicated networks or compression to mitigate performance impacts.[109][117] Compared to incremental backups, which offer finer granularity over full backups but still operate on schedules that can result in hours of potential data loss, CDP reduces RPO to minutes or seconds through ongoing capture.[118] Storage efficiency is achieved via deduplicated change logs in the journal, which retain only unique modifications rather than full copies, optimizing space while preserving point-in-time recoverability.[107]
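As a rough approximation of log shipping for MySQL, the binary log can be streamed continuously to a backup host; the host name, user, and starting log file below are illustrative, and credentials would normally come from a protected option file rather than the command line.

    # Stream the server's binary log to local files and keep waiting for new events.
    mysqlbinlog --read-from-remote-server --host=db01 --user=backupuser \
        --raw --stop-never mysql-bin.000001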
Storage Media and Locations
Local Media Options
Local media options encompass on-premises storage solutions that enable direct, physical access to backup data without reliance on external networks. These include magnetic tapes, hard disk drives (HDDs), solid-state drives (SSDs), and optical discs, each offering distinct trade-offs in capacity, access speed, cost, and longevity suitable for various backup scenarios. Magnetic tape remains a cornerstone for high-capacity, cost-effective backups, particularly in enterprise environments requiring archival storage. The Linear Tape-Open (LTO) standard, with LTO-9 as the prevailing format throughout much of 2025 and LTO-10 announced in November 2025 with 40 TB native capacity per cartridge (shipping Q1 2026), provides 18 TB of native capacity per LTO-9 cartridge, expandable to 45 TB with compression, at a native transfer rate of 400 MB/s.[119][120][121] Its advantages include low cost per gigabyte—often under $0.01/GB—and suitability for sequential data writes, making it ideal for full backups of large datasets. However, the sequential access nature limits random read/write performance, requiring full tape scans for data retrieval, which can take hours for terabyte-scale volumes. LTO tapes also boast an archival lifespan of up to 30 years under optimal conditions, far exceeding many digital alternatives for long-term retention.[122] Hard disk drives offer versatile local storage for both active and archival backups, often deployed in arrays for enhanced capacity and reliability. Traditional HDDs provide high density at low cost, with enterprise models featuring mean time between failures (MTBF) ratings around 1 to 2.5 million hours, ensuring durability in continuous operation. However, external HDDs are particularly susceptible to failure from mechanical wear over time or physical impacts such as shocks, necessitating regular backups to additional media to mitigate the risk of data loss.[123] They are commonly integrated into Network Attached Storage (NAS) devices for shared access or Storage Area Network (SAN) systems for block-level performance in data centers. Redundancy is achieved through RAID configurations, such as RAID 6 (tolerating up to two drive failures) or RAID 10 (balancing speed and redundancy), which maintain data integrity. For faster access, NVMe-based SSDs serve as local backup targets, delivering sequential write speeds exceeding 7 GB/s but at a premium cost of $0.05–$0.10/GB, making them preferable for incremental backups or virtual machine imaging where speed trumps capacity; quad-level cell (QLC) NAND variants offer higher capacities at reduced costs for archival use.[124] Optical media, particularly Blu-ray discs, support write-once archival backups with capacities up to 100 GB per quad-layer disc in BDXL format, suitable for small-scale or compliance-driven retention.[125] Archival-grade variants, like M-DISC, extend readability to 1000 years, though practical use is limited by slower write speeds (around 20–50 MB/s) and manual handling requirements. Selecting local media involves balancing capacity, access speed, and lifespan against use case needs; for instance, tapes excel in write speeds of 400 MB/s for bulk transfers but lag in retrieval compared to HDDs or SSDs offering random access under 1 ms. In 2025, hybrid NAS systems scale to petabyte levels—such as QNAP's 60-bay enclosures exceeding 1 PB—combining HDDs with SSD caching for optimized backup workflows. 
These options form the local component of strategies like the 3-2-1 rule, ensuring at least one onsite copy for rapid recovery.[126] Environmental factors critically influence media reliability; magnetic tapes require climate-controlled storage at 15–25°C and 20–50% relative humidity to prevent binder degradation, with stable conditions minimizing distortion. HDDs and SSDs demand vibration-resistant enclosures—HDDs tolerate up to 0.5 G during operation—to avoid mechanical failure, alongside cool, dry environments (5–35°C, <60% RH) for archival shelf life exceeding 5 years when powered off.[127][128][129]
Remote and Cloud Storage Services
Remote backup services enable organizations to store data copies at offsite locations via network protocols, enhancing protection against localized threats such as fires or floods by providing geographic diversity.[130] These services often utilize secure file transfer protocols like FTP (File Transfer Protocol) and SFTP (Secure File Transfer Protocol), where SFTP employs SSH encryption to safeguard data during transmission to remote vaults or servers.[131] Dedicated appliances, such as those integrated with IBM Systems Director, facilitate automated backups to remote SFTP servers, ensuring reliable offsite replication without manual intervention.[132] By distributing data across multiple geographic regions, these approaches mitigate risks from site-specific disasters, allowing quicker recovery and business continuity.[133] Cloud storage services have become a cornerstone for scalable backups, offering virtually unlimited capacity and automated management through providers like Amazon Web Services (AWS) S3, Microsoft Azure Blob Storage, and Google Cloud Storage.[134] These platforms feature tiered storage options tailored to access frequency and cost efficiency: hot tiers for frequently accessed data, cool or cold tiers for less urgent retrievals, and archival tiers for long-term retention with retrieval times ranging from hours to days.[135] For instance, AWS S3's standard (hot) tier is priced at approximately $0.023 per GB per month (US East region, as of November 2025), while archival options like S3 Glacier Deep Archive drop to around $0.00099 per GB per month, enabling cost-effective scaling for backup workloads.[136] Azure Blob and Google Cloud Storage follow similar models, with hot tiers at about $0.0184 and $0.020 per GB per month, respectively (US East, as of November 2025), allowing users to balance performance and expense based on data lifecycle needs.[137] As of 2025, advancements in backup technologies emphasize multi-cloud strategies to avoid single-provider dependencies and leverage the strengths of multiple platforms for redundancy.[138] Edge computing backups integrate local processing at distributed sites to reduce latency before syncing to central clouds, supporting real-time data protection in IoT and remote operations.[139] Integration with Software-as-a-Service (SaaS) environments has deepened, exemplified by Veeam's solutions for AWS, which automate backups of cloud-native workloads like EC2 instances and S3 buckets while ensuring compliance and rapid restoration.[140] These developments, driven by rising cyber threats, promote hybrid architectures that combine on-premises, edge, and multi-cloud elements for comprehensive resilience.[54] Security in remote and cloud backups prioritizes robust protections, with encryption in transit via TLS 1.3 ensuring data confidentiality during uploads and downloads across networks.[141] Compliance standards like SOC 2, which audits controls for security and availability, are widely adopted by major providers to verify trustworthy operations.[142] However, challenges persist, including latency for transferring large datasets over wide-area networks, which can extend initial backup times from days to weeks depending on bandwidth.[143] Vendor lock-in poses another risk, as proprietary formats and APIs may complicate data migration between providers, potentially increasing long-term costs and limiting flexibility.[144] Implementation of remote and cloud backups often begins with seeding the initial dataset to accelerate setup, 
particularly for large volumes where online transfer would be inefficient. Services like those from Acronis and Barracuda allow users to back up data to a provided hard drive, mail it to the provider's data center for upload, and then initiate ongoing synchronization.[145][146] Subsequent updates employ incremental synchronization, transferring only changed data blocks to minimize bandwidth usage and maintain currency.[147] This approach aligns with the 3-2-1 backup rule—three copies of data on two media types, with one offsite—achieved through geo-redundant storage that replicates backups across multiple regions for fault tolerance.[148] Providers like AWS and Azure support geo-redundancy natively, ensuring an offsite copy remains accessible even if a primary region fails.[149]
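After an initial seed, the recurring offsite step can be as simple as an object-storage sync that uploads only new or changed files; the bucket name and storage class below are assumptions.

    # Incremental sync of the local backup directory to object storage.
    aws s3 sync /backup s3://example-backup-bucket --storage-class STANDARD_IA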
Data Optimization Techniques
Compression and Deduplication
Compression and deduplication are key data reduction techniques employed in backup systems to minimize storage requirements while preserving data integrity for restoration. These methods address the growing volume of data by eliminating redundancies and shrinking file sizes, enabling more efficient use of local, remote, or cloud storage resources. Compression operates by encoding data more compactly, whereas deduplication identifies and stores only unique instances of data blocks, preventing duplication across backups. Together, they can significantly lower the effective storage footprint, with typical combined reductions ranging from 5:1 to 30:1 depending on data characteristics.[150][151]
Compression in backups relies on lossless algorithms that reduce data size without any loss of information, ensuring bit-for-bit accurate recovery during restoration. LZ4, developed for high-speed operations, achieves compression speeds exceeding 500 MB/s per core and is ideal for scenarios prioritizing performance over maximal size reduction, often yielding modest ratios suitable for real-time backups. In contrast, Zstandard (Zstd), which has become a default choice in many systems by 2025, offers a superior balance of speed and efficiency; internal benchmarks show it providing 30-50% better compression than predecessors like MS_XPRESS for database backups, typically reducing sizes by 50-70% on redundant data sets such as logs or structured files. For example, a 100 GB database backup compressed with Zstd at level 3 can shrink to 30-50 GB, depending on inherent data redundancy. These algorithms are widely integrated into backup tools to handle diverse data types without compromising restorability.[152][153][154]
Deduplication further optimizes backups by detecting and eliminating duplicate data blocks, a process particularly effective in environments with high redundancy like virtual desktop infrastructure (VDI). Block-level deduplication divides files into fixed or variable-sized chunks, computes a cryptographic hash for each—commonly using SHA-256 for its collision resistance—and stores only unique blocks while referencing duplicates via pointers. This approach can yield savings of 10-30x in VDI backups, where identical virtual machine images lead to extensive overlap, reducing 100 TB of raw data to as little as 3.3-10 TB of physical storage. Deduplication occurs either inline, where redundancies are removed in real-time before writing to storage to conserve immediate space and bandwidth, or post-process, where data is first stored fully and then analyzed for duplicates in a separate pass, which may incur higher initial resource use but allows for more thorough optimization. Inline methods are preferred in bandwidth-constrained cloud environments, though they demand more upfront CPU cycles.[155][156][151]
When combining compression and deduplication, best practices dictate performing deduplication first to remove redundancies from the full dataset, followed by compression on the resulting unique blocks, as this maximizes overall efficiency by avoiding redundant encoding efforts. The effective backup size can be approximated by the formula Effective size = Original size × (1 − duplication ratio) × (1 − compression ratio). Here, the duplication ratio represents the fraction of redundant data (e.g., 0.9 for 90% duplicates), and the compression ratio is the fractional size reduction after deduplication (e.g., 0.5 for 50% smaller).
This sequencing, as implemented in systems like Dell Data Domain, applies local compression algorithms such as LZ or GZfast to deduplicated segments, achieving compounded savings without inflating processing overhead. Tools like Bacula incorporate built-in deduplication via optimized volumes that use hash-based chunking to reference existing data, supporting both inline and post-process modes for flexible deployment. However, challenges include elevated CPU overhead during intensive hashing and scanning—particularly in inline operations—and rare false positives from hash collisions, though SHA-256 minimizes this risk to negligible levels for most datasets. In variable data environments, such as those with frequent changes, tuning block sizes helps mitigate these issues.[157][158][159]
By 2025, trends in backup optimization increasingly leverage AI-accelerated deduplication for unstructured data in cloud environments, where traditional hash-based methods struggle with similarity detection in files like documents or media. Adaptive frameworks, such as those employing machine learning for resemblance-based chunking, enhance ratios on enterprise backups and cloud traces, routinely achieving 5:1 or higher reductions by intelligently grouping near-duplicates. These AI enhancements, integrated into platforms handling VM snapshots and object storage, address the explosion of unstructured data growth while maintaining low latency for scalable cloud backups.[160]
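The deduplicate-then-compress ordering can be illustrated with a toy content-addressed chunk store; the 4 MiB fixed chunk size, paths, and file names are assumptions, and real products use variable-sized chunking and indexed metadata rather than a flat directory of hashes.

    #!/bin/sh
    # Toy dedup-then-compress store: split a stream into chunks, keep each
    # unique chunk once under its SHA-256 hash, compress only unique chunks.
    STORE=/backup/chunks
    mkdir -p "$STORE" /tmp/chunks
    split -b 4M /srv/data/archive.tar /tmp/chunks/part.
    for c in /tmp/chunks/part.*; do
        h=$(sha256sum "$c" | awk '{print $1}')
        if [ ! -f "$STORE/$h.zst" ]; then    # unseen chunk: store it
            zstd -q "$c" -o "$STORE/$h.zst"  # compression happens after dedup
        fi
        echo "$h"                            # ordered hash list = restore recipe
    done > /backup/archive.tar.recipe
    rm -rf /tmp/chunks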
Encryption and Security Measures
Encryption plays a critical role in protecting backup data from unauthorized access, ensuring confidentiality both during storage and transmission. The Advanced Encryption Standard (AES) with 256-bit keys, known as AES-256, is widely adopted as the industry benchmark for securing backup data due to its robustness against brute-force attacks.[161] For instance, solutions like Veritas NetBackup and Veeam Backup employ AES-256 to encrypt data written to repositories, tape libraries, and cloud storage.[162][163] Encryption at rest safeguards stored backup files, preventing access if physical media or storage systems are compromised, while encryption in transit protects data as it moves between source systems and backup locations. Tools such as Veritas Alta Recovery Vault apply AES-256 encryption for both at-rest and in-transit protection, often integrating FIPS 140-2 validated modules to meet federal cryptographic standards.[164][165] Microsoft BitLocker, a full-volume encryption tool, is commonly used for at-rest protection on Windows-based backup media, ensuring that entire drives remain inaccessible without the decryption key. Effective key management is essential to maintain security, with protocols like the Key Management Interoperability Protocol (KMIP) enabling centralized control and distribution of encryption keys across heterogeneous environments.[166] AWS services, for example, leverage AWS Key Management Service (KMS) for handling keys in backup encryption, supporting seamless rotation and auditing.[167][168]
Beyond encryption, additional security measures enhance backup resilience against threats like ransomware. Immutable storage prevents alterations or deletions of backup data for a defined retention period, with Amazon S3 Object Lock providing write-once-read-many (WORM) functionality that locks objects for configurable durations, typically ranging from days to years, to comply with regulatory retention requirements.[169] Air-gapping isolates backups by physically or logically disconnecting them from networks, creating an offline barrier that ransomware cannot traverse, as seen in strategies combining immutable copies with offline media.[170] Multi-factor authentication (MFA) adds a layer of access control, requiring multiple verification methods to authenticate users or systems before permitting backup operations or recovery.[171] Ransomware attacks have intensified the focus on these protections, particularly following the 2021 Colonial Pipeline incident, where the DarkSide ransomware group disrupted fuel supplies across the U.S. East Coast, highlighting the need for secure, isolated backups to enable rapid recovery without paying ransoms.[172]
By 2025, ransomware tactics increasingly target backups first, prompting adoption of behavioral analysis to detect anomalous patterns in backup access and isolated recovery environments that allow restoration from clean copies without reinfection.[173] Tools like Rubrik incorporate built-in immutability and air-gapped architecture, using WORM policies to lock backups and provide malware threat intelligence for proactive defense.[174][175] Compliance frameworks further guide these practices, with NIST Special Publication 800-53 outlining controls for system and communications protection, including encryption requirements for backups to ensure data integrity and confidentiality.[176] Zero-trust models, as detailed in federal guidelines, mandate continuous verification of all backup access requests, treating every interaction as potentially hostile regardless of origin.[177] Auditing logs maintain a chain of custody by recording all backup events, from creation to restoration, enabling traceability and forensic analysis in line with NIST AU-10 controls.[178][179]
Despite these benefits, encryption and security measures introduce challenges, such as the risk of key loss, which could render backups irretrievable if not mitigated through secure storage and recovery procedures. Performance impacts arise from computational overhead, potentially slowing backup and restore operations, though hardware-accelerated implementations minimize this in modern systems. Rubrik's immutable features address some challenges by integrating encryption with immutability without compromising recovery speed.[180] Encryption is typically applied after compression to optimize both security and efficiency.
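A compress-then-encrypt pipeline consistent with that ordering might look like the following sketch; the key file and paths are assumptions, and the key must be stored separately from the backups it protects.

    # Compress first, then encrypt the stream with AES-256.
    tar -cf - /srv/data \
        | zstd -q -c \
        | openssl enc -aes-256-cbc -salt -pbkdf2 \
            -pass file:/root/backup.key \
            -out /backup/data-$(date +%F).tar.zst.enc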
Other Manipulations
Multiplexing in backup processes involves interleaving multiple data streams from different sources onto a single target storage device, such as a tape drive, to optimize throughput and minimize idle time. This technique allows backup software to read data from several files or clients simultaneously while writing to one destination, effectively balancing the slower data ingestion rates from sources against the higher speeds of storage media. For instance, in tape-based systems, a common multiplexing ratio like 4:1—where four input streams are combined into one output—can significantly improve overall backup performance by keeping the drive operating at near-full capacity.[181][182][183]
Staging serves as a temporary intermediate storage layer in backup workflows, particularly within hierarchical storage management (HSM) systems, where data is first written to high-speed disk before relocation to slower, higher-capacity media like tape. This approach enables verification, error checking, and processing of backup images without directly burdening final storage, reducing the risk of incomplete transfers and allowing for more efficient resource allocation in multi-tier environments. In practice, disk staging storage units hold images until space constraints trigger automated migration, ensuring that recent or active data remains accessible on faster tiers while older data moves to archival storage.[184][185][186]
Refactoring of backup datasets entails reorganizing stored data to enhance accessibility and efficiency, often through tiering mechanisms that classify information as "hot" (frequently accessed) or "cold" (infrequently used). Hot data is retained on performance-oriented storage like SSDs for quick retrieval during recovery, while cold data is migrated to cost-effective tiers such as archival disks or tape, optimizing both speed and expense without altering the underlying backup content. This reorganization supports dynamic adjustment based on access patterns, ensuring that backup systems align with evolving data usage needs in enterprise settings.[187][188]
Automated grooming prunes obsolete backups according to predefined retention policies, systematically deleting expired images to reclaim storage space and maintain compliance. Tools like Data Lifecycle Management (DLM) in backup solutions monitor retention periods and execute cleanup cycles—typically every few hours—marking and removing sets once their hold time elapses, which prevents storage bloat and simplifies management. By 2025, advancements in AI integration enable anomaly-based grooming, where machine learning detects irregularities in backup patterns, such as unexpected data growth or corruption, to proactively refine retention and cleanup processes beyond rigid schedules.[189][190][191]
These manipulations find key applications in Storage Area Network (SAN) environments, where multiplexing and staging combine to shorten backup windows by parallelizing data flows and buffering transfers, allowing large-scale operations to complete faster without overwhelming network resources. For example, in SAN-attached setups, staging to disk before tape duplication enables concurrent processing of multiple hosts, while multiplexing ensures continuous drive utilization, collectively reducing downtime in high-volume data centers.[182][192][193]
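In its simplest form, grooming of expired sets can be a scheduled pruning job such as this sketch, where the archive path, file pattern, and 90-day window are illustrative.

    # Delete backup archives older than the retention window, then report space.
    find /backup/archive -name '*.tar.zst.enc' -mtime +90 -print -delete
    df -h /backup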
Management and Recovery
Scheduling and Automation
Scheduling in backup processes involves defining specific times or conditions for initiating data copies to ensure consistency and minimal disruption. Traditional methods often rely on cron jobs, a Unix-like system utility for automating tasks at predefined intervals, such as running full backups nightly at off-peak hours to avoid impacting business operations.[194][195] Policy-based scheduling, common in enterprise environments, allows administrators to set rules for backup frequency and type—such as full backups weekly and incrementals daily—aligned with recovery time objectives (RTO) and recovery point objectives (RPO) while steering clear of peak system loads during business hours.[196][197]
Automation tools streamline these schedules by integrating with orchestration platforms and cloud services. Ansible, an open-source automation tool, can deploy and manage backup jobs across hybrid environments, including configurations for Veeam Backup & Replication to handle scheduling and execution without manual intervention.[198] Veeam provides built-in automation for job orchestration, supporting scripted deployments and API-driven scheduling for consistent backups.[199] Cloud schedulers like AWS Backup enable policy-driven automation, where rules define backup windows, retention, and transitions to colder storage tiers automatically.[200] Event-triggered backups enhance responsiveness by initiating processes based on specific conditions, such as file modifications detected via tools like inotify on Linux systems or Veeam Agent's event monitoring for changes during active sessions.[201][202]
Best practices emphasize resource efficiency and foresight in scheduling. Staggered schedules distribute backup loads across time slots—for instance, grouping servers into cohorts to prevent simultaneous I/O spikes on shared storage—reducing contention and improving overall system performance.[203][204] In 2025, artificial intelligence (AI) is increasingly applied for predictive scheduling, using machine learning to forecast data growth patterns and adjust backup frequencies proactively, thereby optimizing storage usage and minimizing unnecessary operations.[205][206] Scheduling can also incorporate rotation policies, such as the grandfather-father-son scheme, to cycle through backup sets without overlapping critical windows.[207]
Effective monitoring is integral to automation, providing real-time oversight of backup operations. Alerts for failures, such as job timeouts or incomplete transfers, can be configured through platform-native tools like AWS Backup's event notifications or Azure Monitor, enabling rapid response to issues.[208][209] Integration with Security Information and Event Management (SIEM) systems, as supported by Veeam and solutions like Keepit with Microsoft Sentinel, correlates backup events with security logs for holistic threat detection and anomaly alerting.[210][211]
Challenges in backup automation often center on failure handling and reliability. Transient issues like network disruptions can cause job interruptions, necessitating retry mechanisms—such as exponential backoff in Veeam or automated re-execution in Azure Backup—to attempt recovery without manual escalation.[212][213] Notifications via email, SMS, or integrated dashboards ensure administrators are informed of persistent failures, while scripting automation significantly reduces manual errors by enforcing consistent processes and eliminating oversight in routine tasks.[214][215]
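A traditional cron-based schedule along these lines might look like the entries below; the script names, times, and notification address are hypothetical.

    # Weekly full backup early Friday, incrementals Monday through Thursday,
    # and a later verification job that mails an alert on failure.
    0 1 * * 5    /usr/local/sbin/full-backup.sh
    0 1 * * 1-4  /usr/local/sbin/incremental-backup.sh
    30 2 * * *   /usr/local/sbin/verify-last-backup.sh || mail -s "backup verify failed" admin@example.com < /dev/null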
Onsite, Offsite, and Backup Sites
Onsite backups involve storing data copies at the primary facility, enabling immediate access for quick recovery from minor incidents such as hardware failures or user errors. This approach typically achieves a low recovery time objective (RTO) of less than one hour due to the proximity of storage media like local disks or tapes, allowing rapid restoration without external dependencies. However, onsite storage carries significant risks as a single point of failure, vulnerable to localized threats including fires, floods, or power outages that could destroy both primary and backup data simultaneously.[216][217][218]
Offsite backups address these limitations by replicating data to geographically separate locations, such as secure vaults or dedicated disaster recovery (DR) sites, to protect against site-wide disruptions. These facilities must meet criteria for physical separation, environmental controls, and access security to ensure data integrity. Offsite strategies are classified into types based on readiness: hot sites, which are fully mirrored and active for near-real-time failover; warm sites, featuring partial equipment and periodic synchronization for recovery in hours to days; and cold sites, providing basic infrastructure like power and space but requiring full setup over days or weeks, often using tape archival for long-term storage.[216]
Backup sites extend offsite capabilities by maintaining full system replicas for seamless failover, particularly in cloud environments where multi-region deployments enhance global resilience against regional outages. As of 2025, providers like AWS emphasize multi-region architectures to distribute workloads across availability zones, minimizing single-point failures and supporting RTOs aligned with business criticality.[219][220]
Key strategies for offsite implementation include electronic vaulting, which automates data transfer to remote storage via replication or journaling for faster, more secure delivery compared to physical shipment of media like tapes. Electronic vaulting reduces labor and transit risks while enabling quicker access, though it requires robust network security. In contrast, physical shipment suits cold storage but incurs higher costs from handling and delays. Cost-benefit analyses show offsite solutions, especially electronic methods, significantly mitigate downtime by enabling recovery from disasters that could otherwise extend outages for days, aligning with the 3-2-1 rule of maintaining three data copies on two media types with one offsite.[221][216][222]
Legal considerations for offsite backups emphasize data sovereignty, particularly in cross-border transfers, where regulations like the EU's General Data Protection Regulation (GDPR) mandate that personal data of EU residents remain subject to equivalent protections regardless of storage location. As of 2025, additional frameworks such as the EU's NIS2 Directive require enhanced cybersecurity measures, including regular testing of backup and recovery processes for critical sectors. Organizations must ensure offsite sites comply with jurisdictional laws, such as keeping EU data within the EU or using approved transfer mechanisms to avoid penalties.[223][224]
Verification, Testing, and Restoration
Verification of backups is essential to confirm data integrity after the backup process, preventing silent corruption that could render restores ineffective. Post-backup verification typically involves computing and comparing checksums, such as MD5 or SHA-256 hashes, against the original data to ensure 100% integrity.[225] Automated tools perform these scans routinely, detecting bit rot or transmission errors without manual intervention, and are recommended as a standard practice in data protection workflows.[226]
Testing backups ensures they are not only complete but functional for recovery, mitigating risks from untested assumptions. Organizations often conduct quarterly full restores in isolated sandbox environments to simulate real-world scenarios without impacting production systems.[227] Tabletop exercises for disaster recovery involve team discussions of hypothetical failures, validating coordination and procedures without executing actual restores.[228] According to a 2025 report, only 50% of organizations test their disaster recovery plans annually, highlighting a gap in proactive validation.[229]
Restoration processes vary between granular file-level recovery, which targets specific items for quick access, and full system restores, which rebuild entire environments from images. Key steps in a full system restore include mounting the backup image to a target volume, applying any incremental changes or logs, and booting the system in a test environment to verify operability.[230] Challenges in restoration include prolonged times, particularly from tape media, where recovering 1TB of data may require up to 48 hours due to sequential access and hardware limitations. Additionally, approximately 50% of backup restores fail, often because they were never tested for recoverability.[231][232]
Best practices emphasize documented runbooks that outline step-by-step recovery actions, alongside regular validation of Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) to align with business needs. Immutable backups, which lock data against modifications, facilitate clean restores following ransomware incidents by ensuring attackers cannot tamper with copies.[233] Offsite copies may be incorporated into tests to confirm multi-location viability.
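A minimal checksum-plus-test-restore routine, with illustrative paths and file names, could look like this:

    # At backup time: record checksums alongside the archives.
    sha256sum /backup/*.tar.gz > /backup/SHA256SUMS
    # At verification time: a non-zero exit status signals a corrupted copy.
    sha256sum --check --quiet /backup/SHA256SUMS
    # Periodic test restore into a scratch directory rather than production.
    mkdir -p /tmp/restore-test
    tar -xzf "/backup/data-$(date +%F).tar.gz" -C /tmp/restore-test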
