11.6 Protecting Data for Long Term (Data Archive)

In an organization's data center, data is actively created, accessed, and changed. As data ages, it is less likely to be changed and eventually becomes “fixed,” although applications and users continue to access it. Such data is called fixed content. CAD/CAM designs, surveillance video, MP3s, and financial documents are a few examples of fixed content, which is growing at over 90% annually. Data archiving is the process of moving fixed content that is no longer actively accessed to a separate, low-cost archival storage tier for long-term retention and future reference.

Archiving is used to efficiently store data that is no longer used, or is used only very infrequently, but must still be retained. In email archiving, for example, copies of messages are made as they are sent and received. The most common purpose of archiving and retrieval is legal compliance. An archive serves as proof that data has remained unchanged since it was stored. Backups are used to restore lost or corrupted data to its original location so that it can be used for its original purpose. Archives, on the other hand, are used to locate data based on its content, and usually to copy that data to a new location where it can be used for a different purpose, often legal compliance.

A data archive is a storage repository used to store fixed content. Organizations set their own policies for qualifying data to be moved into archives. These policy settings are used to automate the process of identifying and moving the appropriate data into the archive system. Organizations implement archiving processes and technologies to reduce primary storage cost. With archiving, capacity on expensive primary storage can be reclaimed by moving infrequently accessed data to a lower-cost archive tier.

The key to determining how long to retain an organization's archives is to understand which regulations apply to its industry and which retention rules each regulation imposes. Archiving helps organizations comply with these regulations. Archiving can also help organizations use growing volumes of information in new and unanticipated ways. For example, new product innovation can be fostered if engineers can access archived project materials such as designs, test results, and requirement documents. In addition to meeting governance and compliance requirements, organizations retain data for business intelligence and competitive advantage. Both active and archived information can help data scientists drive new innovations or improve current business processes.

Archiving solutions should meet an organization’s compliance requirements through automated, policy-driven data retention and deletion. They should provide features such as scalability, authenticity, immutability, availability, and security. An archiving solution should be able to authenticate the creation and integrity of files in the archive storage. Long-term reliability is key because these systems hold critical documents, and any failure could have compliance, legal, and business consequences. The archiving solution should also support a variety of online storage options, such as disk-based storage and cloud-based storage. Another key factor is support for a variety of data types, including emails, databases, PDFs, images, audio, video, binary files, and HTML files.
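The integrity requirement above can be pictured with a small sketch. It assumes the archive records a SHA-256 digest when a file is ingested and recomputes it later to detect changes. The `ArchiveRecord` class and the choice of SHA-256 are illustrative assumptions, not a description of any particular archiving product.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest of the content (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class ArchiveRecord:
    """Hypothetical archived item that remembers its digest from ingest time."""

    def __init__(self, data: bytes):
        self.data = data
        self.digest = fingerprint(data)  # recorded once, when the item is archived

    def is_intact(self) -> bool:
        # Recompute the digest and compare it with the one stored at ingest.
        return fingerprint(self.data) == self.digest
```

If the stored bytes are altered after ingest, `is_intact()` returns `False`, which is the kind of evidence an authenticity check would rely on.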

Powerful indexing and search capabilities in an archiving solution speed up data retrieval. An effective archival system needs to support complex searches of content within documents. Archiving solutions should enable electronic discovery (eDiscovery) and sharing of data for litigation purposes in a timely and compliant manner. Reporting capabilities are required to process huge volumes of data and deliver customized reports for compliance requirements.

Archive Operation

An archiving solution architecture consists of three key components: an archiving agent, an archiving server, and an archiving storage device.

An archiving agent is software installed on the application servers (for example, file servers and email servers). The agent is responsible for scanning the data and archiving it based on the policy defined on the archiving server (policy engine). After data is identified for archiving, it is moved to the archiving storage device. From a client perspective, this movement is completely transparent.

The original data on the primary storage is then replaced with a stub file. The stub file contains the address of the archived data. Because the stub is small, it saves significant space on primary storage. When a client tries to access the file from the application server, the stub file is used to retrieve the file from the archive storage device.
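The stub mechanism described above can be sketched as follows. This is a minimal illustration that uses a local directory as a stand-in for the archive storage device. The `.stub` suffix, the JSON stub layout, and the function names are assumptions made for the example, not the behavior of a specific archiving agent.

```python
import json
import shutil
import tempfile
from pathlib import Path

STUB_SUFFIX = ".stub"  # hypothetical naming convention for stub files


def archive_file(path: Path, archive_dir: Path) -> Path:
    """Move a file to archive storage and leave a small stub in its place."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / path.name
    shutil.move(str(path), str(target))
    stub = path.with_name(path.name + STUB_SUFFIX)
    # The stub holds only the address of the archived data.
    stub.write_text(json.dumps({"archived_at": str(target)}))
    return stub


def read_through_stub(stub: Path) -> bytes:
    """Follow the stub to the archived copy and return its content."""
    address = json.loads(stub.read_text())["archived_at"]
    return Path(address).read_bytes()


def demo():
    """Archive a 1 MB file; return (recalled size, stub size, first recalled bytes)."""
    with tempfile.TemporaryDirectory() as tmp:
        primary = Path(tmp) / "primary"
        primary.mkdir()
        original = primary / "design.cad"
        original.write_bytes(b"x" * 1_000_000)
        stub = archive_file(original, Path(tmp) / "archive")
        data = read_through_stub(stub)
        return len(data), stub.stat().st_size, data[:3]
```

In this sketch the stub is a few hundred bytes at most, while the full content remains retrievable through it. A real agent would intercept file access so that the recall is transparent to the client.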

An archiving server is software installed on a server that enables administrators to configure the policies for archiving data. Policies can be defined based on file size, file type, or creation, modification, or access time. Once data is identified for archiving, the archiving server creates an index for the data to be moved. Using this index, users can also search for and retrieve their data with a web search tool.
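A policy of this kind can be sketched as a simple predicate over file metadata, followed by an index of the files that qualify. The `FileInfo` and `ArchivePolicy` types, their field names, and the index layout are hypothetical and chosen only for illustration. Real archiving servers expose their own policy formats.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set

DAY = 86_400  # seconds in a day


@dataclass
class FileInfo:
    path: str
    size: int           # bytes
    suffix: str         # file type, e.g. ".pdf"
    last_access: float  # epoch seconds


@dataclass
class ArchivePolicy:
    min_size: int = 0
    suffixes: Optional[Set[str]] = None  # None means any file type
    min_idle_days: int = 0

    def matches(self, f: FileInfo, now: float) -> bool:
        """True if the file is large enough, of a listed type, and idle long enough."""
        if f.size < self.min_size:
            return False
        if self.suffixes is not None and f.suffix not in self.suffixes:
            return False
        return (now - f.last_access) >= self.min_idle_days * DAY


def select_and_index(files: Iterable[FileInfo], policy: ArchivePolicy,
                     now: float) -> Dict[str, dict]:
    """Build an index (path -> metadata) of the files that qualify for archiving."""
    return {
        f.path: {"size": f.size, "type": f.suffix}
        for f in files
        if policy.matches(f, now)
    }


def search(index: Dict[str, dict], suffix: str) -> List[str]:
    """Look up archived paths by file type using the index."""
    return sorted(p for p, meta in index.items() if meta["type"] == suffix)
```

For example, a policy of `ArchivePolicy(min_size=1_000_000, suffixes={".pdf"}, min_idle_days=365)` would select only PDF files of at least 1 MB that have not been accessed for a year.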

Backup is driven by the need for recoverability and disaster protection, while archiving is driven by the need for improved efficiency and to address compliance challenges. Real cost savings can be realized by adopting a common strategy for the physical storage of both backup and archiving workloads. To accomplish this, a common storage target must be able to handle the throughput and inline deduplication requirements of backup workloads as well as the secure, long-term retention requirements of archive workloads. In addition, the storage target should provide built-in capabilities for network-efficient replication for disaster recovery, offer enterprise features such as encryption, and integrate easily with existing application infrastructure. By using a common infrastructure for both, organizations can greatly ease the burden of eDiscovery, data recovery, business continuity, and compliance, and achieve these goals in the most cost-efficient manner.

Content Addressed Storage (CAS)

Data integrity, scalability, and protection are the primary requirements for any data archiving solution. Traditional archival solutions such as CD, DVD-ROM, and tape do not provide the required scalability, availability, security, and performance. Content addressed storage (CAS) is a special type of object-based storage device purpose-built for storing and managing archives (fixed content). CAS stores user data and its attributes as an object.

The stored object is assigned a globally unique address, known as a content address (CA). This address is derived from the object’s binary representation. Content addressing eliminates the need for applications to understand and manage the physical location of an object on the storage system. The content address, which acts as a digital fingerprint of the content, not only simplifies the task of managing huge numbers of objects but also ensures content authenticity. The key features of CAS are as follows: