Archive
| Body of Knowledge | Document Production Workflow |
|---|---|
| Lifecycle Category | Archive |
| Content Contributor(s) | Franklin Friedmann edp |
| Original Publication | August 2014 |
| Copyright | © 2014 by Xplor International |
| Content License | CC BY-NC-ND 4.0 |
What is an Archive?
Gartner has summarized the general archival process as “A consolidated system for storage, access, management and viewing of data that is often print-stream-originated. Leading uses of IDARS (Image and Document Archive and Retrieval Systems) include mission-critical customer service support, electronic bill presentment, management and distribution of report data (e.g., mainframe output, transaction logs and financial reports) and long-term archiving of historical data.”
The 20th Century saw the emergence of a number of technologies based on Document Imaging. Film was used to store pages, first with Computer Output to Microfilm (COM) and later with Microfiche. COM technology was less expensive than paper and was eventually legally accepted in place of original paper documents. Indexes usually referred to the image location, such as the distance down the roll of microfilm or the card number of the microfiche. A further extension placed the information on laser disk (Computer Output to Laser Disk, COLD).
In the 21st Century the field is crowded enough to require us to differentiate types of archives based on usage. Archiving now moves beyond the deliberate preservation of records in remote storage to preservation with the stated objective of protecting against loss. Today there are multiple approaches, ranging from occasional retrieval to the constant access found in Electronic Content Management systems.
Backup versus Archive
There is a difference between Backup and Archive. Where a Backup is a secondary copy typically used in recovery operations, an archived copy is intended as a duplicate maintained for analysis, value generation or compliance. Archived copies typically include context and meet regulatory requirements.
Dynamics Driving Archive Storage Management
What are the parameters that demand an archive?
- With the rapid adoption of Internet technology, the amount of information held, driven both by volume and by new data types that take significant amounts of storage, is expected to double every 6 – 12 months. Each organization must prepare for this inundation.
- There is a staggering cost to be paid to provide this storage if everything is equally accessible. Decisions will be made to tier accessibility (also known as a storage hierarchy, where content is migrated to lower cost media as it ages and access frequency typically declines).
- To protect both users and organizations, legislation requiring compliance has been passed in most countries, particularly in the medical and financial fields.
Ignoring these laws and their implications invites litigation. Litigation forces retrieval of information. The defendant’s case is in peril, with enormous financial exposure, if the information can’t be found or is not an accurate rendition of the original.
A Computer Archive
Let’s consider the characteristics of a good archive.
- There must be significant, extensible storage. Storage that can be expanded and/or replaced when necessary is essential.
- To address this, a tiered (hierarchical) storage is preferable with cascading performance levels. Further, the ability to add another tier without reloading is a must, as organizations may initially decide to limit their investment. Note that sub-second latency is not usually expected of archives.
- The component technologies span a range of price/performance, from direct memory, online disk, alternate-site (online) disk and optical media through to loadable tapes / cassettes.
- A strategy must be developed and implemented to cascade the objects to slower, less expensive devices. This activity must be automated; there isn’t enough staff to perform any of the actions beyond mounting tapes/cassettes, nor would manual processes be reliable or accurate enough to meet best practice needs and expectations.
- Processes for loading and retrieval, suited to current and future volumes and to the type of request, must be automated as well.
- As technology becomes a global requisite, the UNCITRAL Model Law, a United Nations set of standards that includes the modernization and harmonization of rules on international business, will impact archiving. Staying in tune with the standards, now and as they evolve, will be a target for businesses.
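The automated cascade described in the list above can be sketched roughly as follows. This is a minimal illustration, not a product design: the tier names, the age thresholds and the `ArchivedObject` type are all assumptions made for the example, and a real archive would also consider access frequency, object size and policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical tiers, ordered from fastest / most expensive to slowest.
# The age limits are illustrative assumptions, not recommendations.
TIERS = [
    ("online_disk", timedelta(days=90)),
    ("remote_disk", timedelta(days=365)),
    ("optical", timedelta(days=365 * 3)),
    ("tape", None),  # final tier: no upper age limit
]


@dataclass
class ArchivedObject:
    object_id: str
    last_accessed: date
    tier: str = "online_disk"


def target_tier(obj, today):
    """Return the name of the first tier whose age limit covers the object."""
    age = today - obj.last_accessed
    for name, max_age in TIERS:
        if max_age is None or age <= max_age:
            return name
    return TIERS[-1][0]


def cascade(objects, today):
    """Re-tier every object by age and return the moves that were made.

    Note: an object accessed recently is placed back on a faster tier.
    """
    moves = []
    for obj in objects:
        new_tier = target_tier(obj, today)
        if new_tier != obj.tier:
            moves.append((obj.object_id, obj.tier, new_tier))
            obj.tier = new_tier
    return moves
```

Because the policy is a plain ordered list, adding another tier in this sketch means inserting one entry into `TIERS`; the objects themselves are not reloaded, only re-evaluated on the next `cascade` run.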
The Storage Application
Having identified the device technology, there must be concomitant software / firmware to operate the archive to classify, move, index and discover data.
- A robust Workflow that manages the process must be included. Current technology provides the management process which in the Best of Breed includes monitoring screens that graphically show activity with performance related statistics. Jobs underway can be displayed, providing a drill down process to individually view their state.
- International standards will be followed. To this end the International Organization for Standardization (ISO) has formulated standards. “The initial effort has been the development of a Reference Model for an Open Archival Information System (OAIS). The OAIS Reference Model has been reviewed and pending some editorial updates, it has been approved as an ISO Standard and as a CCSDS Recommendation.”
- This covers delivery to the archive, retrieval and data formats, for one reason only: so that someday in the future, after the current operators, analysts, programmers and managers have retired or simply left, the information can be retrieved with the then-current tools. The best situation is an integrated discovery tool, with the archive joined to the indexing process.
- During this process, a standardized indexing system is used. The ability to build Metadata, which defines the content in a meaningful way, is crucial, particularly if litigation is involved. When the data is stored, there is no guarantee what information will need to be retrieved nor on what basis.
- Support for extensive format types and data standards is required. Resources such as images, TIFF bit maps, and print streams such as PostScript, PCL and AFPDS may prove difficult to retrieve over time. An archive of PDF documents is advisable. As an example, PDF/A (PDF for archiving) provides an indication of the technical complexities.
- PDF/A standards require that all resources be embedded. No audio or video content, encryption, or executable files are permitted. The Metadata must follow standards. All these restrictions are meant to increase the likelihood that an archived document can be read successfully decades later (so, no passwords to lose, no exotic video/audio codecs required, and no obsolete language interpreters needed). The following details are particularly valuable:
  - PDF/A-1 is a constrained form of Adobe PDF version 1.4 intended to be suitable for long-term preservation of page-oriented documents for which PDF is already being used in practice. The ISO standard [ISO 19005-1:2005] was developed by a working group with representatives from government, industry, and academia and active support from Adobe Systems Incorporated.
  - The PDF/A standards attempt to maximize:
    - device independence,
    - self-containment, and
    - self-documentation.
  - The constraints include:
    - Audio and video content are forbidden.
    - JavaScript and executable file launches are prohibited.
    - All fonts must be embedded and also must be legally embeddable for unlimited, universal rendering.
    - Color spaces must be specified in a device-independent manner.
    - Encryption is disallowed.
    - Use of standards-based metadata is mandated.
  - The PDF/A-1 standard defines two levels of conformance: conformance level A satisfies all requirements in the specification; level B is a lower level of conformance, "encompassing the requirements of this part of ISO 19005 regarding the visual appearance of electronic documents, but not their structural or semantic properties."
- One must ensure that any new document type can be handled for input and output.
- An organization’s requirements will change, but adherence to data standards must remain. If necessary, this may mean reformatting data, recognizing that maintaining the applications that store and retrieve non-standard data may not be practicable indefinitely. Archive and retrieval software selection has decade-plus implications.
- Addressing data compaction is necessary. Technologies that reduce repetition or clutter have reported storage reductions in excess of 80%. One approach, deduplication, is performed as archiving takes place. This introduces a risk: when one item is deleted at some point in the future, the deletion must not damage the reference markers that let multiple sets of information share a single stored copy.
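The deduplication concern in the last point can be illustrated with a small sketch. It assumes a content-addressed store in which identical payloads are kept once and a reference count protects shared copies from premature deletion; the class and method names are hypothetical and do not describe any particular archive product.

```python
import hashlib


class DedupStore:
    """Content-addressed store: identical payloads are kept only once."""

    def __init__(self):
        self._blobs = {}      # digest -> payload bytes
        self._refcount = {}   # digest -> number of documents that use it
        self._documents = {}  # document id -> digest

    def archive(self, doc_id, payload):
        """Store a document, reusing an existing copy of the same bytes."""
        if doc_id in self._documents:
            raise KeyError(f"{doc_id} is already archived")
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = payload
        self._refcount[digest] = self._refcount.get(digest, 0) + 1
        self._documents[doc_id] = digest
        return digest

    def retrieve(self, doc_id):
        return self._blobs[self._documents[doc_id]]

    def delete(self, doc_id):
        """Remove one document; drop the bytes only when nothing else uses them."""
        digest = self._documents.pop(doc_id)
        self._refcount[digest] -= 1
        if self._refcount[digest] == 0:
            del self._refcount[digest]
            del self._blobs[digest]

    def stored_bytes(self):
        return sum(len(blob) for blob in self._blobs.values())
```

In this sketch, archiving the same logo image under two statement IDs stores the bytes once, and deleting one statement leaves the other fully retrievable. The reference count is the piece that keeps the shared "referential marker" intact.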
Lifecycle-based Decision Making
A process known as Records Retention Management is developed by each organization. The objective is to identify the need for and value of the information being stored and to develop management rules for each type of document. Because records are a particular set of essential business information, they need this formalized process.
Lifecycle Embodiment covers object creation, usage, level of access (how important, what retrieval parameters), archival (how long) and finally disposition (deletion). For our purpose, we concentrate on the archival and deletion aspects.
- Determining what to archive requires management attention as the variety of objects expands. Today vast amounts of data are being generated by social media applications such as Facebook, YouTube and Twitter that relate to retail operations where the company’s products are assessed in the public domain. This is the operational arena of various forms of analytics, information that has immediacy.
- Using Customer Experience Management techniques, the likes and dislikes are evaluated. Does this information have more than a transitory use? Should it end up in a corporate data warehouse, or be shifted to an archive for lengthy post mortems? This is just one class of information coming in from the public sphere, which also includes chat rooms, formal websites and so on.
- Risk Management techniques are used to assess the information to be stored against criteria including legal and compliance requirements.
- Deletion / data destruction: this brings up the need to define deletion criteria, based on need and legal necessity. In particular, the advice is that once the retention threshold has been reached, the data / objects should be discarded unless litigation requires a hold.
Note that the need to retain information depends on the information and the industry. Tax information, depending on the country, may need to be retained for 6 – 7 years after filing. Insurance records must usually be retained for six years after the end of coverage, which is minimal for an automobile policy, but could be over 100 years for a Life Insurance Policy. In the US, credit card bills must be retained for seven years; in the European Union, it is ten years.
It is essential that the records / information ultimately be retrievable, which requires an assessment of the life span of the data versus that of the supporting technology.
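The retention and deletion rules above can be expressed as a small policy check. The following is a sketch under stated assumptions: the record types and the year counts in `RETENTION_YEARS` loosely follow the figures quoted in the text and are illustrative only, and the `Record` type and function names are hypothetical. Actual periods depend on jurisdiction and industry.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative retention periods in years (assumptions for this example).
RETENTION_YEARS = {
    "tax": 7,
    "insurance": 6,
    "credit_card_us": 7,
    "credit_card_eu": 10,
}


@dataclass
class Record:
    record_id: str
    record_type: str
    retention_start: date  # e.g. filing date or end of coverage
    legal_hold: bool = False


def disposal_date(record):
    """Return the first day on which the record may be discarded."""
    years = RETENTION_YEARS[record.record_type]
    start = record.retention_start
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # 29 February with a non-leap target year: use 1 March instead.
        return date(start.year + years, 3, 1)


def may_dispose(record, today):
    """A record on legal hold is never disposable, whatever its age."""
    if record.legal_hold:
        return False
    return today >= disposal_date(record)
```

Checking the legal hold before the date comparison reflects the rule in the text: reaching the threshold normally triggers disposal, but a litigation hold overrides it.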
Disaster Recovery – The Place of Archiving
Disaster Recovery requires an operational restart. Disaster Recovery plans are developed to ensure that the organization has access to all data required to continue business, and that may involve turning to archives as a last resort. However, an archive doesn’t offer the resources that Disaster Recovery calls for. It is wise to have a backup of the archive in place.
Best Practice for Archives
What characterizes best practices?
- Governance: Management oversight in establishing the policies that determine the archive parameters.
- Administration: Once policies have been established, the objects must be classified and the classifications inserted as controls over the processes. Policies take security into account, not only protecting the entire archive from improper use or intrusion, but also protecting individuals’ information, such as the medical records covered in the USA by HIPAA. Protecting individuals’ information is complex because requirements vary by jurisdiction. As an example, California, USA considers a person’s ZIP (postal) code subject to protection!
- Physical Implementation: An archive is a physical choice; utilities, products or services to deliver the archive function must be chosen. It is strongly recommended that the choice be made holistically, making sure that the components work together.
- Format: Choice of archiving format; use PDF/A-1 to ensure future data retrievability.
References
Phil Goodwin, How to create a successful data archiving strategy, Storage Digest, January 2014
Lindsay Wyse, Four factors to weigh in planning an analytics big data architecture, Search Business Analytics, July 2013
ISO Archiving Standards, http://nssdc.gsfc.nasa.gov/nost/isoas/ last revised by John Garrett, 2006
Phil Goodwin, Archive It! Storage Magazine Online, June 2013
Gartner, IT Glossary, http://www.gartner.com/it-glossary/idars-integrated-document-archive-and-retrieval-system
Technical Information Paper No. 12, Digital-Imaging and Optical Digital Data Disk Storage Systems: Long-Term Access Strategies for Federal Agencies, http://www.archives.gov/preservation/technical/storage-strategy.html, July 1994
Elings and Waibel, MetaData for All, First Monday Journal, March 2007, http://firstmonday.org/article/view/1628/1543#author
Long term preservation (PDF/A), http://www.digitalpreservation.gov/formats/fdd/fdd000125.shtml
UNCITRAL websites, http://www.uncitral.org/uncitral/en/uncitral_texts/arbitration/1985Model_arbitration.html