Event:

Join us at Xplor 2024 SUMMIT September 24-26 in Orlando, Florida

Data

From Xplor Wiki
Revision as of 09:40, 7 November 2023 by ScottMulkey (talk | contribs) (Correct syntax)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
EDBOK Guide
EDBOK-book cover.png
Body of Knowledge
Document Production Workflow
Lifecycle Category
Data
Content Contributor(s)
Kevin Tondreau edp
Original Publication
August 2014
Copyright
© 2014 by Xplor International
Content License
CC BY-NC-ND 4.0

What is Data?

The term data means different things to different people. What does it mean in the document creation world? What types of data can be presented to document composition tools or to programmers to code and design a document? What type of data is required for printers to perform their magic to generate a printed page, or for an online archive solution to provide internal or external document access? What are data types and how are they generated, stored and used?

Raw data (a transaction document systems phrase, meaning data in its native format in customer databases, accounting systems, and so on) is composed of records, fields within the records, and characters and numeric values within the fields. How the data is analyzed and formatted depends on its role. For it to be parsed and presented correctly in its final format requires a basic knowledge of data structures and correct data storage protocols.

Languages of Data

There were originally two basic language types used to store data: EBCDIC (Extended Binary Coded Decimal Interchange Code) and ASCII (American Standard Code for Information Interchange). To support extended character sets, Unicode has emerged as the standard encoding system for a majority of computer systems. Unicode defines a code space of 1,114,112 code points, each available to map to a character.

Each language uses a different approach to describing data. Basic character mapping is accomplished through the use of code pages that map code points to hex values that are interpreted based on the language of the encoding.

EBCDIC is the common language for IBM Mainframe Operating Systems. It describes letters, numbers and standard punctuation marks. EBCDIC evolved to an expanded character set up to 256 characters using an eight-bit encoding schema.

ASCII is used in Windows, UNIX and most non-mainframe systems. ASCII contains letters, numbers and standard punctuation marks. Depending on the system storing the data, the standard seven-bit string allows 128 characters; some systems will expand to eight bits of data, which allows for an expanded character set of 256 characters.

Systems that permit 32 bits of data allow for characters in many of the world’s languages that require more than 256 characters to be encoded into a single code page.

Unicode is emerging as the preferred encoding system to allow multi languages to co-exist in a single document regardless of the presentation platform (print or view). This has greatly reduced the errors and incompatibility in displaying and transforming information.

There can be some issues in translating between or among the three schemes since there may be mismatches in the number of characters supported. For a standard translation of the numbers 1, 2 and 3 the Hex values are:

EBCDIC - X’31’, X’32’, X’33’
ASCII - X’F1’, X’F2’, X’F3’
Unicode - X’0031’, X’0032’, X’0033’

Packed Decimal Format

Packed Decimal Data format was developed by IBM in the 1960s to increase transmission times and reduce overall storage space. Packed-decimal format means that each byte of storage (except for the low order byte) can contain two decimal numbers. The low-order byte contains one digit in the leftmost portion and the sign (positive or negative) in the rightmost portion. The standard signs are used: hexadecimal F or C for positive numbers and hexadecimal D for negative numbers. The packed-decimal format looks like this:

Decimal BCD
Digit 8 4 2 1
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
5 0 1 0 1
6 0 1 1 0
7 0 1 1 1
8 1 0 0 0
9 1 0 0 1

The packed data format reduces the storage space by half. For the digits 92 in a longer number, for example:

Standard 8 Byte Storage
Decimal 9 2
Binary 10010000 00000010
Packed Data Storage
Decimal 9 2
Binary 1001 0010

Packed data format also needs to assign a positive or negative value.

Sign BCD Sign
Digit 8 4 2 1
A 1 0 1 0 +
B 1 0 1 1
C 1 1 0 0 +
D 1 1 0 1
E 1 1 1 0 +
F 1 1 1 1 +

An example of a negative value being stored is:

1 2 3 4 5 6 7 -
0001 0010 0011 0100 0101 0110 0111 1101

Data file structure can take on many different variations. It defines methods to store information and use the information to generate a readable document. Key to this is the use of metadata to give each field meaning. In the Fixed Length Records and Fields example below the column headings (First Name, Last Name etc.) are metadata that give meaning to the data values. The different common file structures described below each have the metadata conveyed in their own way.

Fixed Length Records and Fields are common. They have a defined length for each field and a maximum length of all fields combined. In the example below the overall record length is 67. Each field can only hold the maximum number of characters defined in the record layout.

First Name Last Name Address City State
Length 10 15 20 20 2
Position 1-10 11-25 26-45 46-65 66-67

Variable Length Record and Fields are also common in the print and processing area. Using variable length records saves disk space. There are record descriptors in the front of every record to allow the program or user to know the correct length of the record. The most common record descriptor is a 4-Byte RDW (Record Descriptor Word). Interpretation of RDWs is very basic. The first two bytes define the overall length of every record.

A simple example, where the value in the file is a HEX X‘2600’, this value in decimal for the first two bytes and record length would be 38 positions long. The last two positions are reserved and not used to determine the overall record length.

Delimited Fields are common for storing and reading data. A delimiter is used to separate fields within the same record.

An example of a layout for name and address using a basic delimited file would be (Sam,Smith,15 Main Street,Anytown,MA). Here a comma is used to separate the fields. There can be a risk in using a comma to delimit fields. Some fields can contain a comma as part of the field. A person’s surname can contain a title (Smith, MD). Here the value MD would be separated into its own field. A common approach is to place each field within or choose a delimiter that will never be part of the actual fields. All are valid formats and can be used to parse and use the data.

(Sam,Smith,15 Main Street,Anytown,MA)
(‘Sam’,’Smith, MD’,’15 Main Street’,’Anytown’,’MA’)
(Sam|Smith, MD|15 Main Street|Anytown|MA)
(‘Sam’|’Smith, MD’|’15 Main Street’|’Anytown’|’MA’)

XML Data Structure Records are a set of rules for encoding documents for data presentment. The record format is both human readable and computer readable. It is broken down into a few elements identified by tags identified by angle brackets. These tags are also the metadata. Each tag must have an open and close and within the tag there may be elements which also require and open and close tag pairs. In the example below there is a document with transaction elements and summary information.

<document> 
<MailState>NEW YORK NY 10019 -</MailState> 
<FeesCharged>$0.00</FeesCharged> 
<InterestCharged>$0.00</InterestCharged> 
<openclosedate>01/05/12 - 02/04/12</openclosedate> 
<TRANS>
<transdate></transdate> 
<transdescr>PAYMENTS AND OTHER CREDITS</transdescr> 
<transamt></transamt> 
</TRANS> 
<TRANS> 
<transdate>01/05</transdate> 
<transdescr>Payment - Thank You</transdescr> 
<transamt>-1,000.00</transamt> 
</TRANS> 
<TRANS> 
<transdate>01/08</transdate> 
<transdescr>Payment - Thank You</transdescr> 
<transamt>-1,000.00</transamt> 
</TRANS> 
<TotalFeesChargedInYYYY>$0.00</TotalFeesChargedInYYYY> 
<BALGRP> 
<balancetype>Purchases</balancetype> 
<apr>0.00</apr> 
<adb>$1,690.42</adb> 
<finchrg>-0-</finchrg> 
</BALGRP> 
<DaysInCycle>31</DaysInCycle> 
</document>

Record termination allows the program or operating system to know when a line feed (new line) is needed for presentment and readability. Windows primarily uses a carriage return and a line feed (Hexadecimal of X‘ODOA’), whereas the UNIX environments use a single value X‘OA’ to terminate a line.

Multi-Record Format Data describes a file constructed of data for processing by defining the record layout within a file by the record type.

12345678901234567890123456789012345678901234567890123456789012 
010NAME	   ADDRESS1	 ADDRESS2	CITY   	ST	ZIP	ACCTNUM 
020DATE	 TOTALDUE     INTEREST      PASTDUE      DATEDUE
030DATE DETAILTRANSACTION               AMOUNT              INTEREST 
030DATE DETAILTRANSACTION               AMOUNT              INTEREST
030DATE DETAILTRANSACTION               AMOUNT              INTEREST
030DATE DETAILTRANSACTION               AMOUNT              INTEREST
990ENDOFRECORD

In this data type the record type in the first few bytes of the record determines how to read each record for processing. Each record type will have a different position and different fields pertaining to the records. In the example above, the 01 record is the header record and all records between 01 and 99 are tied to the same recipient. Here each type of record (01, 02, 03…etc.) will be used to generate a document for print or Web view.

Field storage and extraction for processing are different from system to system. A relational database could have many information tables. Storing dates, currency, and names with upper and lower case sensitivity can require different formats. Presenting the date and currency in the correct format starts with how the information is stored and what the final intended presentation format should be. Most data element storage eliminates special characters and punctuation. A dollar value of $1,250.55 would be presented in a data file as 125055. This allows the presentment of text based on the country and separate placement of the currency character. Dates are more complex and are stored in many variations. A common format is to use a two position month and two position day of the month, with a four position year: 12012013. This allows the end presentment to be displayed in accordance with local or user conventions. This conversion typically takes place in a process described in the following section.

=== EXTRACT, TRANSLATE and LOAD (ETL) ===

Extract, Translate and Load (ETL, also called Extract, Transform and Load) is the process of gathering data from various source systems and preparing it for use.

Extract: Gathers the data from its various sources

Translate: Normalizes it by converting it from its stored forms into consistent human readable character data, and formatting the character data (date, currency, phone numbers, etc.) to conform to national or other conventions.

Load: Writes it to suitable files and locations for the composition step.

Most modern composition tools contain some ETL functionality, but given the variety of database systems in use and the complexity of most business processing schedules, the most efficient and effective ETL will generally be achieved by custom programming work in-house.

The date is an example which can be stored and presented many ways depending on the country of origin.  The format MM/DD/YY is common to the United States. Most of Europe uses DD/MM/YY. Japan uses YY/MM/DD. The separators may be slashes, dashes or periods. Some locales print leading zeroes, others suppress them. If a native Japanese speaker is reading a US English web page from a web site in Germany that contains the date 03/04/02 how do they interpret it?

ISO 8601 specifies YYYY-MM-DD, 2003-11-05. Looking at 11/05/03 on a document a US citizen would read as November 5, 2003 and some in Japan would see May 3, 2011.

Below is a record from the USPS that contains a scan of a mailing document within the USPS process. Highlighted in bold is the date including a time stamp for when the document was scanned at the USPS sort facility.

21280,891,09/20/2013 23:59:59,80919123412,00050021280300000060
30026,919,03/08/2013 05:36:44,30047430337,00270121280300000061

The programmer must define the record layout of each field for processing. This record contains five fields which are described below. The task of the developer is to determine the layout of the date and how to parse the information and store the data. Below is a snippet of code which outlines the format of the date field.

FILE NAME=Description='OneCode Confirm data file',FORMAT=CSV, 
  FIELD=(uspszip,1), 
  FIELD=(opcode,2), 
  FIELD=(datetime,3), 
  FIELD=(userzip,4), 
  FIELD=(uniqueid,5);

The date format the developer would present on a document as:

September 20, 2013 11:59PM and March 8, 2013 5:36AM 

Might be coded as:

DOCTRACKINGDBSESSIONINIT=alter session set NLS_DATE_FORMAT = 'yyyy-mm-dd';alter session set NLS_TIMESTAMP_FORMAT = 'yyyy- mm-dd   HH24:MI:SS' 

The process of storing date information as numeric values allows the processing environment to sort effectively and quickly. Storing months by name does not allow correct sort routines.

Currency storage can be basic, but presentment is the challenge. Storing the value as numeric only has its advantages, but the final presentment requires a process. Storage of the value of 1,234.99 would be 123499 and -1,234.99 would be store as 123499-.

Presenting the amount depends on the origin or the consumer. The developer would be required to format the data and place commas and decimals when required. The dollar value above can be presented as $1,234.99 where as the negative value can be presented as:

$1,234.99- 
-$1,234.99 
($1,234.99) 

Different currencies have different formatting practices, for example separating thousands with commas or periods, and some have different numbers of decimal places.

Similar conventions apply to telephone numbers in different countries (e.g. France, where numbers are presented 0A.BB.BB.BB.BB), and to other data types (e.g. social security and health insurance numbers), depending on the business application area.