Transformation
Body of Knowledge |
---|
Document Production Workflow |
Lifecycle Category |
Transformation |
Content Contributor(s) |
Chris Halicki edp and Linda McDaniel edp |
Original Publication |
August 2014 |
Copyright |
© 2014 by Xplor International |
Content License |
CC BY-NC-ND 4.0 |
What Is Document Transformation?
Document Transformation is the process of converting a document file from one print or view format to another. Examples include transforming an AFP file to PDF or a Metacode file to PCL. Transforms provide a variety of options. They make it possible to take a file composed for a legacy print file format and make it available online in a standard online format, such as PDF, TIFF or HTML. Transform software may also be used to standardize print formats across an enterprise. Print streams such as AFP, Metacode or PDF can be converted to print on alternative enterprise printers or on distributed and departmental PCL or PostScript printers. This is a common requirement after a merger or corporate re-organization. Many transform products also support manipulation, standardization, and indexing of a print stream before it is loaded into an archive or when documents are moved from an old archive system to a new one, as well as concatenation of files prior to printing.
In a production workflow, transformation normally follows document composition prior to printing or archiving, or it can be called as part of archive retrieval to deliver the document in a format that is more user-accessible.
What Happens in a Transform?
Transformation software parses an input file and its associated print resources (fonts, forms, images, page layout and control files) and generates the desired output format. The generated output may include inline resources or may contain pointers to external resources. External resources may be created in advance or may be built as part of the transformation. Most software allows for some changes to the document pages, such as adding watermarks or repagination, as part of the transformation.
While transforms generally convert page objects in one format to matching objects in a new format, due to the differences among print and view data streams transforms cannot always convert to equivalent objects in the output. Different types of page objects may behave differently during the transform process. (Refer to the Print Streams sections of this book for more details on each of the formats discussed below).
Text & Fonts
AFP, PCL 6, PDF and HTML all support TrueType or OpenType fonts. In some cases you will be able to use the same fonts in the output as in the input, but very often that is not the case. Xerox DJDE-conditioned line data and Metacode use orientation-specific bitmapped fonts, and older-style AFP files may also use bitmapped fonts. PCL 4 or PCL 5 files may use printer-resident fonts.
Transforms may do an on-the-fly conversion from an input font to an output font (if the font is available) or they may use a font mapping table. Transforms may also convert text to images.
In general, using font mapping when it is supported allows the transform program to run more efficiently because the font characters do not need to be converted. However, depending on the selection of fonts available in the output format, the fonts may not be an exact match and spacing issues may occur in the output.
Bitmapped / Raster | Outline / Scalable / Vector |
---|---|
Xerox | Adobe Type 1 |
Most AFP (C0…) | Windows TrueType |
Some PCL | OpenType |
Adobe Type 3 | Most PCL |
 | Some AFP (CZ…) |
If bitmapped fonts are automatically converted or rendered as images they may be scaled depending on the resolutions available in the input and output formats. Scaling may cause very thin lines and serifs to disappear and spacing may change slightly due to the differences in resolution if the new resolution is lower or not an even multiple of the original resolution.
Transform tools that offer font mapping may permit fonts to be mapped or rendered to an image selectively on a font-by-font basis. When choosing fonts for a font mapping table the best practice is to be aware of the character widths, the x-heights, the point size and style.
In the following example the text alignment variations are shown for 10 point Times New Roman mapped to 10 point Garamond or to 10 point Century Schoolbook:
Times New Roman
A line with some CAPITAL LETTERS and lowercase letters and some numbers 1234567890
Garamond
A line with some CAPITAL LETTERS and lowercase letters and some numbers 1234567890
Century Schoolbook
A line with some CAPITAL LETTERS and lowercase letters and some numbers 1234567890
The Garamond characters are smaller and narrower while the Century Schoolbook characters are wider than the characters in the Times New Roman font. Even though the fonts are quite similar and the point size is the same, the line lengths are different. The words and numbers shift to the right or the left. If each word was positioned exactly as in Times New Roman it could cause overlapping text or large spaces between words. If you are using transform software it is useful to acquire font creation tools or contract for services to ensure that you have matching fonts for your chosen output format.
Even fonts that have the same names or very similar names may not share identical x-heights, character widths or kerning characteristics. A Times New Roman font is not necessarily the same as the Times font or TmsRmn, even though they share the same design basis.
Encoding
There may be differences in the characters that are included in each of the input and output fonts, or differences in the encoding used in the input and the output. AFP and Metacode are often EBCDIC encoded, whereas PCL, PDF, PostScript and HTML are normally ASCII encoded. Most modern formats now support Unicode, but when transforming legacy formats, translation tables may be required to convert hex values to Unicode, or vice versa.
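As a minimal sketch of the translation step described above, the following Python snippet decodes EBCDIC bytes to Unicode and back. It assumes the text uses IBM code page 037 (Python's built-in `cp037` codec); a real print stream may use a different code page, which is normally identified by its font resources. The function names are illustrative only.

```python
# Sample bytes as they might appear in an EBCDIC print record.
# Assumption: the text is encoded in IBM code page 037.
ebcdic_record = bytes([0xC8, 0x85, 0x93, 0x93, 0x96, 0x40, 0xF1, 0xF2, 0xF3])


def ebcdic_to_text(record: bytes, codepage: str = "cp037") -> str:
    """Decode EBCDIC bytes to a Unicode string using a built-in codec."""
    return record.decode(codepage)


def text_to_ebcdic(text: str, codepage: str = "cp037") -> bytes:
    """Encode a Unicode string back to EBCDIC bytes."""
    return text.encode(codepage)


decoded = ebcdic_to_text(ebcdic_record)
print(decoded)  # Hello 123
```

Characters that have no equivalent in the target code page raise an error on encoding, which is one place where a custom translation table would be needed.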
Some composition tools, especially print drivers, can use non-standard, dynamically encoded font subsets. In PDF files this is referred to as Identity-H encoding. Because the encoding of these fonts changes each time they are created, they can cause issues for transformations to fonts with standard encoding. Generally, Identity-H encoded fonts will need to be transformed to another custom font or to an image.
Raster Images
All standard formats support raster images. There are usually few issues transforming black and white or single color images from one format to another. If the image is scaled and the resolution of the output format is lower than the resolution of the input format or the resolution is not an even multiple of the original resolution, very thin lines may disappear and slight spacing changes may be evident.
Images may be either inline in the data or stored in an external resource file that is called during page formatting. In a transformation, external images may be put inline in the output file, or maintained as separate resource objects. In some cases, such as transforms between AFP and Xerox Metacode, a separate transform may be executed on the image files to generate the new image files to be referenced in the transformed output. If transforming print files to HTML you may have the option to replace black and white images used for printing with color images optimized for viewing on the screen. If you are generating XML output, raster images may simply be ignored, and deleted from the output.
Color Images
Color can offer more challenges, especially when converting a print format to a screen format, or a screen format to a print format. Color images for the screen are generally represented as a combination of red, green and blue light values (RGB), expressed as fractions (0-1), decimal values (0-255), or hex values (00 – FF). Printed color images are usually represented as combinations of Cyan, Magenta, Yellow, and Black (CMYK). While there are standard formulas available for converting between RGB and CMYK color values, if the color temperature of the screen is not correctly calibrated or if the blend of the colorants in the ink or toner is slightly different, there may be a slight difference in the color that is presented in the transformed output. If possible, ICC color profiles (.icc) should be used to maintain color fidelity. If transforming from a color format to a black and white format for high-speed printing, colored images may be dithered to create patterns of black and white dots that approximate the color shade. There are many standard algorithms available for dithering; if your transform software offers a choice, you may need to try several different options to find one that provides the best results with your images.
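One widely used simple formula for RGB to CMYK conversion can be sketched as follows. This is a naive, device-independent approximation with no ICC profile applied, so it illustrates the arithmetic only and is not what a color-managed transform would produce. The function name is illustrative.

```python
def rgb_to_cmyk(r: int, g: int, b: int) -> tuple[float, float, float, float]:
    """Naive conversion from 0-255 RGB to 0-1 CMYK (no ICC color management)."""
    rf, gf, bf = r / 255, g / 255, b / 255
    k = 1 - max(rf, gf, bf)          # black is set by the brightest channel
    if k == 1:                       # pure black: avoid dividing by zero
        return (0.0, 0.0, 0.0, 1.0)
    c = (1 - rf - k) / (1 - k)
    m = (1 - gf - k) / (1 - k)
    y = (1 - bf - k) / (1 - k)
    return (c, m, y, k)


print(rgb_to_cmyk(255, 0, 0))  # (0.0, 1.0, 1.0, 0.0)
```

Pure red maps to full magenta plus full yellow. Real inks and toners rarely reproduce that mix identically, which is why profile-based conversion is preferred when it is available.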
Vector Graphics
Line drawings, bar graphs or pie charts, or simple lines and boxes are often created using vector graphics. If your output format supports vector graphics (AFP, PCL, PDF, PostScript) the vectors are usually transformed to the equivalent vector commands in the output format. If the output format uses a different resolution than the input format, small openings may occur in the corners of boxes or in other graphics built from multiple individual lines. If your output format is TIFF or Metacode, the vector commands will have to be converted to raster (bitmapped) images. This can increase the file size of the output file. In some cases, multiple vector lines will be combined into a single image, but if the vector graphics are widely scattered on the page, it may be more efficient to create multiple smaller images in the output.
If converting to an XML schema confirm how the transformation handles graphics. Some transforms convert text but not the graphic elements, some require the graphic elements to be converted separately and linked back, and others simply ignore the graphic and image elements.
Shaded Areas
Shaded areas may be generated with either single color fills or fills created by repetition of a raster bitmap pattern. Shade fills may be transformed to equivalent instructions in the output format if equivalent objects are available. If converting to TIFF or Metacode, vector fill instructions will need to be transformed to raster images. If transforming to a format with a different resolution, bitmap shade patterns may end up looking like Op Art or basket weave patterns due to scaling.
When converting a print format to a screen format bitmapped fill patterns may be mapped to single color fills to look nicer on the screen and avoid the Op Art effects. Shading may be removed to reduce the output file size or to make it easier to read text in the shaded area.
Figure: three sample fills. 50% B&W raster fill (checkerboard) | 50% B&W raster fill scaled 80% | 50% gray color fill (no change when scaled) |
When converting a screen format like PDF to a high-speed black-and-white print format you may need to replace color fills with hash pattern fills so that 2 different colors do not end up looking like the same shade of gray. PDF supports gradient fills, where the fill starts as one color on one edge and fades to a different color on the opposite edge. This type of object is only available in a few output formats so it may disappear or be replaced by a solid color fill in the new output.
Forms
Electronic forms contain content that can be used repeatedly in a print file even though it is sent to the printer only once. They go by different names in different formats: AFP Overlays, Metacode FRMs, PCL macros, or generally Reusable Objects. In a transform they may be converted to an equivalent reusable resource (inline or external) or the instructions they contain may be “flattened” and repeated on each page where they are used. Some transform software requires matching form objects to be pre-created or pre-converted through separate processes.
Pre-printed forms will not automatically appear in transformed output since their artwork is not contained in the variable-data print file. If transforming to another print format, you can usually continue to print on pre-printed paper, but all spacing should be tested prior to production. If transforming to PDF or another viewing format you may have the option to add a digital copy of the pre-printed form, usually in JPEG or PDF format but other formats may be supported. Scanned images of the pre-printed pages are usually the least efficient way to add this information as the file sizes will be much larger with a scanned image than with a digital format that supports text and graphics.
Non-Printed Data
Input files may contain data that does not appear on the printed or viewed page, including indexes, AFP TLEs, NoOps, comment records, invisible text and other metadata. When there is non-printed data, the best practice is to ensure that the purpose of the data is mirrored in the output file to maintain the integrity of the file. However, because non-printed data does not affect the look of the output page, many transforms discard it. It is common to see Comment or NoOp records transformed to equivalent non-operational data in the output format.
Non-printed data may also be used to generate index files as part of a transform process. Text in an invisible font (a font built with no bitmaps or character outlines) may appear on the output pages if the font is not mapped to an appropriate empty or white output font. Additional comments or NoOps may be added to the new output to indicate how the file was created or to document any changes that were made during any document manipulation steps.
Page Layout & Layered Objects
Page layout instructions such as margins and line spacing are normally preserved in a transform between print languages. Due to the nature of HTML, page layout may not be preserved in transforms to HTML, especially if the HTML window size cannot be controlled.
Some formats allow page objects to overlap. Depending on the format the top layer may totally cover the underlying objects, but in other formats, the background is transparent, rather than white, and underlying objects may show through. Transforms may flatten all page objects into a single layer, or may group like objects together in the output – first text, then images, then forms, for example. This may cause text to be covered by an image, or text to be uncovered if the images are laid down first. Transform software may have options to maintain original order or optimize order by grouping objects.
Processing Instructions
Print files may contain printer and processing instructions, such as paper tray calls, duplex printing information, and binding instructions. If transforming to a view format like PDF, HTML or TIFF, these printer instructions may be discarded (screens are usually not 2-sided). For transforms to a print format, they may be automatically transformed to equivalent processing instructions, but they may also need to be set up in the transform configuration file, such as mapping “main” to “tray 1” and “aux” to “tray 2”. Transform parameters may also be used to define what page size is associated with a paper tray call, such as specifying that tray 1 contains Letter-size paper or tray 2 contains A4 paper. This information may be needed to generate appropriately-sized pages in output formats like PDF or TIFF.
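The tray mapping described above can be pictured as two small lookup tables. The sketch below is hypothetical: the table names, the tray labels and the function are illustrative, and actual transform products define their own configuration syntax. Page sizes are given in points (1/72 inch).

```python
# Hypothetical configuration: input tray call -> output tray name.
TRAY_MAP = {"main": "tray 1", "aux": "tray 2"}

# Hypothetical configuration: output tray -> page size in points (1/72 inch).
TRAY_PAGE_SIZE = {
    "tray 1": (612, 792),  # US Letter, 8.5 x 11 inches
    "tray 2": (595, 842),  # A4, 210 x 297 mm, rounded to whole points
}


def resolve_tray(input_tray: str) -> tuple[str, tuple[int, int]]:
    """Return the output tray and the page size to generate for a tray call."""
    output_tray = TRAY_MAP[input_tray]
    return output_tray, TRAY_PAGE_SIZE[output_tray]


print(resolve_tray("aux"))  # ('tray 2', (595, 842))
```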
Transforming for Online Viewing: Orientation and Resolution
If you are transforming the document for on-screen viewing instead of print you must consider orientation and resolution.
The majority of our documents are created in the portrait orientation. Most of our screens are in landscape orientation. If you shrink the page down to fit the entire page on the screen, it may be too small to read. If you display the same layout at a readable size, users will need to scroll to see some of the page contents. Alternatively, you may decide to redesign the documents, so that the online version does not exactly match the printed document, but still provides the information that your customer needs. In some industries, there may be compliance issues if the online version is dramatically different from the printed version.
A simple conversion of landscape pages may actually display them sideways on the screen (from top-to-bottom, or bottom-to-top) on a portrait page, and tumble-duplex print may display the backs of pages upside down. Transforms will normally need to be configured to rotate pages to make text readable on the screen. If some pages in the file are landscape and others are portrait, the transform software will normally need to analyze the direction of the text on each page, so that each page can be rotated appropriately for on-screen viewing.
As the Dots Per Page table below shows, even our highest resolution screens barely come close to the resolution of our lowest resolution printers. Screen pixels are easily double or triple the size of printed dots. To represent on a screen exactly the same number of dots as on a printed page would require a screen 2 to 4 times the size of the printed page. If a PDF or TIFF file is scaled to the size of a standard laptop screen, fine lines and fine details on images can disappear, or become proportionally thicker, due to round-off of fractional dots.
If input documents normally print on preprinted paper, matching images will need to be added to replicate the look of the paper on the screen. The transform software may be configured to bring in a JPEG or other image formats for company logos, text for forms, or add colored backgrounds. A full page PDF of the preprint design may also be used.
Dots Per Page | | | |
---|---|---|---|
Screens | | Paper | |
CGA – 320 x 200 | = 64,000 | US Letter @ 240 dpi | = 5,385,600 |
VGA – 640 x 480 | = 307,200 | US Letter @ 300 dpi | = 8,415,000 |
XGA – 1024 x 768 | = 786,432 | US Letter @ 600 dpi | = 33,660,000 |
SXGA – 1280 x 1024 | = 1,310,720 | US Letter @ 1440 dpi | = 193,881,600 |
UXGA – 1600 x 1200 | = 1,920,000 | A4 @ 240 dpi | = 5,567,104 |
HD – 1920 x 1080 | = 2,073,600 | A4 @ 300 dpi | = 8,699,840 |
QSXGA – 2560 x 2048 | = 5,242,880 | A4 @ 600 dpi | = 34,799,360 |
 | | A4 @ 1440 dpi | ≈ 200,440,000 |
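The counts in the table are simple products of width and height in dots. A minimal sketch of the arithmetic, assuming US Letter is 8.5 x 11 inches and rounding each dimension to a whole number of dots (function names are illustrative):

```python
def screen_dots(width_px: int, height_px: int) -> int:
    """Total pixels on a screen of the given dimensions."""
    return width_px * height_px


def page_dots(width_in: float, height_in: float, dpi: int) -> int:
    """Total printable dots on a page, rounding each side to whole dots."""
    return round(width_in * dpi) * round(height_in * dpi)


letter_300 = page_dots(8.5, 11, 300)   # 2550 x 3300 dots
hd_screen = screen_dots(1920, 1080)
print(letter_300, hd_screen)           # 8415000 2073600
print(letter_300 / hd_screen)          # roughly 4 HD screens per Letter page
```

A Letter page at 300 dpi holds about four times as many dots as an HD screen has pixels, which is why a full page scaled to fit the screen loses fine detail.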
Transforming for Enhanced Accessibility
When transforming to PDF, you have the option to add features to your documents to make them more usable. Text strings can be linked to URLs to allow users easy access to additional information provided on websites (this can also be done in HTML pages). Bookmarks can be added to a PDF document to provide quick access to selected sections of the documents.
But more importantly, documents can be transformed to PDF/UA, which allows them to be read correctly by screen reader software, making them available to visually impaired customers as well as sighted customers. Transform software can be configured to automatically tag headings, tables and lists. Alternate text can be added for images and correct reading order can be assigned.
Transforms can also be used to generate Braille-ready format used by dynamic Braille displays or Braille printers for visually impaired customers who have been trained to read Braille. This type of transform may be triggered based on a customer’s delivery preferences stored in a company’s customer database.
File Size Considerations
After a transform the file size will not be the same as the original. It could be as small as 1/10th of the original size or as large as 15 to 20 times the size of the original input. There are two major reasons for the differences – compression and resource placement.
For example, if you transform a single AFP (ACIF) file with all resources inline into a single PDF file (with all resources inline), the PDF file is likely to be about 50% of the size of the AFP file because the Flate compression used by PDF is a very tight compression algorithm. If you transform the same AFP file into multiple PDF files, as occurs when you convert a complete print run into individual statement documents for each customer, the total size of all the PDFs put together is likely to be many times the size of the original AFP file, because resources such as images, forms, and fonts are needed for each individual PDF file rather than stored once and used multiple times. File sizes can also increase if text or vector graphics are transformed to bitmapped images, such as occurs in transformations to TIFF.
Some transform software allows you to tune the output file size by setting parameters for image DPI, vector/raster formats, compression options, and resource placement (external or inline, on each page or as reusable objects). File sizes can vary greatly based on font selection, the number and size of embedded images and color fills, and the use of reusable or external objects. Always make sure you run a variety of tests with your own data to determine the appropriate amount of disk space needed for your transform output.
Planning a Printstream Transformation
There are three main steps to consider in setting up a print file transformation:
- Determining What You Have.
- Managing the Resources.
- Adding Value to the Output.
Determining What You Have
The format of the print file will vary depending on the composition system you are using and the brand and model of printer the file was designed for. While IBM/InfoPrint/Ricoh and Canon/Océ printers normally accept AFP and PDF, Xerox printers may accept Metacode, PostScript, PDF, PCL or even AFP, depending on the model.
You might even be using line data or a variation of line data such as Xerox LCDS or AFP conditioned line data. Your input may be encoded in EBCDIC if it is created on a mainframe or in ASCII if it is created on a workstation or network server. There are also variations within the major print formats. AFP can be well-formed MO:DCA-P or output from ACIF, or it may be generated by programs that do not create well-formed AFP. The AFP may also have TLE indexes. Xerox Metacode can be designed for printers running online or offline, and may be mixed in the same file as LCDS. PCL could be version 3, 4, 5 or even 6. PostScript can be level 1, 2 or 3, and can possibly even contain inline programming. Even PDF has many variations. To determine what you have, first check the output options settings in your composition system.
If this is legacy data, that may not be an option. Your document designers or printer operators may be helpful in filling in the details. Alternatively, you can try opening the file in a text or hex editor. If everything in the file is readable in a text editor, your file is probably line data. If you see $$DJDE or $XEROX$, followed by a mix of readable and unreadable characters, you probably have Xerox Metacode. If the file starts with %!PS it is probably PostScript. PDF files normally start with %PDF. If there are unreadable characters, switch to a hex editor, preferably one that shows both ASCII and EBCDIC. If the first character is a hex ‘5A’, your input is probably some flavor of AFP. AFP records normally start with hex ‘5A’, followed by a 2-byte length field, followed by a hex ‘D3’. If the file starts with hex ‘1B 45’ it is probably some level of PCL.
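These manual checks can be approximated in a few lines of Python. The sketch below is a heuristic only, not a validator. It uses the signatures `%PDF-` and `%!PS`, which is what PDF and PostScript files begin with in practice, and it assumes code page 037 when looking for EBCDIC-encoded DJDE or XEROX markers. The function name and the returned labels are illustrative.

```python
# Bytes considered "readable" in a text editor: printable ASCII plus
# tab, line feed, form feed and carriage return.
PRINTABLE_ASCII = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0C, 0x0D}


def sniff_print_format(data: bytes) -> str:
    """Guess a print-stream format from the start of a file (heuristic only)."""
    head = data[:4096]
    if not head:
        return "unknown"
    if head.startswith(b"%PDF-"):
        return "PDF"
    if head.startswith(b"%!PS"):
        return "PostScript"
    # AFP record: x'5A', 2-byte length, then an identifier starting with x'D3'.
    if len(head) >= 4 and head[0] == 0x5A and head[3] == 0xD3:
        return "AFP"
    if head.startswith(b"\x1bE"):  # ESC E, hex 1B 45
        return "PCL"
    # Xerox markers may be ASCII or EBCDIC; cp037 is an assumed code page.
    for marker in ("DJDE", "XEROX"):
        if marker.encode("ascii") in head or marker.encode("cp037") in head:
            return "Xerox LCDS/Metacode"
    if all(b in PRINTABLE_ASCII for b in head):
        return "line data"
    return "unknown"


print(sniff_print_format(b"%PDF-1.4\n"))  # PDF
```

EBCDIC line data with no Xerox markers falls through to "unknown" here, so a result of "unknown" is a prompt to open the file in a hex editor rather than a verdict.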
Managing the Resources
The next step is finding the print resources and determining how they will be handled by the transform. Print resources are all of the other files or objects used by the printer to resolve the document, such as fonts, images, electronic forms or formatting files.
Most print formats use a combination of inline and external resources. Transforms need to have access to these resources to get the best results.
AFP resources are normally stored in mainframe resource libraries, and sent to the printer with the print file. AFP resources include coded fonts (X0... or XZ...), code pages (T1...), character sets (C0... or CZ...), page segments (S1...), overlays (O1...), form definitions (F1...) and page definitions (P1...). The file names are normally limited to 8 characters. Recent AFP files may also use OpenType fonts.
Xerox Metacode and LCDS print files normally use resources that are stored on the printer. The file names are limited to 8 characters with a 3 character extension. Xerox resources include fonts (.FNT), images (.IMG), logos (.LGO), forms (.FRM), and JSL Formatting Files (.JSL). When the JSLs are compiled on the printer, other resources can be created like job descriptor libraries (.JDL), print descriptor entries (.PDE), and copy modification entries (.CME).
PCL often uses fonts that are installed on the printer at the factory and not available to the user. Different models of printers may have different standard fonts installed, making it difficult for a transform to accurately emulate the printer output. All other PCL resources are normally found inline, but occasionally external macros are downloaded to the printer prior to print time. PostScript files also use printer-resident fonts and may use internal or external macros.
If the resources are available, the document transformation engine can normally convert them to matching PDF objects on the fly. However, converting fonts on the fly can be inefficient. To solve the problem many transform programs use font and resource correlation tables.
In a Correlation Table the user can define which output fonts to use to match each input font. If the input file uses Helvetica, Times-Roman and Courier (and their bold, italic, and bold-italic variations), the fonts can be mapped to the standard PDF Base-14 fonts. If the input file uses older fixed pitch fonts or more decorative fonts, matching TrueType or Type 1 fonts should be found (purchased or created) so they can be embedded in the PDF file by the transform. That way the fonts will be available for proper viewing of the PDF. This will increase the size of the PDF file, usually by 20-30KB per font.
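Conceptually, a correlation table is a lookup from input font names to output font names. The sketch below is hypothetical: the input font names, the table layout and the three action labels are illustrative, and each transform product has its own table format. The Base-14 names are the standard PDF font names.

```python
from typing import Optional

# The standard PDF Base-14 fonts, which do not need to be embedded.
BASE14 = {
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Symbol", "ZapfDingbats",
}

# Hypothetical correlation table: input font name -> output font name.
FONT_CORRELATION = {
    "HELV": "Helvetica",
    "TMSRMN": "Times-Roman",
    "COUR12": "Courier",
    "OCRB10": "OCRB",  # assumed to be a separately acquired font to embed
}


def correlate_font(input_font: str) -> tuple[Optional[str], str]:
    """Return (output_font, action) for an input font name."""
    output_font = FONT_CORRELATION.get(input_font)
    if output_font is None:
        return None, "rasterize"      # no mapping: fall back to an image
    if output_font in BASE14:
        return output_font, "base14"  # reference only, nothing embedded
    return output_font, "embed"       # font file is added to the PDF


print(correlate_font("OCRB10"))  # ('OCRB', 'embed')
```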
Similar processes apply to other transform pairs.
Adding Value to the Output
When converting a print file to PDF it is possible to add resources to make the document more useful and appealing to the end user. For long documents bookmarks can be used to create a clickable Table of Contents for the PDF. URL links can be added to create links to the company website.
For transforms to PDF and some other file formats it is usually possible to change font color and background color as well as to add or delete images.
If inserts are usually included in the envelope with the printed document, most transforms support the ability to selectively add new pages to the file to permit replacement of those inserts. It is usually possible to add or delete text, such as disclaimers or watermarks. Most transforms support the ability to mask personal information like Social Security Numbers or account numbers.
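Masking can be pictured as a pattern substitution on the page text. The following sketch assumes the text has already been extracted as a plain string and that Social Security Numbers appear in the common `NNN-NN-NNNN` layout. A real transform applies such rules to text objects inside the print stream, and the pattern and function name here are illustrative.

```python
import re

# Assumed layout: three digits, two digits, four digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_ssn(text: str) -> str:
    """Replace all but the last four digits of each SSN with X characters."""
    return SSN_PATTERN.sub(r"XXX-XX-\1", text)


print(mask_ssn("SSN: 123-45-6789"))  # SSN: XXX-XX-6789
```

Because the replacement has the same length as the original, text positioned after the masked value does not shift when a fixed pitch font is used.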
If the documents are going to be loaded into an archive or ECM system most transforms support the ability to output separate files for each account or customer, and to create an index with names, dates, account numbers and other information that can be used to load the print or PDF files into the archive and make their retrieval easier. If there are redundant pages in the input, like the “How to balance your checkbook” pages in bank statements, or blank or banner pages, they can usually be removed.
Consider all your options when planning a print file transformation. The more you know about your input data, and the resources it uses, the better off you will be. When configuring the transformation, using a set of sample data that accurately represents all the variations in your input data will help prevent issues from arising when you move to production. Think about how the final files will be used, and what might need to be added or deleted to make the file appropriate and valuable for the user.
Preparing for Transformation
When you have decided that a transformation is what you need to move a type of file or data from one format to another, you will have to investigate your own current processes and where the transformed data will be used. Whether the newly transformed data is going into a different workflow or the same workflow, there will be considerations regarding file sizes, resources and even whether the process can interpret the incoming new format.
Investigate How the Data Correctly Prints
The first thing to do is trace the entire path the data takes from generation to final output on a printer or an archive. This means following the path the data takes from data entry, extraction from a database, or composition system through any tools or utilities, no matter how simple. This can include file transfer, printer controllers, other transforms, or any in-house tools for carrying the print towards its final goal. All of these can be adding to the original data.
There are many reasons to do this tracing.
- If your print data is part of a legacy system that may be generated by COBOL or similar programs on a mainframe it may not have been touched since it was designed. This increases the odds that there is some post-processing to update or enhance it. It may already have some form of transform or secondary program used to alter or manipulate it to work with more complex printers or finishing equipment. Learning what manipulation has already occurred to this data will help define what needs to happen with a transform you will be implementing.
- Learning about these tools will also help in ensuring that the format you create will work with existing tools. For example, you may have an indexing stage that requires TLE index data in an AFPDS (AFP Data Stream) file. It is certainly better to know that before you discover the data is not indexable at the next step.
- Learning how the data prints and what it looks like prepares you for other questions. The data may be printed on pre-printed forms (roll or cut sheet). Do you need to mimic the pre-printed forms in your final transform? Do you need to be concerned with tray or bin paper pulls to emulate what you are doing currently? If going to an electronic format, you should consider that you might want to emulate the pre-printed forms with electronic overlays.
- It will be easier to answer vendor questions about your data as they should be asking these same questions.
This review will ensure that you know what the final output should look like and how it should work in the existing system, or how to adjust for a new system.
One other reason for following the path the data takes is the potential for an intermediate step to manipulate the data. Such a step might simply be assumed and no longer thought of as a step. Especially in the case of legacy data, a simple program might be just a job step in your MVS JCL, but it may be changing all the carriage controls to fit a target printer. You might have line data on a z/OS or other mainframe system that is in EBCDIC. If the data is moved to a server or workstation, it might be translated to ASCII. If that is what is being sent to the printer, that is what the transform will probably process. If the transform receives the EBCDIC data instead, changes would be needed to adjust to that format. If the data goes through a print controller, the format of the carriage controls could be changed to match a target printer. If the transform receives the data as it was before the controller changed it, it would not transform in the same way.
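One rough way to tell which encoding a sample of line data is in is to compare how many bytes fall in the ranges typical of each encoding. The sketch below is a heuristic under stated assumptions: it treats hex 20-7E as printable ASCII, and hex 40 (space) plus the usual EBCDIC letter and digit ranges as EBCDIC text. The 0.7 threshold and the function name are arbitrary choices for illustration.

```python
def guess_text_encoding(data: bytes) -> str:
    """Guess whether a sample of line data is ASCII or EBCDIC (heuristic)."""
    if not data:
        return "unknown"
    # Printable ASCII occupies hex 20 through 7E.
    ascii_hits = sum(0x20 <= b <= 0x7E for b in data)
    # EBCDIC space is hex 40; letters and digits sit above hex 80.
    ebcdic_hits = sum(
        b == 0x40
        or 0x81 <= b <= 0xA9   # lowercase letter ranges
        or 0xC1 <= b <= 0xE9   # uppercase letter ranges
        or 0xF0 <= b <= 0xF9   # digits
        for b in data
    )
    best = max(ascii_hits, ebcdic_hits) / len(data)
    if best < 0.7:             # mostly binary or mixed content
        return "unknown"
    return "ASCII" if ascii_hits > ebcdic_hits else "EBCDIC"


print(guess_text_encoding("HELLO WORLD 123".encode("cp037")))  # EBCDIC
```

Running the check on a copy taken at each stage of the workflow can show where a translation from EBCDIC to ASCII takes place.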
Determine the Data Format
The print data you have could be one of many formats. Assuming that it is some form of Xerox format because it is intended for a Xerox printer might be a mistake. Many printers have an interface that translates the data for that device. The Xerox printer could take in AFPDS data. To be sure, you will need to examine the data, preferably with a hex editor or viewer. This is because data can be EBCDIC, ASCII or both and may consist of readable text or data in an encoded format or both. The following section has some methods to identify them.
Formats
Print data can be in different formats. The most common are the following.
AFPDS formatted data or AFPDS: This contains x5A carriage controls in the first byte of each record. This data can have indexes called TLEs. The text data is often in EBCDIC but it can be ASCII if it is generated by a composition system. Often this data has been processed by an IBM utility called ACIF (AFP Conversion and Indexing Facility), which can also ensure that all of the resources are inline with the print data. Composition systems also often insert the resources inline.
You may also see EBCDIC line data that is printed on Ricoh/InfoPrint printers with the help of a pair of print resources called a formdef and a pagedef. If you see EBCDIC line data in an environment with AFPDS printers, the JCL that routes the data to the printer will show these two resources. Most transforms can process this line data. AFPDS can be printed on many brands of printers, depending on the printer controller.
No matter what type of data you have, the print instructions are sent with the print data to the final device. They may be inserted by a composition system, by a transform or by the printer management software.
Xerox print data: Xerox toner printers have a hard drive that is used to retain resources such as fonts and forms. These printers use a source file called a JSL which is compiled into a series of objects that contain printer definitions such as font lists or job instructions. When transforming Xerox data of any format, the JSL used on the printer should be provided. Some transforms can read the compiled objects, but most can read the source files.
Xerox printers can take three formats of data natively.
Metacode. Move to the middle of the file in a hex editor and look for ASCII text. If you see ASCII text and, instead of simple blanks between words, you see hex values like x06, then the data is fully composed Metacode. The print file might start with EBCDIC data that passes information to the printer, but that same data tells the printer that the rest of the file is ASCII text and formatting codes. The beginning of the file could also be ASCII, in which case all the text is ASCII. If, after reviewing the file, you cannot find any readable text after the beginning, investigate whether the file was transferred to a workstation as text and translated. If so, it is not usable, because the encoded values are now meaningless. The sample below shows Metacode ASCII text. The values between the words are command metacodes for spacing and fonts.
LCDS or Line Conditioned Data Stream. The data will most often be EBCDIC on a mainframe. The data derives its formatting information both from the job started on the printer (the JSL) and from DJDE commands in the print file, which can format the data and invoke forms. This data is most often fixed length. If the data is ASCII, examine it in hex and see whether there is a carriage return/line feed pair, hex x0D0A, after each line (x0A alone for UNIX). If there is, the data was translated as text during a file transfer. Such data might work with a transform, but because it is not in the original format, using it is strongly discouraged: any inline graphics or font index values that are based on half bytes will have been destroyed. The example below shows an LCDS file starting with printer instructions.
Line data. Plain line data is simply that: text records, usually EBCDIC, with fixed record lengths on the mainframe. All formatting is done by instructions loaded on the printer from a JSL source file. When a job is released from the print queue or spool, a job started on the printer will contain all the information needed to format the data. Below is an example of EBCDIC line data in a hex viewer.
PCL Print Data: PCL is ASCII based and often begins with ASCII text saying: @PJL ENTER LANGUAGE = PCL. All PCL commands start with the hex value x1B, which is called the "escape character". When you view the file in hex you will see this value often. PCL can also contain HPGL (Hewlett-Packard Graphics Language) commands, which are derived from HP plotters. This language can be used for fonts, but is most often used for graphics. Files often contain both formats. PCL generally contains the graphics and font data needed to print, with one important exception. PCL printers of any type have built-in scalable fonts. The system building the print data may be aware of those fonts and will not include them in the print file, but simply refer to them. This is the normal method used by Windows print drivers. Transformation software will need to address this situation.
PostScript Print Data: PostScript is a programming language and can be complex to read. It is ASCII and can contain fonts and graphics. Like PCL, it can contain references to fonts that exist on a printer and can also include font data to download to the device. Any transformation software will need to address these missing resources. PostScript files often begin with a string like "%!PS-Adobe-3.0", which identifies the file as PostScript and gives the version of the Adobe Document Structuring Conventions that the file follows.
PDF: PDF is Portable Document Format. Created as a viewing format, it is now commonly used for both print and view. In some workflows, when PDF is printed there may be an interface through one of the Adobe print engines or products from Global Graphics that RIP the file (process it to raster images) for print. PDF, like PostScript and PCL, can have fonts embedded, or it can simply refer to fonts that are expected to exist on the system that creates or renders the file. The most common of these are the standard fonts known as the Base 14 fonts. A PDF file contains a header with the flag %PDF- followed by the version of PDF.
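The signatures described above can be checked in a first-pass script before you open the hex viewer. The following Python sketch is only an illustration, and sniff_print_format is a hypothetical name. It tests the leading bytes for the PJL, PDF, PostScript, PCL and AFPDS markers. For AFPDS it also assumes that the three-byte structured field identifier after the x5A and the two-byte length begins with xD3. The Xerox formats have no simple signature of this kind, so they fall through to "unknown" and still need to be examined by hand.

```python
import re


def sniff_print_format(data: bytes) -> str:
    """Guess a print format from the first bytes of a file (heuristic only)."""
    head = data[:2048]
    # A PJL wrapper names the language of the job that follows.
    match = re.search(rb"@PJL\s+ENTER\s+LANGUAGE\s*=\s*(\w+)", head, re.IGNORECASE)
    if match:
        return "PJL job, language " + match.group(1).decode("ascii").upper()
    if head.startswith(b"%PDF-"):
        return "PDF"
    if head.startswith(b"%!PS"):
        return "PostScript"
    if head.startswith(b"\x1b"):
        return "PCL"
    # x5A, a two-byte length, then an identifier assumed to start with xD3.
    if len(head) >= 4 and head[0] == 0x5A and head[3] == 0xD3:
        return "AFPDS"
    return "unknown"


print(sniff_print_format(b"%PDF-1.7\n"))
print(sniff_print_format(b"\x1b%-12345X@PJL ENTER LANGUAGE = PCL\r\n"))
```

A result from a check like this is a starting point, not a verdict. A file that was translated as text during transfer can still carry a valid-looking header.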
Special consideration: If the files you are processing are the product of a process that concatenates individual files so they print or process more efficiently, there may be many duplicate resources in these files. For example, a PostScript printer may not care if each individual document carries its own 20 fonts inside a print file that contains 40 documents, but a transform might take every set of 20, multiply it by the number of documents in the print file, and create that many output fonts (800 in this example). This can have a dramatic effect on font counts, file sizes and transform times. An AFP target may not want hundreds of fonts, and that many fonts can also be detrimental to transform performance. There are ways to deal with these issues if they are known ahead of time.
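The arithmetic in that example, and one possible way to measure the duplication, can be sketched in Python. This is an illustration under assumptions: the fonts are stand-in byte strings, count_fonts is a hypothetical helper, and identifying duplicates by a content hash is just one approach. Whether and how a particular transform product removes duplicates is something to ask the vendor.

```python
import hashlib


def count_fonts(documents):
    """Return (total font copies, distinct fonts) for a list of documents.

    Each document is a list of font resources given as bytes.
    """
    total = sum(len(fonts) for fonts in documents)
    distinct = {
        hashlib.sha256(font).hexdigest()
        for fonts in documents
        for font in fonts
    }
    return total, len(distinct)


# 40 concatenated documents, each carrying the same 20 fonts (stand-in data).
fonts = [f"font-{i}".encode("ascii") for i in range(20)]
documents = [list(fonts) for _ in range(40)]

total, distinct = count_fonts(documents)
print(total, distinct)  # 800 20
```

The gap between the two numbers is a quick indicator of how much a concatenated file could inflate the output if every copy were converted separately.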
Where will the Transform Run?
If files must move between physical computing platforms, there are some considerations. If the data and resources are on a mainframe and need to be moved to a server platform, there is a specific protocol to follow to prevent distortion or corruption. As a general rule, movement from server to server is usually safe as long as it is done in binary mode.
Xerox – Moving from Mainframe to Server
Do not translate any print data. Move the data in a binary manner. If the data is in a fixed length data format, note that length for the transformation software. If the data is a variable length dataset, some sort of length field will need to be prepended to the records so the software can process them in record order. The transform vendor should be able to address this. The compiled Xerox resources, which are forms (FRM), fonts (FNT), logos (LGO), images (IMG), Page Descriptor Entries (PDE) and Copy Modification Entries (CME), should also be downloaded as binary and should not need a length field. The Xerox source file, the JSL, should usually be readable on all platforms, so it should be downloaded as text so that it is readable on the server.
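How a transform might split a binary download back into records can be sketched as follows. Both helpers are hypothetical. The variable length case assumes each record is preceded by a four-byte descriptor whose first two bytes hold a big-endian length that includes the descriptor itself, which is one common mainframe convention. The actual prefix depends on the transfer tool and the transform vendor, so confirm it before relying on this layout.

```python
import struct


def split_fixed_records(data: bytes, record_length: int):
    """Split binary data into records of one fixed length."""
    if record_length <= 0 or len(data) % record_length != 0:
        raise ValueError("data is not a whole number of records")
    return [data[i:i + record_length] for i in range(0, len(data), record_length)]


def split_prefixed_records(data: bytes):
    """Split records that carry an assumed four-byte length descriptor.

    Assumption: bytes 0-1 are a big-endian length that counts the
    descriptor itself; bytes 2-3 are ignored here.
    """
    records = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated record descriptor")
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        if length < 4 or pos + length > len(data):
            raise ValueError("bad record length at offset %d" % pos)
        records.append(data[pos + 4:pos + length])
        pos += length
    return records


sample = b"\x00\x07\x00\x00ABC" + b"\x00\x05\x00\x00Z"
print(split_prefixed_records(sample))     # [b'ABC', b'Z']
print(split_fixed_records(b"ABCDEF", 3))  # [b'ABC', b'DEF']
```

If either function raises an error on a real file, that is a useful early sign that the record length is wrong or that the transfer was not binary.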
AFPDS – Moving from Mainframe to Server
Do not translate any print data. Move the data in a binary manner. If the data is in a fixed length data format, note that length for the transformation software. AFP files contain their own internal length fields, so no additional length fields are needed and the files can be downloaded as binary. This is true of the resources as well; download them in a binary manner too. The only exception is line data that contains some x5A records. This is not at all common, but such data may need special handling, so ask the vendor how to download it.
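The internal length fields make it possible to walk an AFP file without any outside record information. The Python sketch below illustrates the idea and is not a full parser. It assumes a pure structured field stream in which each field starts with x5A, followed by a two-byte big-endian length that counts everything after the x5A, and then a three-byte identifier. The function name and the sample bytes are made up for the illustration.

```python
import struct


def list_structured_fields(data: bytes):
    """Return (identifier hex, length) for each structured field in the data."""
    fields = []
    pos = 0
    while pos < len(data):
        if data[pos] != 0x5A:
            raise ValueError("expected x5A at offset %d" % pos)
        if pos + 6 > len(data):
            raise ValueError("truncated structured field at offset %d" % pos)
        (length,) = struct.unpack(">H", data[pos + 1:pos + 3])
        # The introducer (length, identifier, flags, reserved) is 8 bytes.
        if length < 8 or pos + 1 + length > len(data):
            raise ValueError("bad length at offset %d" % pos)
        identifier = data[pos + 3:pos + 6]
        fields.append((identifier.hex(), length))
        pos += 1 + length  # the x5A byte is not counted in the length
    return fields


def make_field(identifier_hex: str, payload: bytes) -> bytes:
    """Build one sample field: x5A, length, identifier, flags, reserved, payload."""
    body = bytes.fromhex(identifier_hex) + b"\x00\x00\x00" + payload
    return b"\x5a" + struct.pack(">H", 2 + len(body)) + body


# Sample stream: a Begin Document field followed by an End Document field.
name = "DOC00001".encode("cp037")
sample = make_field("d3a8a8", name) + make_field("d3a9a8", name)
print(list_structured_fields(sample))  # [('d3a8a8', 16), ('d3a9a8', 16)]
```

If a file that was downloaded as binary walks cleanly from the first byte to the last, the transfer very likely preserved it. An error partway through often points to a text-mode translation or to added record prefixes.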
Things to Consider
- Know where the data to be transformed comes from, and learn what steps it currently goes through before it reaches its final form.
- What will the data be expected to do or have done to it after the transformation?
- If it is being transformed already by another process, have matching sets of data for the new process to test with, as any new transform will be compared to the old.
- Get an overview of where various processes occur and what mechanisms are used to move data from system to system.
- Understand what “resources” mean for your data, and where these resources are located.
- Moving or acquiring resources for a transform can be complex.
- Map out the path the transformed data will take in its journey to final format.
- There can be further manipulation of the data by other processes with standards the new format will need to meet.
- File sizes or other considerations that affect storage or file movement may impact the process.
Educating yourself on your data and the processing it currently goes through will enable you to enter the transform process with more assurance of success. Learning as much as possible about the later path of that data will help ensure that any new process also succeeds.