Monday morning, 28 November 2005, in a courtroom in San Diego, soon-to-be-former Congressman Randy 'Duke' Cunningham stands before a federal judge and pleads guilty to one count of criminal conspiracy related to bribery of a public official and one count of tax evasion (related to said public official not reporting bribery-related taxable income).
The legal proceeding officially recognized and endorsed the results of a months-long investigation into the Congressman's activities dating back to the year 2000. A June 2005 newspaper story on the circumstances of the sale of a house had triggered a flurry of stories about the Congressman, his generous friends, his yachts, and his real estate investment acumen. In the Plea Agreement before the court, the US Attorney's office for the Southern District of California transcribed its version of what happened in careful legal formalisms, chock-a-block with Cunningham-endorsed and allocuted fact.
As story, the published Plea Agreement is structured for legal standards, formatted for a trained and technical legal audience. As words, this formal telling is neither too obscure nor too technically dense for the rest of us. A bit dry maybe, it compensates with facts for what it lacks in color. As words, too, this version of the story is an official public record.
The Plea Agreement was soon available on the Internet as a PDF document. Whatever the reasons for using the proprietary Adobe format for publication, it is not the best format for most web users. Why? The general answer is that web documents have inherent capabilities for telling a story that aren't available to Paper&Ink media. To illustrate, here are links to the PDF document and three more webby implementations. In all four examples the words of the Plea Agreement are identical.
FindLaw.com published the Plea Agreement on the Internet as a PDF, the proprietary Portable Document Format developed by Adobe Systems. From Wikipedia:
Each PDF file encapsulates a complete description of a 2D document that includes the text, fonts, images, and 2D vector graphics that compose the document. … This feature ensures that a valid PDF will render exactly the same regardless of its origin or destination.
Stored on a computer filing system, a PDF is a BLOB, a Binary Large OBject, a chunk of contiguous bits, a long string of binary 1's and 0's. So of course, it's a blob too, an indivisible unit. The PDF data BLOB is meaningless unless swallowed whole, translated, and rendered by the PDF reader program supplied by Adobe Systems.
The PDF reader displays a faithful reproduction of the original document. In this particular case it is a picture of words; the computer file is an encoded image. The PDF reader provides controls so the reader can navigate page to page or jump to an arbitrary page. Typical web browser controls like PageUp, PageDown, and text search are not available. It is a picture, not text.
Dan Anderson of www.dukecunningham.org built this HTML version of the Plea Agreement. HyperText Markup Language documents encode both the textual content and the display information as text. HTML documents are transmitted from servers to client browsers as text files. The client web browser decodes the structured text file and renders it according to the embedded instructions. Humans, with a little practice, can read HTML documents.
Anderson used an optical character recognition (OCR) scan of the printed PDF document to extract the text content, then added HTML markup to control the display of the text and to provide both page and topical navigation to the document.
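What that markup step involves can be sketched in a few lines of JavaScript. This is only an illustration, not Anderson's actual method: the function names (escapeHtml, pageToHtml, pageNav), the "page-N" id scheme, and the assumption that the scanned text arrives as an array of lines per page are all hypothetical.

```javascript
// Hypothetical helpers; nothing here is taken from the actual site.

// Escape the characters that HTML treats as markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap one page of scanned lines in a <pre> block whose id
// a navigation link such as <a href="#page-3"> can target.
function pageToHtml(pageNumber, lines) {
  const body = lines.map(escapeHtml).join("\n");
  return `<pre id="page-${pageNumber}">\n${body}\n</pre>`;
}

// Build a simple row of page links, one per page.
function pageNav(pageCount) {
  const links = [];
  for (let n = 1; n <= pageCount; n++) {
    links.push(`<a href="#page-${n}">${n}</a>`);
  }
  return `<p class="page-nav">${links.join(" ")}</p>`;
}
```

The result is plain text that a browser can render, search, and copy, and that a human can read directly.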
(How much effort? Add notes and comments about building the web document …)
The Document Object Model is a published standard that facilitates making real-time changes to web documents using the JavaScript programming language. Real-time response makes it possible to give the viewer enhanced control over the pace and sequence of document display. Effective access control unburdens the author of anticipating every viewer's particular interest. Of course, new solutions raise new problems. How does one offer controls without distracting from the subject matter? Is there an optimal presentation format?
First pass: use the DOM to reproduce the paper document. Display the same lines on the same pages. Use a monospaced font. Done, in 8-12 hours of attention. Making the pages was, in the main, straightforward; the cover page was tricky, as were places where HTML elements (paragraphs, lists) spanned physical pages. The Page access control required a small bit of simple JavaScript programming. The finished DOM page view document content looks a lot like the PDF document content.
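A page control of this kind might look something like the sketch below. It is a guess at the general shape, not the code actually used: makePager and its method names are hypothetical, and it assumes each page is an element-like object with a style.display property (real DOM elements qualify).

```javascript
// Hypothetical sketch: `pages` is any array of element-like objects
// with a `style.display` property.
function makePager(pages) {
  let current = 0; // zero-based index of the page on display

  // Hide every page except the one at `index`.
  function show(index) {
    // Clamp so that Prev on the first page and Next on the last are no-ops.
    current = Math.max(0, Math.min(pages.length - 1, index));
    pages.forEach((page, i) => {
      page.style.display = i === current ? "block" : "none";
    });
    return current + 1; // one-based page number, as printed on paper
  }

  return {
    goTo: (pageNumber) => show(pageNumber - 1),
    next: () => show(current + 1),
    prev: () => show(current - 1),
  };
}
```

In a browser, the pager could be fed with something like makePager(Array.from(document.querySelectorAll("pre.page"))), where the "pre.page" selector is again an assumption about how the pages are marked up.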
The advantage of the DOM Page View over the PDF is mainly simplicity: fewer options are less of a burden on the viewer's attention. It requires less network bandwidth and does not require loading a custom reader module. Its display is open to client manipulations like cut'n'paste and text searching.
Its biggest problem is the same as the PDF's: the viewer's access to the information content is determined by the width and length of the printed page. And page numbers do little to help us understand or remember the actual content of the document.
The Plea Agreement has an embedded logical organization, an outline hierarchy supporting its tale of legal fact and remedy. Because DOM/JavaScript isn't bound by the same physical constraints as Paper&Ink, we can use this structure to organize the display, so that the presentation of the document is controlled by the logical shape of its content.
The concept is simple: use the document's outline to access the document's text. Simple, but not without complication. Some categories are built up from collections of sub-categories, introducing a list-of-lists effect. "II. Nature of the Offenses" has two major sub-topics, one of which is itself a list. Any effective navigation scheme has to be simple to the point of invisibility for two reasons: explanatory notes and guides clutter the display; attention given to navigation is attention drawn away from the subject matter. This is a non-trivial problem.
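The list-of-lists problem can be made concrete with a small sketch. It assumes the outline is held as a nested array of nodes and that selecting a topic should show that topic plus everything nested under it. Only the heading "II. Nature of the Offenses" comes from the Plea Agreement as quoted here; every other id and title, and all the function names, are placeholders invented for illustration.

```javascript
// Hypothetical outline: only "II. Nature of the Offenses" is a real
// heading; the other ids and titles are placeholders.
const outline = [
  { id: "I", title: "I. (placeholder heading)" },
  {
    id: "II",
    title: "II. Nature of the Offenses",
    children: [
      { id: "II.A", title: "A. (placeholder sub-topic)" },
      {
        id: "II.B",
        title: "B. (placeholder sub-topic that is itself a list)",
        children: [
          { id: "II.B.1", title: "1. (placeholder item)" },
          { id: "II.B.2", title: "2. (placeholder item)" },
        ],
      },
    ],
  },
];

// Depth-first search returning the chain of nodes from the top level
// down to the node with the given id, or null if the id is unknown.
function findPath(nodes, id) {
  for (const node of nodes) {
    if (node.id === id) return [node];
    const below = findPath(node.children || [], id);
    if (below) return [node, ...below];
  }
  return null;
}

// Collect the ids of a node and everything nested under it: the
// set of sections to display when that topic is selected.
function collectIds(node) {
  const ids = [node.id];
  for (const child of node.children || []) ids.push(...collectIds(child));
  return ids;
}

// Selecting a topic yields a breadcrumb (where am I?) and the list
// of sections to show (what do I read?).
function selectTopic(nodes, id) {
  const path = findPath(nodes, id);
  if (!path) return null;
  return {
    breadcrumb: path.map((n) => n.title),
    visible: collectIds(path[path.length - 1]),
  };
}
```

Because both functions recurse on children, the same few lines handle a top-level topic, a sub-topic, or an item inside a nested list, which is one way to keep the navigation logic out of the viewer's sight.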
Eventually there's a win. 16-20 hours of close attention, mainly tied up in coding the topic selection mechanism, unwinding the complexities of random access to potentially nested navigation elements.
Cons: Topic DOM loses page number information, Duke's initials, and the principals' signatures. The PDF (and presumably the original) had right-justified text margins as well as left.
Pros: Topic DOM adds the outline, with some headings changed slightly, edited mainly for space. The topic-based menu provides semantic instruction and guidance not present in the other display versions. The topic menu is useful when the reader is first getting acquainted with the content; it is useful as a search reminder during return visits to the document.
The display page didn't visually coalesce until the bits of highlighting and color were added. Funny, the monospace typeface of the plea text adds authority, realism. (Conditioned response?)