Information and Psychiatry: Digitizing the World

August 1, 1998

A traditional medical chart is a varied collection of printed and handwritten notes, forms, letters and laboratory reports. As the chart grows, it becomes more and more difficult to keep track of the information it contains.

How can that information be managed? One solution is to continuously index all the information in the chart. Another approach is to type all the important information into a standard word processing document or a database. Both approaches are impractical because they would take an inordinate amount of time. There is another solution: All the forms, graphs, reports and notes in the medical record can be converted into electronic images and stored on a computer hard drive. The key to this approach is a process called digitization.

Fifty years ago, during the misty beginnings of the information age, the conceptual design of the modern computer was at a crossroads. Two competing systems, analog and digital, vied with one another for dominance. The analog system models the world on a continuous scale consistent with the infinite gradations in physical properties and behavior seen in nature. Numbers, for example, are represented by different levels of electric current, so that a circuit containing the number 10 holds twice as much current as the one containing the number 5. The main advantage of the analog system is that it can theoretically reproduce even the tiniest variation in nature by minutely raising or lowering the current.

The digital model, in contrast, literally breaks nature's continuous variation into bits. Each bit has two possible values: 0 (off) or 1 (on). Several bits can be grouped together to represent higher values. Eight bits, for example, form a byte that can have as many as 256 different values (2^8 = 256).
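As a small illustration of how grouped bits yield more values, here is a minimal Python sketch. The function names `distinct_values` and `bits_to_int` are invented for this example and are not part of any standard.

```python
def distinct_values(bit_count):
    """Number of different values that a group of bit_count bits can hold."""
    return 2 ** bit_count


def bits_to_int(bits):
    """Read a string of 0s and 1s (for example "00000101") as a whole number."""
    return int(bits, 2)


# Each added bit doubles the number of values the group can represent.
for n in (1, 2, 4, 8):
    print(n, "bits ->", distinct_values(n), "values")

# One byte (eight bits) holding the numbers 5 and 255.
print(bits_to_int("00000101"))
print(bits_to_int("11111111"))
```

The largest value in a byte is 255 rather than 256 because counting starts at zero.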

Using this system, the digital computer roughly simulates the infinite range of values represented in an analog system. There is, however, one important difference. The precision of an analog computer is limited by its capacity to detect small differences in current levels. Digital systems prevail today because they have no such limitation. The precision of digital information can be infinitely improved by increasing the number of bits used to represent a number.
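The effect of adding bits can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the analog quantity is taken to lie between 0 and 1, the digital levels are evenly spaced, and the names `quantize` and `worst_case_error` are hypothetical.

```python
def quantize(value, bit_count):
    """Round a value between 0 and 1 to the nearest of 2**bit_count evenly spaced levels."""
    levels = 2 ** bit_count
    step = 1.0 / (levels - 1)
    return round(value / step) * step


def worst_case_error(bit_count):
    """Largest possible rounding error: half the distance between neighboring levels."""
    return 0.5 / (2 ** bit_count - 1)


reading = 0.3  # an arbitrary "analog" value chosen for the demonstration
for bits in (2, 4, 8, 16):
    approx = quantize(reading, bits)
    print(bits, "bits:", approx, "error at most", worst_case_error(bits))
```

Each additional bit roughly halves the worst-case error, which is the sense in which digital precision can be improved by using more bits.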

The dominance of the digital model has had an enormous impact on our contemporary world view. In one sense, it is an expression of Western science's reductionism, its endless drive to control nature by breaking it into easily digestible pieces. The struggle over analog and digital computers is analogous to another familiar conceptual struggle, that between the art and science of medicine: our subjective sense of a patient's illness and the objective measurements of medical science. Physicians move back and forth between these two realms, and in the process they digitize the world, breaking continuous variation into sections by categorization. Most of our work in medicine is an attempt to translate the continuous variation between health and disease into a digital yes or no decision.

The digital revolution is based on the discovery that all forms of information (visual images, sounds, symbols and ideas) can be easily represented, stored, transmitted and manipulated using the same underlying data system composed of a defined sequence of bits. The central doctrine of digitization is that given a large enough scale of discrete values, we can mimic the continuous nature of the world to such an extent that most observers will not be able to tell the difference between the representation and reality.

All computer files are digital, but there are two different types of digital document files. In the first, a standard text file, each individual character in the text is represented by a single byte in the computer. The 256 values available to every byte are more than enough to stand for all the letters, numbers (0-9), punctuation and symbols in the English language. Since an average double-spaced, typewritten page contains approximately 2,000 characters, it can be stored in a computer file containing an equal number of bytes.
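The one-byte-per-character relationship can be checked directly. The sketch below assumes the ASCII code table, which uses only the first 128 of a byte's 256 possible values; the function name `text_to_bytes` and the sample text are made up for the example.

```python
def text_to_bytes(text):
    """Encode plain English text as one byte per character (ASCII assumed)."""
    return text.encode("ascii")


sample = "Patient: J. Doe"  # invented sample text
encoded = text_to_bytes(sample)

# Each character becomes a single number between 0 and 255.
print(list(encoded))
print(len(sample), "characters ->", len(encoded), "bytes")

# A page of about 2,000 characters needs about 2,000 bytes.
page = "x" * 2000
print(len(text_to_bytes(page)), "bytes for a 2,000-character page")
```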

Most word processors use standard text files and add information to the file to specify formatting such as typeface, tab settings and page margins. Every computer has a code table in its memory that quickly translates each of the 256 values into a standard pattern of many dots that are displayed on the computer screen or printed on paper to reproduce the character. The process is quite economical because only the symbol for each character has to be stored in the file.

The second type of file is an image text file. As its name implies, an image text file stores a picture of the page. There is no direct coded relationship between the individual characters in the text and the bits or bytes in the file.

In its simplest form, each bit represents either a white or black dot. The collection of bits in the file has a one-to-one correlation with the vast number of dots used to construct the image of the document on a computer screen or printer. Since even a small image is made up of a large number of dots, image files are usually very large, often 25 to 50 times larger than the corresponding standard text file. The enormous size of image text files means that they take up a great deal of space on a computer's hard drive and require far more time to manipulate than the comparable standard text file. Furthermore, the size of an image text file increases by the square of its resolution. Doubling the resolution of the image quadruples the size of the resulting image file.
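The square relationship between resolution and file size can be worked through with simple arithmetic. The sketch below assumes an uncompressed black-and-white image with one bit per dot and an 8.5 by 11 inch page; the function name `bitmap_size_bytes` and the chosen resolutions are illustrative only. Real image formats usually compress the data, so actual file sizes will differ from these raw figures.

```python
def bitmap_size_bytes(width_inches, height_inches, dots_per_inch):
    """Size of an uncompressed one-bit-per-dot image, in bytes."""
    width_dots = round(width_inches * dots_per_inch)
    height_dots = round(height_inches * dots_per_inch)
    total_bits = width_dots * height_dots
    return (total_bits + 7) // 8  # eight dots fit in each byte, rounded up


low = bitmap_size_bytes(8.5, 11, 100)
high = bitmap_size_bytes(8.5, 11, 200)

print("100 dots per inch:", low, "bytes")
print("200 dots per inch:", high, "bytes")
print("ratio:", high / low)
```

Doubling the resolution doubles the number of dots in both directions, so the total number of dots, and therefore bytes, is multiplied by four.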

This requirement for enormous processing power is the reason document imaging was not feasible on personal computers until relatively recently. It is also why ever more powerful computers are crucial: each new generation allows us to create higher-resolution, more accurate digital models of the world.

You may be tempted to ask why anyone would use an image text file, with its limitations, instead of a standard text file. The answer is that the disadvantages are outweighed by one major advantage I presented at the beginning of this article. Document imaging allows us to quickly store information in a computer without worrying about the format of that information and without having to reenter the data one character at a time.

The actual process of converting a document into an image file is simple. High-quality image scanners that plug directly into a personal computer can be purchased, with the necessary imaging software, for less than $250. Once installed, they take less than a minute to convert a document into an image file.

There is one complication. Although the process allows a user to convert any document into an image file and reproduce it with a high degree of fidelity, it makes it much more difficult to retrieve the document. Suppose you want to locate one document among several thousand stored as standard text files on a computer disk. You can write a program to search the bytes in every file looking for a specific patient's name or some other unique identifying feature. You can do this because the file's contents are easily accessible. The bytes in a standard text file intrinsically convey meaning. There is a one-to-one correspondence between each byte and a specific character.
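A search program of the kind described can be quite short. The following Python sketch is one possible approach, not a prescribed method: it assumes the documents are plain text files ending in `.txt` in a single folder, and the function names, file names and patient names are all invented for the example.

```python
import tempfile
from pathlib import Path


def find_files_containing(folder, phrase):
    """Return the text files in folder whose contents include phrase (ignoring case)."""
    needle = phrase.lower()
    matches = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            matches.append(path)
    return matches


def demo():
    """Create two made-up notes in a temporary folder and search them."""
    with tempfile.TemporaryDirectory() as folder:
        Path(folder, "note_a.txt").write_text("Progress note for Jane Doe.", encoding="utf-8")
        Path(folder, "note_b.txt").write_text("Lab report for John Smith.", encoding="utf-8")
        return [p.name for p in find_files_containing(folder, "john smith")]


print(demo())
```

The search works only because every byte in these files stands for a character, so the program can compare the stored text with the name directly.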

An image text file is different. Its bits and bytes are intrinsically meaningless. They simply record patterns of black and white dots. If, by chance, the document is slightly tilted when it is scanned, the resulting bytes in the image file will be completely different from those produced if it were held straight up and down during the scanning process. The content of the document, however, obviously remains the same! This can be a significant problem. It's nice to be able to store any document in a computer, but we would like to be able to retrieve it quickly without having to view every document in the collection.
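A toy example shows how fragile the stored bytes are. The sketch below uses a one-dot sideways shift as a simpler stand-in for a tilted page, and a single 16-dot row as a stand-in for a whole image; the function name `pack_row` and the dot patterns are hypothetical.

```python
def pack_row(dots):
    """Pack a string of '0' (white) and '1' (black) dots into bytes, eight dots per byte."""
    padded_length = (len(dots) + 7) // 8 * 8
    padded = dots.ljust(padded_length, "0")
    return bytes(int(padded[i:i + 8], 2) for i in range(0, padded_length, 8))


straight = "0011110000000000"  # a short black stroke on a white row
shifted = "0001111000000000"   # the same stroke moved one dot to the right

print(list(pack_row(straight)))
print(list(pack_row(shifted)))
print("same bytes?", pack_row(straight) == pack_row(shifted))
```

The stroke looks identical to a reader in both rows, yet the stored bytes no longer match, so a byte-for-byte search cannot recognize them as the same content.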

The retrieval problem can be solved in two ways. The simplest solution is to create a list of keywords or phrases that describe the contents of each image text file. When a specific document has to be retrieved, the list is searched for the relevant keywords and these, in turn, lead to the appropriate file. This works well so long as the keywords are chosen intelligently and accurately reflect the contents of the document.
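The keyword list can be pictured as a simple lookup table from each keyword to the image files it describes. The Python sketch below is a minimal illustration; the file names, keywords and function names are all invented, and a real system would need to store the list permanently, for example in a database.

```python
from collections import defaultdict


def build_keyword_index(catalog):
    """Turn {image file: [keywords]} into {keyword: set of image files}."""
    index = defaultdict(set)
    for filename, keywords in catalog.items():
        for keyword in keywords:
            index[keyword.lower()].add(filename)
    return dict(index)


def lookup(index, *keywords):
    """Return the image files tagged with every one of the given keywords."""
    if not keywords:
        return set()
    file_sets = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*file_sets)


# Hypothetical catalog entered by whoever scans the documents.
catalog = {
    "scan_0001.tif": ["Smith", "discharge summary", "1997"],
    "scan_0002.tif": ["Smith", "EKG", "1998"],
    "scan_0003.tif": ["Jones", "EKG", "1998"],
}

index = build_keyword_index(catalog)
print(sorted(lookup(index, "smith", "ekg")))
```

The search never opens the image files themselves, which is why its usefulness depends entirely on how well the keywords were chosen.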

The second solution is more complex. The actual text information in an image text file can be extracted by a process called optical character recognition (OCR). It essentially reverses the initial step of converting the document into an image. The process is slow, prone to error and ironically requires more computer power than the original digitization. The resulting text file cannot be considered a valid translation of the document until it has been checked against the original. It may, however, be useful as an ancillary search file that can be used as a guide to locating the correct image text file.

When do you use a standard text file versus an image text file? The standard text file is obviously simpler and cleaner to use. Yet, in many situations, the actual format of the document is as important as the information it carries. In a legal dispute, for example, the image of a receipt, a sequence of handwritten notes, or a unique document is more convincing than a simple report of the contained information. Furthermore, image text files are not limited to text. They can hold pictures as well, including electrocardiograms (EKG), CAT scans, nuclear magnetic resonance (NMR) scans, and annotated photographs of skin lesions and operative procedures.

It is tempting, although naive, to think that document imaging will become unnecessary when we have developed a fully electronic medical record. Handwritten notes and irregular pieces of paper will probably continue to be a part of patient records for many years. It is more likely that the process of document imaging and OCR will become far more efficient, accurate, and pervasive as computers become more powerful. This should ensure that both formats will survive in medicine even when we have developed a fully electronic medical record.