Interview: Charles Petzold

By David Wall

Known for his books on programming Microsoft Windows and his technical advice in PC Magazine, Charles Petzold has long been a favorite of the programmers among us. His latest book, Code: The Hidden Language of Computer Hardware and Software, breaks from tradition and offers a look at data representation and processing that the average Joe can appreciate. From Morse code to Unicode, Petzold's new book has a lot to say about how human beings communicate with the help of machines.

What was your goal in writing Code? It's a pretty big departure from the books with which you've built your reputation as a programmer and author, the hefty how-to manuals on Microsoft Windows programming. What inspired you to write this book?

Charles Petzold: Code was not written for programmers. It is a unique tour through the digital technologies that make our computers work, starting with Morse code and the telegraph, and I wrote it for people who aren't necessarily computer-savvy.

I first conceived of Code in 1987 while writing the PC Tutor column for PC Magazine. I realized it might be possible to demonstrate how computers work starting from very simple ideas, and for almost a decade I let the book take form in my head before writing a word. Code was the most difficult writing I've ever done. But I wanted to do something more challenging than revising Programming Windows for the nth time. I also wanted to write a book that might have a shelf life of more than two years, and which my family and nonprogrammer friends might want to read.

Some of the best parts of Code are the portions in which you explain how certain common types of coding (such as UPC bar codes and ISO grids on rolls of film) work. Can you decode some other everyday representation of data for us--perhaps those two-dimensional, scannable tags that UPS has been using for a while now, or ISBNs on books?

Petzold: Although digital codes are used in many modern household appliances, they aren't often in clear view. That's why Code is subtitled "The Hidden Language of Computer Hardware and Software." I haven't explored those UPS tags, however. From what I understand, each one is unique, and they include some encrypted data that allows them to be printed locally by the sender of the parcel. What's interesting is that it's very obviously a binary code: you can actually determine the number of bits in the grid, which isn't at all obvious with bar codes. The ISBN bar codes on books are very similar to the UPC bar codes, so perhaps the exercise of decoding them is best left to a reader of Code!

You talk a lot about ASCII code in your book. How is the newer Unicode standard different? How do the bits in a Unicode character break down?

Petzold: Unicode is something I feel very strongly about, and I discuss it at length in Programming Windows as well as in Code. Currently, most digital representations of text use a variation of a 7-bit code called ASCII--the American Standard Code for Information Interchange. ASCII is truly an American standard and was never intended to represent the accented letters used in many European languages, to say nothing of non-Roman alphabets such as Hebrew, Greek, and Cyrillic, or the thousands of ideographs used in Chinese, Japanese, and Korean. Over the years, many extensions to ASCII have been developed to compensate for these deficiencies--unfortunately, all of them mutually incompatible.
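Petzold's point about ASCII's 7-bit limit is easy to demonstrate. A minimal Python sketch (illustrative only, not from the book):

```python
# Plain English text fits within ASCII's 128 code positions...
print("interchange".encode("ascii"))      # b'interchange'

# ...but a single accented letter does not.
try:
    "café".encode("ascii")                # é is code point 233, beyond 127
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```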

Unicode is a single, unambiguous 16-bit text encoding system with the potential of representing 65,536 characters, including all the characters from all the world's written languages that are likely to be used in computer communications. The universal adoption of Unicode is an important step in internationalizing computer use, but that's a big job, because ASCII is probably the most entrenched computer standard of them all.
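Since Python strings are Unicode, this can be inspected directly. A minimal sketch using the standard unicodedata module (code point 3840, hex 0x0F00, begins the Tibetan block; note that Unicode has since outgrown 16 bits, though the scheme of assigning scripts to ranges of code points is unchanged):

```python
import unicodedata

# The first 128 Unicode code points are plain ASCII.
print(ord("A"))                  # 65

# Scripts occupy ranges of code points; 3840 begins the Tibetan block.
ch = chr(3840)
print(unicodedata.name(ch))      # TIBETAN SYLLABLE OM
```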

The individual bits of Unicode don't have any independent meaning. Instead, different alphabets and collections of ideographs are assigned ranges of codes. Codes 3840 through 4025, for example, represent characters used in Tibetan.

Code has a lot to do with representations of data and information. What is your characterization of the difference between data and information?

Petzold: Claude Shannon--the inventor of information theory, and one of the important historical figures who makes an appearance in Code--never made a distinction between information and data. To Shannon (in The Mathematical Theory of Communication, 1949), information represents a differentiation between two or more possibilities, and thus can be conveyed with one or more bits.

Today, however, particularly influenced by books such as Clifford Stoll's Silicon Snake Oil (1995) and David Shenk's Data Smog (1997), we tend to say that data is the raw stuff and information is the processed stuff. Information makes sense of data. Information draws conclusions from data. Information has utility. One of the problems of the mass media and the Internet, such authors say, is that we get too much data and not enough real information.

But I don't find this distinction particularly useful in exploring the ways in which data (or information) is digitally encoded, which is what Code is all about. It doesn't really matter whether the information (or data) is useful or not. And it's really just a matter of perspective. One person's data is another person's information--particularly if it's the person's job to turn data into information for other people.

What about context? This is what XML namespaces are all about. If I'm an admissions officer at a university, a value called "yield" is the percentage of accepted students who decide to come to my university. If I'm an atomic physicist, "yield" is the explosive power of an unregulated nuclear reaction. The words are the same--how do different representation schemes deal with context-dependent differences in meaning?

Petzold: That ambiguity is a potential problem in XML. The more ambiguous XML is, the less useful it will be.

But in general, bits never really tell you anything about themselves. One of the most common bugs in computer programming is called a signed/unsigned mismatch. Often in such cases a negative number is stored in two's-complement format (discussed in chapter 13 of Code), but another part of the program assumes that it's a positive number. There's even a bug like that in chapter 7 of the fifth edition of Programming Windows! Avoiding such bugs is a necessity of programming: anything that reads data needs to know exactly what format was used to store it.

The coding systems you describe--Morse, ASCII, and so on--are great as intermediaries between some kind of machine and a human language, such as English or German. But the human languages still carry all the big ideas. "Four" is a pretty much universal concept, and nearly everyone in the world will recognize the Arabic numeral "4" as representative of that concept. But other universal concepts include "love," "beauty," "shame," and "greed." I mean, look at Michelangelo's "Pietà": clearly it's about grief, but it's not much good for communicating that feeling to someone who can't see the statue. Is there any hope for universal tele-representation of big ideas?
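The signed/unsigned mismatch described above can be reproduced in a few lines. This sketch uses Python's struct module to read the same byte under the two assumed formats:

```python
import struct

raw = bytes([0xFF])                    # the bit pattern 1111 1111

unsigned = struct.unpack("B", raw)[0]  # "B": read as an unsigned byte
signed = struct.unpack("b", raw)[0]    # "b": read as two's complement

print(unsigned)                        # 255
print(signed)                          # -1 -- same bits, different format
```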

Petzold: Morse code and ASCII can represent the word "love" just as well as written language. The bits in a waveform file can represent the word almost as well as spoken language. And the bits in a movie file can represent the word almost as well as face-to-face communication. That we get some meaning from the word is a result of its formal definition and a lifetime of shared experiences.
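For instance, a toy encoder using the standard International Morse table (trimmed here to just the four letters needed) carries the word as readily as ASCII does:

```python
# A fragment of International Morse code--just enough for one word.
MORSE = {"L": ".-..", "O": "---", "V": "...-", "E": "."}

def to_morse(word):
    """Encode a word as space-separated Morse letters."""
    return " ".join(MORSE[letter] for letter in word.upper())

print(to_morse("love"))      # .-.. --- ...- .
```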

If a work of art imparts only emotions such as "grief," the work must be said to have an extremely low signal-to-noise ratio. That's an awful lot of marble to convey an emotion that could be conveyed just as effectively using stick figures. What we appreciate in classical sculpture is more accurately the geometrical form and proportions.

Have you read Bruce Chatwin's The Songlines? The book is about the Aboriginal people of Australia, and it has a lot to say about the representation of information. There is a particular passage that's relevant here. The idea of the passage is that there are many Aboriginal groups scattered around the continent, and they have mutually unintelligible spoken languages. However, all the groups use songs to describe physical journeys. The songs work on several levels. The words describe the landmarks and the terrain, but so do the rhythm and the melody. Chatwin notes that a man from near Darwin, listening to a man from near Townsville sing about his home, can extract information about topography from the song even though he knows none of the words. The songs are systems of encoding information. They're universally understood. Have you any thoughts on this idea?

Petzold: This might surprise Igor Stravinsky, who said that music was incapable of expressing anything except itself. In this particular case, I'd be surprised if the rhythm and melody conveyed more information about topography than might be managed with simple hand gestures. Anything more elaborate would require encoding schemes that would inordinately interfere with the structural rhythm and harmony, perhaps ultimately resembling those John Cage compositions that were based on star maps.

Although the syntax of language seems to be ingrained in our newborn brains, vocabularies are obviously not. In this sense, music is more universal than language, because the vocabulary of rhythm and harmony is rooted in the biology of our bodies--the pulsing of our natural rhythms and our sense of hearing. Harmony in particular can be analyzed as the relationship of frequencies in relatively simple integral ratios. The fact that a perfect fifth is 1.5 times the frequency of the tonic is culturally independent. Thus, different levels of consonance and dissonance are available to convey culturally independent antipodes. For example, a dissonant passage might convey a craggy terrain and a consonant passage a level terrain.
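The arithmetic is checkable in two lines (taking A440 as the tonic purely for illustration; the equal-tempered fifth, included for comparison, lands very close to the pure 3:2 ratio):

```python
tonic = 440.0                            # A4, a common reference pitch
just_fifth = tonic * 3 / 2               # pure 3:2 ratio -> 660.0 Hz
tempered_fifth = tonic * 2 ** (7 / 12)   # equal temperament, ~659.26 Hz

print(just_fifth)
print(round(tempered_fifth, 2))
```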

However, if this is solely the type of information one is drawing from a particular piece of music, it too must be said to have a low signal-to-noise ratio. What is usually much more interesting in music is the way in which the composer is using form and proportions to convey emotions (or topography or whatever) rather than the emotions themselves.

"Maybe" and "kind of" are important concepts to human beings, but they're not well-suited to binary encoding or processing by logic gates. Is there a place for ambiguity in computing? Or is ambiguity, like chaos in natural systems, just order of a sort we don't yet understand?

Petzold: There's a whole field called "fuzzy logic" that attempts to combat the numerical rigidity that bits and gates seem to imply. Readers of Code might be interested in Arturo Sangalli's The Importance of Being Fuzzy (1998) for a good basic introduction to fuzzy logic. It's a topic that would have been discussed in Code had we ("we" meaning Microsoft Press and I) decided we wanted a 500-page book rather than a 400-page book.

In Code, you have a lot to say about computer theory, meaning that you talk about ways of encoding values so that machines can interpret them, mechanisms for processing those values, and systems for sharing those values with human operators. But you don't say much about networks. Does computer theory change at all when you have lots of computers hooked together? Is the network really the computer, as Sun Microsystems says?

Petzold: For purposes of clarity, Code concentrates on pre-networked computers. Surely such computers have a considerable amount of utility. Connecting computers makes possible distributed processing, which is dividing a particular computing task among multiple machines. That certainly complicates traditional computer theory somewhat. But for most people using the Internet, distributed processing is just not very common. For the most part, the transfer of information is barely more sophisticated than accessing a hard drive or a CD-ROM. Unfortunately, the most interactive areas of the Internet are those designed to turn the Web into one giant mail-order catalog.

You also talk a lot about bits, which are of course the means of representing values in digital computers. What about as-yet-theoretical quantum computers, which use quantum bits, or "qubits," to represent not just specific values, but all possible values at once? If a computer can take all possible values and perform calculations on them to yield all possible outcomes, what does that say about information, or about what is real?

Petzold: My main hope in writing Code is that the reader comes away with a really good feeling for what a bit is, and how bits are combined to convey information. That's essential to understanding this digital era that we've built for ourselves. Quantum computing would have been discussed in Code had we decided to make it a longer book, but it's quite a difficult topic. And of course, as you imply, quantum computing is probably nowhere close to becoming an actual product!

I think qubits have major implications for parallel processing, but attempting to extract metaphysical meaning regarding reality leads, I think, to that type of anti-intellectual new-age pothead mysticism that passes for science writing in some circles. Our notions of reality have been altered much less by quantum theory than by the discoveries of Kepler, Copernicus, Galileo, Newton, and Darwin.

Do you have any plans to write further about general programming topics, perhaps about compiler theory? It seems like a logical next step, and something your readers would enjoy.

Petzold: Wherever my career goes at this point, it almost certainly won't involve a book on compiler theory! I'm finding myself more interested in writing books that would be found in the Science & Technology section of the bookstore. These are the books I like to read and the subjects that intrigue me.