I was lucky enough to read Aiden & Michel's original study, "Quantitative Analysis of Culture Using Millions of Digitized Books," when it appeared in Science on 14 January 2011. It was an astonishing piece of scholarship, one of the rare papers that divides an entire branch of human learning into "before" and "after." I felt the hair on the back of my neck rise as I read it. In essence, they mined through the Google Books database to answer concrete questions about linguistics, culture, politics, even topics such as the nature of fame and the pace of propagation of new technologies. It was a tour de force.
The title of this book, "Big Data as a Lens on Human Culture," suggests that it will be a general text on Big Data, but it is not. It covers only this body of work by these two researchers and their assistants.
The book repeats the contents of that 2011 article, explaining the results for the general public, adding some discussion of the origins of the work and the researchers' thoughts about the future. In the process, they expand the original piece, which was about six pages long excluding notes, to about 220 pages. Some of the new material is fun; I got a kick out of the story about a romance novel that had been alphabetized and the information that could still be gleaned from it. Other parts seem like padding; who cares about the history of lexical concordances?
It's a shame that Aiden & Michel wrote this book themselves; the same material coming from a third party would not have seemed so self-congratulatory and, sometimes, smug. Stylistically, the book has some flaws, including an odd 'cutesy' tone and repeated reliance on lousy puns for humor (see for example the discussion of the plague sent by God to punish King David in the Old Testament, and the authors' rather forced question of whether our decisions will similarly "come back to plague us").
Readers who make it to the end of the book may find the last couple of chapters a little disturbing. Aiden & Michel seem to lament that anything goes unrecorded, that anything is forgotten. But forgetting is healthy and can be vital to society. Though they include a few paragraphs on possible abuses of big data, this is clearly an afterthought and I suspect these guys read about NSA databases of e-mail and text messages and thought "if only we could read those too!"
The book has ample footnotes for those who want more detail, and many excellent graphs. I wish they had provided a footnote for the software package they used to generate these; the design is quite nice while remaining unusually clear. Edward Tufte would approve.
In summary, read the original paper if you have access to it. If not, give the book a try. The original can be found in Science 14 January 2011: Vol. 331 no. 6014 pp. 176-182.
Uncharted: Big Data as a Lens on Human Culture is a fun look at a pretty amazing research project. Starting as graduate students, authors Erez Aiden and Jean-Baptiste Michel wanted to use big data to answer interesting questions. What started out as a simple research question ended up jump-starting the authors' careers and an entirely new way to look at big data.
They came up with an idea to make a tool that could query Google's digitized library in order to determine word frequencies. Using the tool they invented, called the Google Ngram Viewer, they have been able to answer interesting questions that relate to word frequencies, explore how language changes over time, assess the adoption of new technology, assess fame, and conjecture as to how the answers to the questions they pose reflect on the prevailing culture.
Although the idea is simple in concept, it wasn't so simple in execution. They had to wiggle their way into the Googleverse to get permission to use the database, write a lot of code, and iron out certain legal/copyright problems. But once all this was done, the magic began.
I won't go into detail about their findings, but suffice it to say, they not only created the Ngram Viewer but used it intelligently to come to some very interesting (and often humorous) conclusions. Their analogy of the Ngram Viewer as a modern equivalent of Galileo's telescope is an apt one. Without the telescope, Galileo couldn't have made some of his most important astronomical observations. Without the Ngram Viewer, it would be all but impossible to look at things like the transformation of irregular verbs over time or to get a good idea of when writers really started to refer to The United States in the singular (the results are surprising).
However, like Galileo's telescope, the Ngram Viewer is still a somewhat primitive tool. First, it is limited by the number of books that are available in Google's digital database. (Google is trying to digitize every book in existence, but it still has a ways to go.) Second, the authors limited themselves to books only--they did not look at other printed media, such as digitized periodicals or newspapers. Third, the database is limited to the printed word and does not include usage in other media. (A fourth limitation that the authors did not mention is that Google is only trying to digitize one edition of each book. Therefore the database doesn't account for a book's popularity or circulation. Obscure scholarly books would therefore be weighted as equal with popular novels, which introduces a certain bias as well. Moreover, there are books that are quite popular that people buy and read voraciously but that have little social impact beyond a short period of time (e.g. Fifty Shades of Grey), extraordinarily popular books that lots of people may have but never read (Gödel, Escher, Bach), and books that come out in modern editions but are written in an archaic style (e.g. The Bible).)
So there are limitations. Still, if archeologists can garner incredible insights about the past by looking at the contents of ancient waste dumps, even with limitations there is a wealth of information that the Ngram Viewer can tap into that is there for the taking.
Aiden and Michel write with a great amount of scholarship, humility, and humor. The book was easy and quick to read but insightful as well. They do spend a fair amount of time on the trials and tribulations of how they developed the Ngram Viewer. This history of the Ngram Viewer takes up a fair amount of copy and is interesting to read. However, the insights that the authors are able to obtain by actually using the Ngram Viewer are far more interesting, and I would have been more than happy to read about more of them. If there is one major downside to the book, it is that the authors got me turned on to the Ngram Viewer, which is majorly addictive and can consume a lot of your time. (Once you start, it's hard to stop. Trust me, it is more of a time sink than Twitter or Facebook.)
All in all, an insightful, engaging, interesting, and entertaining read. Highly recommended.
This book began as a scientific paper, and I think perhaps it should have stayed in that shorter form. The authors try to spin their ngram research to book length and it stretches somewhat thin. The beginning of the book is promising - the depiction of irregular verbs becoming regularized over time is interesting to anyone who looks forward to the word of the year, and I confess to being fascinated at what can be revealed by alphabetizing the words in a novel. Past that point, however, vignettes and anecdotes became more disjointed, and I'm not a fan of the authors' style of humor. Frankly, I enjoyed pondering the ngram graphs in the back of the book (the occurrences of "turnip" vs "tomato" over time, "slavery is" vs "slavery was", "werewolf" vs "zombie") more than the majority of the text.
on January 30, 2014
Two young research scientists from Harvard University, Erez Aiden and Jean-Baptiste Michel, teamed up with Google in 2010 to create the Ngram Viewer. It sifts through millions of digitized books and charts the frequency with which words have been used. On the day that the Ngram Viewer debuted, more than one million queries were run through it. Some consider it to be at the center of a major revolution.
In an interview with Studio 360's Kurt Andersen, Aiden and Michel said how pleased they are that the new technology can open up academic research to the "independently curious."
"It's good that a tool that's at the leading edge of science can generate so much enthusiasm in the general public." Michel cautions, however, "it's inevitable that a tool like that will generate a large number of discussions that are actually irrelevant or that are flat-out wrong . . . it's still important that bona fide experts are the ones interpreting the research."
In their new book Uncharted: Big Data as a Lens on Human Culture, however, they are nowhere near so humble about the so-called "big data revolution," nor are they convinced about the value of "bona fide experts."
"At its core, this big data revolution is about how humans create and preserve a historical record of their activities. Its consequences will transform how we look at ourselves. It will enable the creation of new scopes that make it possible for our society to more effectively probe its own nature. Big data is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower." 
Well, if for whatever reason this is going to be a contest between capital and academia, or academics versus the "independently curious," then let's hear first from the so-called "ivory tower." The following passage is from Simon Schama's introduction to his The Embarrassment of Riches: An Interpretation of Dutch Culture in the Golden Age:
". . . there is nothing especially daring about a working definition of culture drawn from social anthropology. I follow the kind of characterization offered by Mary Douglas of cultural bias as "an array of beliefs locked together into relational patterns." In the same essay, however, she cautions that for those beliefs to be considered the matrix of a culture, they should be treated as part of the [social] action and not separated from it." I have tried to follow this rather Durkheimian command in what is, essentially, a descriptive enterprise that emphasizes social process rather than social structure, habits rather than intuitions. Acting upon one another, beliefs and customs together form what Emile Durkheim called "a determinate system that has its own life: . . . the collective or common conscience . . . it is by definition diffuse in every reach of society. Nevertheless it has specific conditions that make it a distinct reality."
Now, let's hear from the big data revolutionaries:
"Consider the following question: Which would help you more if your quest was to learn about contemporary human society--unfettered access to a leading university's department of sociology, packed with experts on how societies function, or unfettered access to Facebook, a company whose goal is to help mediate human social relationships online?"
"On the one hand, the members of the sociology faculty benefit from brilliant insights culled from many lifetimes dedicated to learning and study.
"On the other hand, Facebook is part of the day-to-day social lives of a billion people. It knows where they live and work, where they play and with whom, what they like, when they get sick, and what they talk about with their friends. So the answer to our question may very well be Facebook. And if it isn't--yet--then what about a world twenty years down the line, when Facebook or some other site like it stores ten thousand times as much information, about every single person on the planet?" 
Aside from the vague and uninformed illogicality that pervades Uncharted, I am particularly struck by the air of self-congratulatory triumph that permeates the entire book, suggesting that big data has already won--hands down.
Why are so many enthralled by this stuff? All I can say is, "In the land of the blind, the one-eyed man is king."
 from Studio 360, Public Radio International, broadcast August 9, 2013.
 Aiden, Erez; Michel, Jean-Baptiste (2013-12-26). Uncharted: Big Data as a Lens on Human Culture (Kindle Locations 133-137). Penguin Group US. Kindle Edition.
 Simon Schama. The Embarrassment of Riches: An Interpretation of Dutch Culture in the Golden Age. New York: Random House, 1987, p. 9.
 Aiden, Erez; Michel, Jean-Baptiste (2013-12-26). Uncharted: Big Data as a Lens on Human Culture (Kindle Locations 185-189). Penguin Group US. Kindle Edition.
This is essentially an interactive book, because as you read, you can go to the Ngram site and try out your own queries (or try the link in my comment below). This book is a lot of fun to read but is also quite interesting, though it could be better (below).
The Google Books Ngram Viewer has access to the words and phrases used in a significant chunk of Google's digitized book archive. By using it, you can ask questions about things such as the changing use of language, the rise and fall in popularity of ideas or celebrities, and so forth. For example, I found that early uses of the phrase "rock and roll" were in accounts of traveling in rickety wagons, and in sailor songs. It is a great way of viewing social and cultural history through the lens of big data.
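To make the idea of an "n-gram" concrete: it is just a run of n consecutive words. Here is a minimal Python sketch of counting them in a scrap of text. The function name `ngrams` and the sample sentence are my own inventions for illustration, and I am not suggesting this is how Google's actual pipeline works.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical sample sentence, loosely in the spirit of the wagon accounts.
sample = "the wagon would rock and roll and rock and roll"
counts = Counter(ngrams(sample, 3))
print(counts["rock and roll"])  # prints 2
```

Run over millions of dated books instead of one sentence, counts like these are presumably the raw material behind the charts.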
This book consists of some of the interesting ideas the researchers uncovered, alternating with the story of how the Ngram Viewer came about and the issues they dealt with in doing so, such as privacy, the copyrights of the authors, access to the archive of information, and so forth. Along the way they utilize a number of interesting episodes of history, e.g. Helen Keller's open letter to the German people in the 1930s. Or, they analyze the corpus to show that there are over a million words used in the English language, but the Oxford English Dictionary has only about half a million of them. It is instructive to see how they wring information out of this data.
Also, they discuss some of the foibles of the data. For example, one of the most mentioned people is an academic whom no one has heard of -- this is because published books are skewed towards academic content. It closes with a brief discussion of how access to big data changes the questions we can ask and what is knowable. In the appendix they show charts of additional comparisons, but at that point, it is really more interesting to go to the site and input your own queries.
For an example of the kind of results it produces, and how easy it is to use, look at the link I've put in the comment section below, where mentions of different book review publications over time are charted, along with mentions of Amazon.
A glaring omission is the lack of any information about how to do advanced searches, such as constraining words by part of speech, using wildcards, or what "smoothing" means. However, googling for "ngram advanced" will lead you to the online documentation. Additionally, I wouldn't have minded some more technical information about how things were implemented. Also, at points, the book gets a bit diary-like, and could use some tightening up.
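As far as I can tell from the online documentation, "smoothing" is a moving average: with a smoothing of n, each year's plotted value is the average of that year and up to n years on either side. Here is a rough Python sketch under that assumption. The function name `smooth` and the sample numbers are hypothetical, and the real tool may treat the ends of the date range differently.

```python
def smooth(values, n):
    """Average each value with up to n neighbours on either side.

    Windows are simply truncated at the ends of the series (an assumption).
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - n)
        hi = min(len(values), i + n + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Made-up yearly frequencies; a smoothing of 0 would return them unchanged.
print(smooth([1.0, 2.0, 6.0], 1))  # prints [1.5, 3.0, 4.0]
```

The point of smoothing is that a one-year spike gets spread across its neighbours, so the long-term trend is easier to see.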
However, setting tools like this loose for anyone to use is a game-changer, and thus for those interested, it is a five star book. Try it yourself!
Erez Aiden and Jean-Baptiste Michel are interested in word and phrase frequency and what it can reveal about history and culture. They illustrate their approach with a timeline graph of the phrases "The United States are" and "the United States is." We are unsurprised to see the "is" phrase increase in frequency after the Civil War, as the "are" phrase fades from view. This example supports our intuitions about allegiance to the Union supplanting allegiance to one's home state. It also builds our confidence in their historical profiling method for those other times when it finds a counterintuitive result.
The authors are confident in the value of historical word frequency analysis. "Big data is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower." They begin searching for larger and larger collections of text to analyze. They eventually wind up in the office of Peter Norvig, Google's Director of Research. They convince him to grant them access to Google Books, a tremendous digital library containing more books than have ever before been collected online. Not only do Aiden and Michel spend several years conducting historical-linguistic research, but they also author a tool (available at books.google.com/ngrams) that allows everyone else to do the same kind of studies.
Their book outlines how word and phrase frequency can be used to learn about cultural and historical change. It tells the story of Google Books and how the authors began to use this collection of digitized documents in their research. And it provides examples of interesting trends they have brought to light. Examples include:
- Tracing the relative "fame" of Neil Armstrong and Buzz Aldrin following the 1969 moon landing.
- Illustrating the effect of official persecution by tracing references to banned European authors before, during, and after World War II.
- Illustrating, by the same approach, the effect of Hollywood blacklisting during the McCarthy era.
- The effects of "flashbulb" events such as the sinking of the Lusitania in 1915, the Japanese attack on Pearl Harbor in 1941, and the 1972 Watergate scandal.
- Graphs of the relative popularity of various world population centers (cities).
- The explosive increase in use of George Carlin's "seven words you can't say on television."
The book introduces some of the techniques of text analysis and "big data" in an accessible way. However, it is lighter on methodological detail than I would have liked. Having stimulated my interest, the authors might have done more to teach me how to do their kind of trend analysis. I have to forgive them because of the extensive and readable Notes section at the end of the book. There is a lot of information here that I am still digesting. Slowly, I am learning more about their methods.
This book is worth reading, particularly if you are interested in history, culture, and language. Be sure to check out the authors' online ngram tool, too. It's worth spending some time with.
on March 6, 2014
Since my college days I have been employed more or less full time making computers do what other people wanted to accomplish: first researchers in academic institutions, then a computer timesharing vendor, then a software and services company, and finally modern "groupware" clients. Along the way I have tracked the emergence of "Big Data" and of automated search refinement algorithms. Back in the 1990s, I worked in the "skunk works" of a fading software development and services company that had pioneered "touch screen" and automated text-based search and retrieval options for businesses. In those days we did not have vast, inexpensive server farms full of data to mine, so one of my personal projects was what I called "content based garbage cleaning," which used stored search statements to automatically "throw away" dated material in a data repository, thereby keeping its size and search speed manageable for the equipment we had available. The kind of work these Harvard social scientists did as a demonstration of what is now possible fascinates me. The one "issue" I have with their approach is a rather esoteric but essential one: "sampling bias." The universe of text data that was made available and analyzed suffers from a whole range of possible selection biases. Their interpretations cannot speak authoritatively to the differences in "living" cultures, where individuals now use some of the same tools to selectively filter and focus what their culture contains on a day-to-day basis; they can only hint at what can be learned from repositories of digitized English-language books.
on January 19, 2014
This book is a lengthy (and verbose) version of an article published in Science three years ago. Its style is often annoying. The authors suffer from the "Freakonomics syndrome": they feel compelled to be simultaneously clever and funny, and (as expected) they overdo it. Besides, many of the examples used are idiosyncratic and unconvincing. But all in all I enjoyed the book and learnt a few things from reading it. The authors are to be congratulated for creating that marvelous toy: the Ngram Viewer.
When we think of big data -- and in my field we often do -- we think of Retail Link or Google Analytics or Nielsen's research. All of these examples have the size and speed of data input that makes big data big, but these, and most other real world examples, are about current behavior of people, often in the context of marketing and purchasing behavior. This book deals with a very different set of data and a very different set of questions.
Google has created searchable versions of millions of books from the past, and the authors got access to them and developed a tool that allowed them to use this enormous dataset to gain insights into history. What's especially wonderful about this is that you can use this tool, too: [...]
The book tells the story of the scientists' journey to creating this tool, which is quite a good story. It also gives examples of the kinds of conclusions one can draw from the information the tool harnesses:
* Exactly when have changes in language taken place?
* What effects do political suppression of art and science have on the availability of information?
* How quickly do cultures learn about new technology, and is that changing?
These are fascinating questions, and the evidence that they can now be answered more accurately than ever before is impressive. However, the most compelling thing about the book is that these are merely examples of what this new big data source could do for our understanding of history and culture. The book ends with a collection of examples of graphs generated by the Ngram viewer, and they should certainly inspire social scientists, linguists, teachers, and others concerned with history, culture, and language to imagine many more applications.
The book is written in a playful style which might make it more accessible to readers, but which also might obscure the importance of the work for some readers. In the interests of full disclosure, I should admit that my degrees are in Linguistics, I teach literature, and I work in technology, so I may be the perfect audience for this book. However, I think many readers will find the story engaging, the examples eye-opening, and the prospect of using the Ngram viewer exciting.
on October 28, 2013
Aiden & Michel focus primarily on the pre-database 'science', then on the story and algorithm development surrounding the Google Books project database. Google Books is the largest dataset library of 'words' on the planet. What might this dataset reveal? The authors had a role in Google Books and the Ngram algorithm toolset, and they offer entertaining observations on big data's potential. Uncharted includes an introductory history of the early 20th century science of word and language. Their history includes interesting and significant obscurities like Zipf's Law and the Haney index, among others.
Big data is described as a new kind of science. How it differs is not easily understood until you look very closely at it. First, classical science observes nature, asks questions and designs experiments to reveal repetition. Second, the scientific method begins with a hypothesis and proceeds to theory and then proof. Unfortunately, societal trends defy predictive methods. The notion of a 'word', expression, or concept, and its application over time, doesn't conform to scientific, quantitative, normative notions. So, the 'science' of big data behavior, independent of fully understanding underlying causalities and correlations, will be hotly debated. The Google Books library can tease notional theory from the current and exponentially growing body of data; humankind digitizes about 5 zettabytes of data per year at this moment.
The authors use their experience from Ngram development to provide simple 2D graphical images. The basic picture is of the frequency of a notion's occurrence per billion words, or sets of words, over time.
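A minimal sketch of that calculation in Python, assuming a toy corpus of (year, text) pairs. The function name `frequency_per_billion` and the sample data are made up for illustration; the book does not describe the authors' real pipeline at this level of detail.

```python
from collections import defaultdict


def frequency_per_billion(books, term):
    """Occurrences of term per billion words, grouped by publication year."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in books:
        words = text.lower().split()
        totals[year] += len(words)
        hits[year] += words.count(term.lower())
    return {
        year: hits[year] * 1_000_000_000 / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }


# Hypothetical three-"book" corpus.
books = [
    (1900, "the vampire saw the zombie"),
    (1900, "the end"),
    (2000, "zombie zombie vampire now"),
]
print(frequency_per_billion(books, "zombie"))
```

Dividing by the total number of words published in each year matters because far more books are printed now than in 1800; raw counts alone would make nearly every word look as if it were on the rise.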
So, as an example ... and it's just about Halloween, I can test for myself what I learned in "Uncharted" ... I can ask, "Is the world talking more about 'vampires' or 'zombies'?" I just did an Ngram and can empirically state that vampires have been and continue to be the more discussed of the two societal threats. Further, vampires have increased to about a 4:1 lead over zombies in the database as of 2008. I can also see that "zombies" first appeared in the 1940s; "vampires" seem to have been around for ages or at least since before 1800. "Vampire" occurrences are rising exponentially; zombies arithmetically. We will have to wait to see if "World War Z" will affect the published world's threat assessment of zombies.
The authors expand the notion to real world considerations. Several specific 'trend sets' were very interesting to this reviewer and they've consumed some post-read thinking time. There is very much something here in the Uncharted big data world. It can be seen that after more than 200 years we seem to have converged on the "chicken" vs "egg" and that's pretty big in itself.
The authors are interesting young gents who inject astute, off-the-wall commentary and deliver some very clever notions but ... the book reads as if novice writers are behind the word processor. If the authors' purpose is the novice style, they succeeded. Some might find the lilting tech style off-putting. The writing style aside, it's rare that I can read a book out of curiosity that delivers practical, ongoing self-entertainment skills. The fertilizer-for-the-brain effect of the appendix charts is worth the price of the book. You will need to decide for yourself whether you appreciate the unusual writing style.