A bag but is language nothing of words

From Mondothèque

Revision as of 00:13, 15 December 2015 by Michael Murtaugh (talk | contribs) (Lieber's Standard Telegraphic Code Book (1896))

In text indexing and other machine reading applications the term "bag of words" is frequently used to underscore how processing algorithms often represent text using a data structure (word histograms or weighted vectors) where the original order of the words in sentence form is stripped away. While "bag of words" might well serve as a cautionary reminder to programmers of the essential violence perpetrated on a text and a call to critically question the efficacy of methods based on subsequent transformations, the expression's use seems in practice more like a badge of pride or a schoolyard taunt that would go: Hey language: you're nothin' but a big BAG-OF-WORDS.
Michael Murtaugh


We must bring together a collection of machines which simultaneously or sequentially can perform the following operations: (1) The transformation of sound into writing; (2) The reproduction of this writing in as many copies as are useful; (3) The creation of documents in such a way that each item of information has its own identity and, in its relationships with those items comprising any collection, can be retrieved as necessary; (4) A Classification number assigned to each item of information; the perforation of documents correlated with these numbers; (5) Automatic classification and filing of documents; (6) Automatic retrieval of documents for consultation and presented either direct to the enquirer or via machine enabling written additions to be made to them; (7) Mechanical manipulation at will of all the listed items of information in order to obtain new combinations of facts, new relationships of ideas, and new operations carried out with the help of numbers. The technology fulfilling these seven requirements would indeed be a mechanical, collective brain. [1]

Bag of words

In text indexing and other machine reading applications (such as Google's core business of search) the term "bag of words" is frequently used to underscore how processing algorithms often represent text using a reductive data structure where the original order of the words in sentence form is stripped away. In a recent blog post (http://michaelerasm.us/tf-idf-in-10-minutes/), Michael Erasmus explains the technique in the context of "tf-idf":

First, let's just define what I mean with document. For our purposes, a document can be thought of all the words in a piece of text, broken down by how frequently each word appears in the text.

Say for example, you had a very simple document such as this quote:

Just the fact that some geniuses were laughed at does not imply that all who are laughed at are geniuses. They laughed at Columbus, they laughed at Fulton, they laughed at the Wright brothers. But they also laughed at Bozo the Clown - Carl Sagan

This structure is also often referred to as a Bag of Words. Although we care about how many times a word appear in a document, we ignore the order in which words appear.
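The structure described in the quoted passage can be sketched in a few lines of Python. This is only an illustration, not code from Erasmus's post: the function name bag_of_words and the tokenization rule (lowercase everything, treat runs of letters as words) are assumptions made here, and the attribution to Carl Sagan is left out of the counted text.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Assumed tokenization rule: a word is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


quote = (
    "Just the fact that some geniuses were laughed at does not imply "
    "that all who are laughed at are geniuses. They laughed at Columbus, "
    "they laughed at Fulton, they laughed at the Wright brothers. "
    "But they also laughed at Bozo the Clown"
)

bag = bag_of_words(quote)

# Reversing the word order destroys the sentences but leaves the bag untouched.
reversed_quote = " ".join(reversed(quote.split()))
same_bag = bag_of_words(reversed_quote) == bag

print(bag.most_common(3))
print(same_bag)
```

In this sketch "laughed" and "at" each count six times and "they" four, and the reversed, unreadable text produces exactly the same bag as the original: the order really is thrown away.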

While "bag of words" might well serve as a cautionary reminder to programmers of the essential violence perpetrated on a text and a call to critically question the efficacy of methods based on subsequent transformations, the expression's use seems in practice more like a badge of pride or a schoolyard taunt that would go: Hey language: you're nothin' but a big BAG-OF-WORDS. In this way BOW celebrates the apparently perfunctory step of "breaking" a text into a purer form amenable to computation, of stripping language of its silly redundant repetitions and foolishly contrived stylistic phrasings to reveal its cleaner inner essence.

Lieber's Standard Telegraphic Code Book (1896)

Katherine Hayles devotes a chapter to telegraph code books: [2]

[...] my focus in this chapter is on the inscription technology that grew parasitically alongside the monopolistic pricing strategies of telegraph companies: telegraph code books. Constructed under the bywords “economy,” “secrecy,” and “simplicity,” telegraph code books matched phrases and words with code letters or numbers. The idea was to use a single code word instead of an entire phrase, thus saving money by serving as an information compression technology. Generally economy won out over secrecy, but in specialized cases, secrecy was also important.
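The compression Hayles describes, one code word standing in for a whole phrase, can be pictured as a simple lookup table. The sketch below is hypothetical throughout: the phrases and code words are invented for illustration and are not taken from Lieber's or any other historical code book, and the names CODE_BOOK, encode, decode and word_count are assumptions of this example.

```python
# Hypothetical code book: entries invented for illustration only,
# not taken from Lieber's or any historical telegraphic code.
CODE_BOOK = {
    "please reply by telegraph": "ABACO",
    "shipment has been delayed": "BEFOG",
    "market is rising": "CIVET",
}

# The receiving party needs the same book, read in the other direction.
DECODE_BOOK = {code: phrase for phrase, code in CODE_BOOK.items()}


def encode(phrases):
    """Replace each known phrase with its single code word."""
    return " ".join(CODE_BOOK[phrase] for phrase in phrases)


def decode(message):
    """Expand each code word back into its phrase."""
    return [DECODE_BOOK[word] for word in message.split()]


def word_count(text):
    """Count whitespace-separated words, as a stand-in for per-word pricing."""
    return len(text.split())


plain = ["shipment has been delayed", "please reply by telegraph"]
coded = encode(plain)

print(coded)
print(word_count(" ".join(plain)), "words reduced to", word_count(coded))
```

Here eight plain words are sent as two code words. The saving depends entirely on both parties holding the same book, which is also why the same table can serve secrecy as well as economy.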

[Image: Liebers P1016851.JPG]
After July, 1904, all combinations of letters that do not exceed ten will pass as one cipher word, provided that it is pronounceable, or that it is taken from the following languages: English, French, German, Dutch, Spanish, Portuguese or Latin -- International Telegraphic Conference, July 1903

On the shift to the "machine-centric":

The interaction between code and language shows a steady movement away from a human-centric view of code toward a machine-centric view, thus anticipating the development of full-fledged machine codes with the digital computer. [3]

On the relation to a *universal language*:

Along with the invention of telegraphic codes comes a paradox that John Guillory has noted: code can be used both to clarify and occlude (2010b). Among the sedimented structures in the technological unconscious is the dream of a universal language. Uniting the world in netw

Google Tap

Google Tap was presented as a Google April Fools' "prank": fake product announcements are made each April 1st, reportedly the product of Google's famous "20%" time for "side" projects. [1]

The product was claimed to have been developed by "Reed Morse", great-grandson of Samuel Morse, the developer of the telegraph.

What's notable about Google's (mock) telegraphic interface is that, although its makers cite people's frustrations with modern devices having "too many buttons", they end up presenting an interface with two buttons, when the telegraphic interface had just one and was routinely operated "blind", while the operator read the message and perhaps made notations on paper. While made in jest, the misunderstanding is telling: the performative interface of the telegraph's single button is translated into one where the essential and initial form of the message is symbolic, consisting of the two binary symbols "dot" and "dash".



At the 1940 New York World's Fair, the VODER speaking machine system was demonstrated in sensational fashion. The system was developed by Homer Dudley, an engineer at AT&T Bell Labs.



It's far and away more human-sounding than any text-to-speech system of today. Why? Because of the way it's performed. Rather than starting from written language broken into an approximate translation of phonetic fragments and then applying a slew of statistical and other techniques in an attempt to bring back some sense of the natural expression of a human voice, the VODER system merely offers its user a palette of sounds and leaves it to the operator to perform them.

She saw me

Who saw you?
She saw me

Whom did she see?
She saw me

Did she see you or hear you?
She saw me

Extracting Patterns and Relations from the World Wide Web

Sergey Brin, In Proceedings of the WebDB Workshop at EDBT 1998 http://www-db.stanford.edu/~sergey/extract.ps

The World Wide Web provides a vast source of information of almost all types, ranging from DNA databases to resumes to lists of favorite restaurants. However, this information is often scattered among many web servers and hosts, using many different formats. If these chunks of information could be extracted from the World Wide Web and integrated into a structured form, they would form an unprecedented source of information. It would include the largest international directory of people, the largest and most diverse databases of products, the greatest bibliography of academic works, and many other useful resources

2.1 The Problem

Here we define our problem more formally: Let D be a large database of unstructured information such as the World Wide Web

Data mining pre-Google

A traditional algorithm could not compute the large itemsets in the lifetime of the universe. [...] Yet many data sets are difficult to mine because they have many frequently occurring items, complex relationships between the items, and a large number of items per basket. In this paper we experiment with word usage in documents on the World Wide Web (see Section 4.2 for details about this data set). This data set is fundamentally different from a supermarket data set. Each document has roughly 150 distinct words on average, as compared to roughly 10 items for cash register transactions. We restrict ourselves to a subset of about 24 million documents from the web. This set of documents contains over 14 million distinct words, with tens of thousands of them occurring above a reasonable support threshold. Very many sets of these words are highly correlated and occur often.[4]
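To give a sense of why treating documents as "baskets" of words is so much heavier than supermarket data, here is a naive sketch that counts the support of word pairs, that is, in how many documents two words occur together. It is not the sampling method of the paper cited above; the function names, the three toy documents and the threshold of 2 are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb


def pair_support(documents):
    """Count, for every pair of distinct words, the documents containing both."""
    counts = Counter()
    for doc in documents:
        # A basket is the set of distinct words; sorting gives each pair one form.
        basket = sorted(set(doc.lower().split()))
        counts.update(combinations(basket, 2))
    return counts


def frequent_pairs(documents, min_support):
    """Keep only the pairs that occur in at least min_support documents."""
    support = pair_support(documents)
    return {pair: n for pair, n in support.items() if n >= min_support}


# Toy documents, invented for this example.
docs = [
    "the telegraph code book",
    "the code book of words",
    "a bag of words",
]

print(frequent_pairs(docs, min_support=2))

# Candidate pairs per basket: 10 items versus 150 distinct words.
print(comb(10, 2), comb(150, 2))
```

A basket of 10 items yields 45 candidate pairs, while a document of 150 distinct words yields 11,175, and that is before considering sets of three or more words. The naive counting above is exactly what stops being feasible at the scale the quotation describes.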

Raw data

Tim Berners-Lee and the urge to "liberate your documents"

So, we're at the stage now where we have to do this -- the people who think it's a great idea. And all the people -- and I think there's a lot of people at TED who do things because -- even though there's not an immediate return on the investment because it will only really pay off when everybody else has done it -- they'll do it because they're the sort of person who just does things which would be good if everybody else did them. OK, so it's called linked data. I want you to make it. I want you to demand it. [5]


Parallel shifts: in telegraphy and telephony, a shift occurs from language as something performed by a human body to language captured in code and occurring at a machine scale. In document processing a similar shift occurs, from language as writing to language as symbolic sets of information, to be treated with statistical methods for extracting knowledge in the form of relationships.

Often when we speak of "machine" or "computer" in terms like machine reading or computer vision, we speak of a displacement of human labour, often in order to apply condensed human labour, typically embodied in the form of (trained) statistical models, and of a displacement of responsibility as values become encoded in the form of algorithms and software.

The interest in the "machinic" (minimal human intervention) involves at first glance the "machinic" in the traditional sense of automating labour: replacing the human work of categorizing with an automated process, and in this way opening up the process to a larger quantity of pages and a range of "esoteric" topics that would not be possible to handle with traditional editorial processes. This "machinic" shift is also a business model that learns to extract the value of web surfers' behaviour; the process is then echoed in Google's book digitization, which similarly "extracts" and exploits the value of the work of the collection librarian (on top of the work of the author, the typesetter, and the publisher).

The computer scientist's view of textual content as "unstructured", be it in a webpage or the pages of a scanned text, reflects a neglect of the processes and labour of writing, editing, design, layout, typesetting, and eventually publishing, collecting and cataloguing. (cf. [6])

In other words, "unstructured" means unstructured in relation to the machine -- that is, not explicitly structured in a format directly amenable to use by automated means. "Structuring" then is a process by which structure is made explicit through the use of standards of markup (such as HTML/XML). In this way, the computer scientist is viewing a text through the eyes of their reading algorithm, and in the process (voluntarily) blinding themselves to the work practices which have produced, and maintain, the given textual resources, choosing instead to view them as somehow "freely given" and available to exploit as a "raw material".
  1. Traite p. 391, via W. Boyd Rayward, Introduction of International organisation and dissemination of knowledge: selected essays of Paul Otlet
  2. N. Katherine Hayles, "Technogenesis in Action: Telegraph Code Books and the Place of the Human", How We Think: Digital Media and Contemporary Technogenesis, 2012
  3. Hayles
  4. Dynamic Data Mining: Exploring Large Rule Spaces by Sampling; Sergey Brin and Lawrence Page, 1998; p. 2 http://ilpubs.stanford.edu:8090/424/
  5. Tim Berners-Lee: The next web, TED Talk, February 2009 http://www.ted.com/talks/tim_berners_lee_on_the_next_web/transcript?language=en
  6. http://informationobservatory.info/2015/10/27/google-books-fair-use-or-anti-democratic-preemption/#more-279