A bag but is language nothing of words

From Mondothèque

(language is nothing but a bag of words)
<div class="book"><onlyinclude>In text indexing and other machine reading applications the term "bag of words" is frequently used to underscore how processing algorithms often represent text using a data structure (word histograms or weighted vectors) where the original order of the words in sentence form is stripped away. While "bag of words" might well serve as a cautionary reminder to programmers of the essential violence perpetrated on a text, and a call to critically question the efficacy of methods based on subsequent transformations, the expression's use seems in practice more like a badge of pride or a schoolyard taunt that would go: Hey language: you're nothin' but a big BAG-OF-WORDS.</onlyinclude></div>

[[author::Michael Murtaugh]]
<blockquote>
We must bring together a collection of machines which simultaneously or sequentially can perform the following operations: (1) The transformation of sound into writing; (2) The reproduction of this writing in as many copies as are useful; (3) The creation of documents in such a way that each item of information has its own identity and, in its relationships with those items comprising any collection, can be retrieved as necessary; (4) A classification number assigned to each item of information; the perforation of documents correlated with these numbers; (5) Automatic classification and filing of documents; (6) Automatic retrieval of documents for consultation and presented either direct to the enquirer or via machine enabling written additions to be made to them; (7) Mechanical manipulation at will of all the listed items of information in order to obtain new combinations of facts, new relationships of ideas, and new operations carried out with the help of numbers. The technology fulfilling these seven requirements would indeed be a mechanical, collective brain. <ref>Otlet, ''Traité de documentation'', p. 391, via W. Boyd Rayward, ''International organisation and dissemination of knowledge: selected essays of Paul Otlet''</ref>
</blockquote>

== Bag of words ==
In information retrieval and other so-called ''machine-reading'' applications (such as text indexing for web search engines) the term "bag of words" is frequently used to underscore how processing algorithms often represent text using data structures (word histograms or weighted vectors) where the original order of the words in sentence form is stripped away.

In essence, "bag of words" is a process whereby a text is reduced to a kind of bar chart of words and their counts, or, perhaps more to the point, to a kind of "word cloud". But why? The point of bag of words lies in its relationship to code in two ways: first, the translation from initial document to bag-of-words representation is very straightforward to implement; second, and more significantly, the representation opens up a wide array of tools and techniques for further transformation and analysis of this and other documents. For instance, a number of libraries available in the booming field of "data science" work with "high-dimension" vectors; bag of words is a way to transform a written document into a mathematical vector where each "dimension" corresponds to a unique word. While physically unimaginable and comically abstract (Shakespeare's Macbeth becomes a point in a space of 14 million dimensions), from a formal mathematical perspective it's quite a comfortable idea, and many complementary techniques (such as principal component analysis) exist to reduce the apparent resulting complexity.

In a blog post, Michael Erasmus explains the technique in the context of "tf-idf":<ref>http://michaelerasm.us/tf-idf-in-10-minutes/</ref>
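A minimal sketch of this first translation step, using only the Python standard library (my own illustration, not code from the text; the sample is the Carl Sagan quote used as an example below):

<pre>
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase, split into words, and count: all sentence
    # order is deliberately thrown away at this point.
    return Counter(re.findall(r"[a-z']+", text.lower()))

quote = ("Just the fact that some geniuses were laughed at does not "
         "imply that all who are laughed at are geniuses. They laughed "
         "at Columbus, they laughed at Fulton, they laughed at the "
         "Wright brothers. But they also laughed at Bozo the Clown")

print(bag_of_words(quote).most_common(4))
# [('laughed', 6), ('at', 6), ('they', 4), ('the', 3)]
</pre>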
 
  
 
<blockquote>
<p>First, let's just define what I mean with document. For our purposes, a document can be thought of as all the words in a piece of text, broken down by how frequently each word appears in the text.</p>

<p>Say for example, you had a very simple document such as this quote:</p>

<blockquote>Just the fact that some geniuses were laughed at does not imply that all who are laughed at are geniuses. They laughed at Columbus, they laughed at Fulton, they laughed at the Wright brothers. But they also laughed at Bozo the Clown. - Carl Sagan</blockquote>

<p>This structure is also often referred to as a Bag of Words. Although we care about how many times a word appears in a document, we ignore the order in which words appear.</p>
</blockquote>
Brin and Page, describing their early experiments in mining the web, underscore how such a representation turns documents into something like supermarket baskets, if unusually large ones:

<blockquote>
A traditional algorithm could not compute the large itemsets in the lifetime of the universe. [...] Yet many data sets are difficult to mine because they have many frequently occurring items, complex relationships between the items, and a large number of items per basket. In this paper we experiment with word usage in documents on the World Wide Web (see Section 4.2 for details about this data set). This data set is fundamentally different from a supermarket data set. Each document has roughly 150 distinct words on average, as compared to roughly 10 items for cash register transactions. We restrict ourselves to a subset of about 24 million documents from the web. This set of documents contains over 14 million distinct words, with tens of thousands of them occurring above a reasonable support threshold. Very many sets of these words are highly correlated and occur often. <ref>Dynamic Data Mining: Exploring Large Rule Spaces by Sampling; Sergey Brin and Lawrence Page, 1998; p. 2, http://ilpubs.stanford.edu:8090/424/</ref>
</blockquote>

While "bag of words" might well serve as a cautionary reminder to programmers of the essential violence perpetrated on a text, and a call to critically question the efficacy of methods based on subsequent transformations, the expression's use seems in practice more like a badge of pride or a schoolyard taunt that would go: Hey language: you're nothin' but a big BAG-OF-WORDS. In this way BOW celebrates a perfunctory step of "breaking" a text into a purer form amenable to computation, of stripping language of its silly redundant repetitions and foolishly contrived stylistic phrasings to reveal a purer inner essence.

== Book of words ==

Katherine Hayles devotes a chapter to telegraph code books:<ref>"Technogenesis in Action: Telegraph Code Books and the Place of the Human", in N. Katherine Hayles, ''How We Think: Digital Media and Contemporary Technogenesis'', 2012</ref>

<blockquote>
[...] my focus in this chapter is on the inscription technology that grew parasitically alongside the monopolistic pricing strategies of telegraph companies: telegraph code books. Constructed under the bywords “economy,” “secrecy,” and “simplicity,” telegraph code books matched phrases and words with code letters or numbers. The idea was to use a single code word instead of an entire phrase, thus saving money by serving as an information compression technology. Generally economy won out over secrecy, but in specialized cases, secrecy was also important.
</blockquote>

<blockquote>
The interaction between code and language shows a steady movement away from a human-centric view of code toward a machine-centric view, thus anticipating the development of full-fledged machine codes with the digital computer. <ref>Hayles, ''How We Think''</ref>
</blockquote>

<gallery>
File:Liebers P1016847.JPG|A book of words: Lieber's Standard Telegraphic Code Book (initially published 1896), republished in various editions in the early 1900s
File:Liebers P1016869.JPG|Code words and phrases related to the word '''act'''
File:Liebers P1016859.JPG|A list of examples showing the code words and the resulting translation as phrases; the number of words in both cases is listed to underscore the relative compactness of the encoded representation
File:Liebers P1016851.JPG|"After July, 1904, all combinations of letters that do not exceed ten will pass as one cipher word, provided that it is pronounceable, or that it is taken from the following languages: English, French, German, Dutch, Spanish, Portuguese or Latin" -- International Telegraphic Conference, July 1903
</gallery>
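The economic mechanism Hayles describes is that of a lookup table, with compression as its goal; a sketch (the entries here are invented for illustration, not taken from Lieber's actual code):

<pre>
# A telegraph code book as an associative array: one pronounceable,
# billable code word stands in for an entire stock phrase.
CODE = {
    "ABAFT": "Awaiting your instructions by return telegraph",
    "ABHOR": "Coffee at Le Havre has risen a quarter of a franc",
    "ABIDE": "Unable to ship before next week",
}
PHRASE = {phrase: word for word, phrase in CODE.items()}

def encode(phrase):
    return PHRASE[phrase]   # one billable word instead of many

def decode(word):
    return CODE[word]

print(encode("Unable to ship before next week"))  # ABIDE
print(decode("ABHOR"))
</pre>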
  
What telegraph code books do is remind us of the relation of language in general to economy. Whether they are economies of memory, of attention, of costs paid to a telecommunications company, or of computer processing time or storage space, encoding knowledge is a form of shorthand and always involves an interplay with what we then expect to perform or "get out" of the resulting encoding.

On the relation to a ''universal language'':

<blockquote>Along with the invention of telegraphic codes comes a paradox that John Guillory has noted: code can be used both to clarify and occlude (2010b). Among the sedimented structures in the technological unconscious is the dream of a universal language. Uniting the world in networks of communication that flashed faster than ever before, telegraphy was particularly suited to the idea that intercultural communication could become almost effortless. In this utopian vision, the effects of continuous reciprocal causality expand to global proportions capable of radically transforming the conditions of human life. That these dreams were never realized seems, in retrospect, inevitable. <ref>Hayles, ''How We Think''</ref></blockquote>

Lieber's code encodes not language as a whole, but specific language related to the particular needs and conditions of its use. ... Lieber's code is a condensation of business communication... Specifics include a table of words describing the possible rise or fall of the price of coffee at Le Havre in increments of a quarter of a Franc, or ...

On the embodiment of receiving the codes:

<blockquote>Once learned and practiced routinely, however, sound receiving became as easy as listening to natural-language speech; one decoded automatically, going directly from sounds to word impressions. A woman who worked on Morse code receiving as part of the massive effort at Bletchley Park to decrypt German Enigma transmissions during World War II reported that after her intense experiences there, she heard Morse code everywhere—in traffic noise, bird songs, and other ambient sounds—with her mind automatically forming the words to which the sounds putatively corresponded. Although no scientific data exist on the changes sound receiving made in neural functioning, we may reasonably infer that it brought about long-lasting changes in brain activation patterns, as this anecdote suggests. <ref>Hayles, ''How We Think''</ref></blockquote>

On the so-called "mutilations" of messages:

<blockquote>If bodily capacities enabled the “miraculous” feat of sound receiving, bodily limitations often disrupted and garbled messages. David Kahn (1967) reports that “a telegraph company’s records showed that fully half its errors stemmed from the loss of a dot in transmission, and another quarter by the insidious false spacing of signals” (839). (Kahn uses the conventional “dot” here, but telegraphers preferred “dit” rather than “dot” and “dah” rather than “dash,” because the sounds were more distinctive and because the “dit dah” combination more closely resembled the alternating patterns of the telegraph sounder.) Kahn’s point is illustrated in Charles Lewes’s “Freaks of the Telegraph” (1881), in which he complained of the many ways in which telegrams could go wrong. He pointed out, for example, that in Morse code bad (dah dit dit dit [b] dit dah [a] dah dit dit [d]) differs from dead (dah dit dit [d] dit [e] dit dah [a] dah dit dit [d]) only by a space between the d and e in dead (i.e., _. . . . _ _ . . versus _. . . . _ _. .). This could lead to such confounding transformations as “Mother was bad but now recovered” into “Mother was dead but now recovered.” Of course, in this case a telegraph operator (short of believing in zombies) would likely notice something was amiss and ask for confirmation of the message—or else attempt to correct it himself. <ref>Hayles, ''How We Think''</ref></blockquote>
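Hayles's bad/dead example can be checked directly; a small sketch (my own, using standard International Morse for the letters involved):

<pre>
MORSE = {"a": ".-", "b": "-...", "d": "-..", "e": "."}

def to_morse(word, gap=" "):
    # The gap between letters is part of the signal's meaning.
    return gap.join(MORSE[c] for c in word)

print(to_morse("bad"))   # -... .- -..
print(to_morse("dead"))  # -.. . .- -..

# Remove the inter-letter spacing and the two words become
# literally indistinguishable: Kahn's "insidious false spacing".
print(to_morse("bad", gap="") == to_morse("dead", gap=""))  # True
</pre>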
In addition, the advertisements lining the book give a rich accounting of the situation of the code's use (and of the commercial use of telegraphy in general). Among the advertisements for alcohol, banking, law offices, and office equipment are several ads for gunpowder and explosives: not (just) for weapons, but specifically for mining.

As the flip side of telegraphy's role in safety communication from ship to shore and ship to ship, commercial telegraphy provided a means of coordinating the "raw materials" being mined, grown, or otherwise extracted from colonial sources and shipped back for sale.

== Google Tap ==

In a Google April Fools' "prank" (fake product announcements made each April 1st, reportedly a product of Google's famous "20% time" for side projects [http://www.forbes.com/sites/johnkotter/2013/08/21/googles-best-new-innovation-rules-around-20-time/]), Google presented a telegraphic input interface, claimed to be developed by "Reed Morse", great-grandson of Samuel Morse, the developer of the telegraph.

What's notable about Google's (mock) interface to telegraphy is that although it cites people's frustrations with modern devices having "too many buttons", it ends up presenting an interface with two buttons, where the telegraphic interface had just one, routinely operated "blind" while the operator's eyes read the message or made notations on paper. While made in jest, the misunderstanding is telling: the performative interface of the telegraph's single button is translated into one where the essential and initial form of the message is symbolic, consisting of the two binary symbols "dot" and "dash".

== VODER ==
[[File:Voder03.jpg|thumb]]
[[File:Schematic-Circuit-of-the-VODER.jpeg|thumb]]
[[File:VODER-Worlds-Fair-Pamphlet.jpeg|thumb]]

At the 1940 New York World's Fair, the VODER speaking machine was demonstrated in sensational fashion. The system was developed by Homer Dudley, an engineer at AT&T Bell Labs.

https://www.youtube.com/watch?v=0rAyrmm7vv0 (video)

It's far and away more human-sounding than any text-to-speech system of today. Why? Because of the way it's performed. Rather than starting from written language, breaking it into an approximate translation of phonetic fragments, and then applying a slew of statistical and other techniques in an attempt to bring back some sense of the natural expression of a human voice, the VODER merely offers its user a palette of sounds and leaves it to the operator to perform them.
The contrast is audible in how stress alone changes the meaning of a spoken sentence:

She saw me

Who saw you? <br>
'''She''' saw me

Whom did she see? <br>
She saw '''me'''

Did she see you or hear you? <br>
She '''saw''' me

== Raw data now! ==

Tim Berners-Lee and the urge to "liberate your documents":

<pre>
[...]
Tim Berners-Lee:
Make a beautiful website, but first give us the unadulterated data, we want the data. We want unadulterated data. OK, we have to ask for raw data now. And I'm going to ask you to practice that, OK? Can you say "raw"?

Audience: Raw.

Tim Berners-Lee: Can you say "data"?

Audience: Data.

TBL: Can you say "now"?

Audience: Now!

TBL: Alright, "raw data now"!
[...]
</pre>

<blockquote>So, we're at the stage now where we have to do this -- the people who think it's a great idea. And all the people -- and I think there's a lot of people at TED who do things because -- even though there's not an immediate return on the investment because it will only really pay off when everybody else has done it -- they'll do it because they're the sort of person who just does things which would be good if everybody else did them. OK, so it's called linked data. I want you to make it. I want you to demand it. <ref>Tim Berners-Lee: The next web, TED Talk, February 2009, http://www.ted.com/talks/tim_berners_lee_on_the_next_web/transcript?language=en</ref></blockquote>

== Un/Structured ==

<blockquote>
<p>
The World Wide Web provides a vast source of information of almost all types, ranging from DNA databases to resumes to lists of favorite restaurants. However, this information is often scattered among many web servers and hosts, using many different formats. If these chunks of information could be extracted from the World Wide Web and integrated into a structured form, they would form an unprecedented source of information. It would include the largest international directory of people, the largest and most diverse databases of products, the greatest bibliography of academic works, and many other useful resources. [...]
</p>
<p>
2.1 The Problem<br>
Here we define our problem more formally:<br>
Let D be a large database of unstructured information such as the World Wide Web [...] <ref>Extracting Patterns and Relations from the World Wide Web, Sergey Brin, Proceedings of the WebDB Workshop at EDBT 1998, http://www-db.stanford.edu/~sergey/extract.ps</ref>
</p>
</blockquote>
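A caricature of this "structuring" move, reduced to a single regular expression (my own illustration; Brin's actual method, DIPRE, bootstraps such patterns iteratively from a handful of seed (author, title) pairs):

<pre>
import re

# Impose a pattern on free text and read a "relation" out of it.
text = """
The Art of Computer Programming, by Donald Knuth.
Traité de documentation, by Paul Otlet.
"""

pattern = re.compile(r"(?P<title>[A-Z][^,\n]+), by (?P<author>[^.\n]+)\.")
for m in pattern.finditer(text):
    print((m.group("author"), m.group("title")))
# ('Donald Knuth', 'The Art of Computer Programming')
# ('Paul Otlet', 'Traité de documentation')
</pre>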
  
== Un/Ordered ==

In programming, I've encountered a recurring "problem" that's quite symptomatic. It goes something like this: you (the programmer) have managed to cobble together a lovely "content management system" (either from scratch, or using one of dozens of popular frameworks) where authors (the client) can enter some "items" (for instance bookmarks) into a database. These items are then automatically presented in list form (say on a starting page). The author: It's great, except... could this bookmark come before that one? The problem stems from the fact that the database ordering (a core functionality provided by any database) applies a sorting logic that's almost, but not quite, right. A typical example is the sorting of names, where the details (where to place a name that starts with a Norwegian "Ø", for instance) are language-specific, and when a mixture of languages occurs, no single ordering is necessarily "correct". The (often) exasperated programmer hastily adds an additional widget so that each item can also be given an "order" (perhaps in the form of a date, or just some kind of (alpha)numerical "sorting" value) used to correctly order the resulting list. Now the author has a means, awkward and indirect but workable, to control the order of the presented data on the start page. But one might well ask: why not just edit the resulting listing as a document? This problem, in this and many variants, is widespread, and it reveals an essential backwardness in a particular "computer scientist" mindset about what constitutes "data", and in particular about data's relationship to order, that turns what might be a straightforward question of editing a document into an over-engineered database.
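The "Ø" problem is easy to demonstrate (a sketch; the exact results depend on which locales and collation tables the system has installed):

<pre>
import locale

names = ["Østby", "Zimmer", "Aas"]

# Naive sorting compares Unicode code points, so "Ø" (U+00D8)
# lands after "Z":
print(sorted(names))  # ['Aas', 'Zimmer', 'Østby']

# An English collation typically treats "Ø" as a variant of "O"...
locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
print(sorted(names, key=locale.strxfrm))  # ['Aas', 'Østby', 'Zimmer']

# ...while Norwegian sorts it as its own letter, near the end of
# the alphabet. Neither order is "correct" for a mixed-language list.
locale.setlocale(locale.LC_COLLATE, "nb_NO.UTF-8")
print(sorted(names, key=locale.strxfrm))  # ['Aas', 'Zimmer', 'Østby']
</pre>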
While recently working together, Nikolaos Vogiatzis, whose research explores playful and radically subjective alternatives to the list, was struck by how the earliest specifications of HTML (still valid today) have had separate elements (OL and UL) for "ordered" and "unordered" lists.
  
<blockquote>
<p>
The representation of the list is not defined here, but a bulleted list for unordered lists, and a sequence of numbered paragraphs for an ordered list would be quite appropriate. Other possibilities for interactive display include embedded scrollable browse panels.
</p>

<p>List elements with typical rendering are:</p>

<pre>
UL                    A list of multi-line paragraphs, typically
                      separated by some white space and/or marked
                      by bullets, etc.

OL                    As UL, but the paragraphs are typically
                      numbered in some way to indicate the order as
                      significant.
</pre>

<ref>Hypertext Markup Language (HTML): "Internet Draft", Tim Berners-Lee and Daniel Connolly, June 1993, http://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt</ref>
</blockquote>
  
Vogiatzis's surprise lay in the idea of a list ever being considered "unordered" (or, in the language of the specification, of order ever being considered "insignificant"). Indeed, in its suggested representation, still followed by modern web browsers, the only visual difference between the two is that UL items are preceded by a bullet symbol, while OL items are numbered.

(Separation of content and representation)

The idea of ordering runs deep in programming practice, where essentially different data structures are employed depending on whether order is to be maintained. Many common data structures (the [https://en.wikipedia.org/wiki/Hash_table hash table] or associative array, for instance) are structured to offer other kinds of efficiencies (fast text-based retrieval, for instance) at the expense of losing any original sequence: the keys of a hash table are ordered in an unpredictable way, governed by the needs of that representation's particular implementation.
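In Python, for instance (a sketch; note that CPython's dict, itself a hash table, has additionally preserved insertion order since version 3.7, so the set here stands in for the classic behaviour described above):

<pre>
# A list preserves the sequence in which items were entered:
bookmarks = ["mondotheque.be", "w3.org", "stanford.edu", "ted.com"]
print(bookmarks)
# ['mondotheque.be', 'w3.org', 'stanford.edu', 'ted.com']

# A hash-based structure trades that sequence away for fast lookup.
# Iteration order follows the internal hash layout, not the order
# of entry, and can vary from run to run:
print(set(bookmarks))
# e.g. {'stanford.edu', 'ted.com', 'mondotheque.be', 'w3.org'}
</pre>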
  
== Parallels/Conclusion ==

In the parallel shifts of telegraphy and telephony, language moves from something performed by a human body to something captured in code and occurring at a machine scale. In document processing a similar shift occurs: from language as writing to language as symbolic sets of information, to be treated with statistical methods for ''extracting'' knowledge in the form of relationships.

Often when we speak of the machine or the computer, in terms like ''machine reading'' or ''computer vision'', we speak of a displacement of human labour ... often applying condensed human labour, typically embodied in the form of (trained) statistical models ... and a displacement of responsibility, as values become encoded in the form of algorithms/software.

The interest in the "machinic" (minimal human intervention) involves, at first glance, the "machinic" in the traditional sense of automating labour: replacing the human work of categorizing with an automated process, in this way opening up the process to a larger quantity of pages and to a range of "esoteric" topics which would not be possible to handle with traditional editorial processes. This "machinic" shift is a business model that learns to extract the value of web surfers' behaviour; the process is then echoed in Google's book digitization, which similarly "extracts" / exploits the value of the work of the collection librarian (on top of the work of the author, the typesetter, the publisher).

The computer scientist's view of textual content as "unstructured", be it in a web page or in the pages of a scanned text, underscores / reflects a negligence of the processes and labour of writing, editing, design, layout, typesetting, and eventually publishing, collecting and cataloguing (cf. [http://informationobservatory.info/2015/10/27/google-books-fair-use-or-anti-democratic-preemption/#more-279]).

In other words, by "unstructured" is meant: unstructured in relation to the machine -- that is, not explicitly structured in a format directly amenable to use by automated means. "Structuring", then, is a process by which structure is made explicit through the use of standards of markup (such as HTML/XML). In this way, the computer scientist views a text through the eyes of their reading algorithm, and in the process (voluntarily) blinds themselves to the work practices which have produced, and which maintain, the given textual resources, choosing to view them instead as somehow "freely given" and available to exploit as "raw material".

== Notes ==
<references />
