A bag but is language nothing of words

From Mondothèque

[[category:publication]]
 
 
[[author::Michael Murtaugh]]
 

Revision as of 17:03, 14 December 2015

In text indexing and other machine reading applications the term "bag of words" is frequently used to underscore how processing algorithms often represent text using a data structure (word histograms or weighted vectors) where the original order of the words in sentence form is stripped away. While "bag of words" might well serve as a cautionary reminder to programmers of the essential violence perpetrated to a text and a call to critically question the efficacy of methods based on subsequent transformations, the expression's use seems in practice more like a badge of pride or a schoolyard taunt that would go: Hey language: you're nothin' but a big BAG-OF-WORDS.

DRAFT DOCUMENT

Lieber's Standard Telegraphic Code (1896)

N. Katherine Hayles devotes a chapter entitled "Technogenesis in Action: Telegraph Code Books and the Place of the Human" to the subject in her 2012 book How We Think: Digital Media and Contemporary Technogenesis:

[...] my focus in this chapter is on the inscription technology that grew parasitically alongside the monopolistic pricing strategies of telegraph companies: telegraph code books. Constructed under the bywords “economy,” “secrecy,” and “simplicity,” telegraph code books matched phrases and words with code letters or numbers. The idea was to use a single code word instead of an entire phrase, thus saving money by serving as an information compression technology. Generally economy won out over secrecy, but in specialized cases, secrecy was also important.

The interaction between code and language shows a steady movement away from a human-centric view of code toward a machine-centric view, thus anticipating the development of full-fledged machine codes with the digital computer.

(example of the front page: Conventions to use more abstract language)

From the cover sheet of Lieber's:

After July, 1904, all combinations of letters that do not exceed ten will pass as one cipher word, provided that it is pronounceable, or that it is taken from the following languages: English, French, German, Dutch, Spanish, Portuguese or Latin -- International Telegraphic Conference, July 1903
Along with the invention of telegraphic codes comes a paradox that John Guillory has noted: code can be used both to clarify and occlude (2010b). Among the sedimented structures in the technological unconscious is the dream of a universal language. Uniting the world in networks of communication that flashed faster than ever before, telegraphy was particularly suited to the idea that intercultural communication could become almost effortless. In this utopian vision, the effects of continuous reciprocal causality expand to global proportions capable of radically transforming the conditions of human life. That these dreams were never realized seems, in retrospect, inevitable.
Once learned and practiced routinely, however, sound receiving became as easy as listening to natural-language speech; one decoded automatically, going directly from sounds to word impressions. A woman who worked on Morse code receiving as part of the massive effort at Bletchley Park to decrypt German Enigma transmissions during World War II reported that after her intense experiences there, she heard Morse code everywhere—in traffic noise, bird songs, and other ambient sounds—with her mind automatically forming the words to which the sounds putatively corresponded. Although no scientific data exist on the changes sound receiving made in neural functioning, we may reasonably infer that it brought about long-lasting changes in brain activation patterns, as this anecdote suggests.
If bodily capacities enabled the “miraculous” feat of sound receiving, bodily limitations often disrupted and garbled messages. David Kahn (1967) reports that “a telegraph company’s records showed that fully half its errors stemmed from the loss of a dot in transmission, and another quarter by the insidious false spacing of signals” (839). (Kahn uses the conventional “dot” here, but telegraphers preferred “dit” rather than “dot” and “dah” rather than “dash,” because the sounds were more distinctive and because the “dit dah” combination more closely resembled the alternating patterns of the telegraph sounder.) Kahn’s point is illustrated in Charles Lewes’s “Freaks of the Telegraph” (1881), in which he complained of the many ways in which telegrams could go wrong. He pointed out, for example, that in Morse code bad (dah dit dit dit [b] dit dah [a] dah dit dit [d]) differs from dead (dah dit dit [d] dit [e] dit dah [a] dah dit dit [d]) only by a space between the d and e in dead (i.e., _. . . . _ _ . . versus _. . . . _ _. .). This could lead to such confounding transformations as “Mother was bad but now recovered” into “Mother was dead but now recovered.” Of course, in this case a telegraph operator (short of believing in zombies) would likely notice something was amiss and ask for confirmation of the message—or else attempt to correct it himself.
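Lewes's bad/dead confusion is easy to reproduce. The following minimal Python sketch (using the standard International Morse encodings for the four letters involved) shows that once the space between letters is lost, the two words collapse into the same undifferentiated stream of dits and dahs:

```python
# Morse code for the letters in question ("." for dit, "-" for dah).
MORSE = {"a": ".-", "b": "-...", "d": "-..", "e": "."}

def to_morse(word, letter_gap=" "):
    """Encode a word letter by letter, joining the letters with
    letter_gap; the gap is all that separates "bad" from "dead"."""
    return letter_gap.join(MORSE[c] for c in word)

print(to_morse("bad"))   # -... .- -..
print(to_morse("dead"))  # -.. . .- -..

# With the letter gaps removed, both words are the identical signal:
assert to_morse("bad", "") == to_morse("dead", "")
```

Dropping a single inter-letter space in transmission is thus enough to turn "Mother was bad" into "Mother was dead".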

What telegraph code books remind us of is the relation of language in general to economy. Whether the economies are of memory, of attention, of costs paid to a telecommunications company, or of computer processing time and storage space, encoding knowledge is a form of shorthand and always involves an interplay with what we then expect to perform or "get out" of the resulting encoding.
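The interplay Hayles describes, a whole phrase in and a single code word out, can be sketched as a simple lookup table. A minimal Python sketch; the phrases and code words below are invented for illustration and are not taken from Lieber's:

```python
# A toy telegraph code book: each full phrase is replaced by one
# pronounceable code word, compressing the message (and so its cost).
# All entries are invented for illustration.
CODE_BOOK = {
    "market is rising": "ABAFT",
    "sell at best price": "BOCAL",
    "await further instructions": "CADUX",
}
DECODE_BOOK = {code: phrase for phrase, code in CODE_BOOK.items()}

def encode(message):
    """Replace every phrase found in the code book with its code word."""
    for phrase, code in CODE_BOOK.items():
        message = message.replace(phrase, code)
    return message

def decode(message):
    """Reverse the substitution, recovering the original phrases."""
    for code, phrase in DECODE_BOOK.items():
        message = message.replace(code, phrase)
    return message

msg = "sell at best price and await further instructions"
coded = encode(msg)   # "BOCAL and CADUX": fewer billable words
assert decode(coded) == msg
```

The compression only pays off, of course, for messages the code book's compilers anticipated; everything else must still be spelled out in full.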


Google Tap

In a Google April Fools' "prank" (where humorous "fake product" announcements are made each April 1st, reportedly the product of Google's famous "20%" time for "side" projects) [1], Google announced a Morse-code-based keyboard for mobile phones.

<clip>

The product was claimed to be developed by "Reed Morse", great-grandson of Samuel Morse, the developer of the telegraph.

What's notable about Google's (mock) interface of telegraphy is that although they cite people's frustrations with modern devices having "too many buttons", they end up presenting an interface with two, when the telegraphic interface had just one, and was routinely operated "blind" while the operator's eyes read the message and perhaps made notations on paper. While made in jest, the misunderstanding is telling: the performative interface of the telegraph's single button is translated into one where the essential and initial form of the message is symbolic, consisting of the two binary symbols "dot" and "dash".

TF-IDF

In text indexing and other machine reading applications (such as Google's core business of search) the term "bag of words" is frequently used to underscore how processing algorithms typically represent text using a reductive data structure where the original order of the words in sentence form is stripped away. In a recent blog post, Michael Erasmus explains the technique "in 10 minutes": http://michaelerasm.us/tf-idf-in-10-minutes/:

First, let's just define what I mean by document. For our purposes, a document can be thought of as all the words in a piece of text, broken down by how frequently each word appears in the text.

Say for example, you had a very simple document such as this quote:

Just the fact that some geniuses were laughed at does not imply that all who are laughed at are geniuses. They laughed at Columbus, they laughed at Fulton, they laughed at the Wright brothers. But they also laughed at Bozo the Clown - Carl Sagan

This structure is also often referred to as a Bag of Words. Although we care about how many times a word appears in a document, we ignore the order in which words appear.
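The structure Erasmus describes takes only a few lines of Python: counting the words of the Sagan quote keeps the frequencies and throws away the order. A minimal sketch, not the code from the blog post:

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text, split it into words, and count each one:
    sentence order is discarded, only word frequencies remain."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

quote = ("Just the fact that some geniuses were laughed at does not imply "
         "that all who are laughed at are geniuses. They laughed at Columbus, "
         "they laughed at Fulton, they laughed at the Wright brothers. "
         "But they also laughed at Bozo the Clown")

bow = bag_of_words(quote)
print(bow.most_common(3))
```

From the bag alone there is no way to reconstruct who laughed at whom; "laughed" simply occurs six times.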

While "bag of words" might well serve as a cautionary reminder to programmers of the essential violence perpetrated to a text and a call to critically question the efficacy of methods based on subsequent transformations, the expression's use seems in practice more like a badge of pride ... a schoolyard taunt that would go: Hey language: you're nothin' but a big BAG-OF-WORDS. In this way BOW celebrates the apparently perfunctory step of "breaking" a text into a purer form amenable to computation, to stripping language of its silly redundant repetitions and foolishly contrived stylistic phrasings to reveal its cleaner inner essence.


VODER

At the 1939 New York World's Fair, the VODER speaking machine was demonstrated in sensational fashion. (Video capture from YouTube: developed by Homer Dudley at Bell Labs and demonstrated at both the 1939 New York World's Fair and the 1939 Golden Gate International Exposition)

https://www.youtube.com/watch?v=0rAyrmm7vv0

(video)

It is far and away more human-sounding than any text-to-speech system of today. Why? Because of the way it is performed. Rather than starting from written language, breaking it into an approximate translation of phonetic fragments, and then applying a slew of statistical and other techniques in an attempt to bring back some sense of the natural expression of a human voice, the VODER system merely offers its user a palette of sounds and leaves it to the operator to perform them.

She saw me

Who saw you? SHE saw me.

Whom did she see? She saw ME.

Did she see you or hear you? She SAW me.

Michael Murtaugh