A Book of the Web

From Mondothèque

Dušan Barok

Is there a vital difference between publishing in print versus online, other than reaching different groups of readers and having a different lifespan? Both types of texts are worth considering for preservation in libraries. The online environment has created its own hybrid form between text and library, which is key to understanding how digital text produces difference.

Historically, we have treated texts as discrete units, distinguished by their material properties such as cover, binding, and script. These characteristics establish them as a book, a magazine, a diary, sheet music, and so on. One book differs from another, books differ from magazines, printed matter differs from handwritten manuscripts. Each volume is a self-contained whole, further distinguished by descriptors such as title, author, date, publisher, and classification codes that allow it to be located and referred to. The demarcation of a publication as a container of text works as a frame or boundary that organises the way it can be located and read. Researching a particular subject, the reader is carried along by the classification schemes under which volumes are organised, by references inside texts pointing to yet other volumes, and by tables of contents and indexes of subjects appended to texts, pointing to places within the same volume.

So while their material properties separate texts into distinct objects, bibliographic information provides each object with a unique identifier, a unique address in the world of print culture. Such identifiable objects are further replicated and distributed across containers that we call libraries, where they can be accessed.

The online environment, however, intervenes in this condition. It establishes shortcuts. Through search engines, digital texts can be queried for any text sequence, regardless of their distinct materiality and bibliographic specificity. This changes the way texts function as a library, and calls for rethinking the library's main object, the book.

(1) Rather than operating as distinct entities, multiple texts are simultaneously accessible through full-text search as if they were one long text, with its portions spread across the web, including texts that had never been considered candidates for library collections.

(2) The unique identifier at hand for these text portions is not the bibliographic information, but the URL.

(3) The text is as long as the web crawlers of a given search engine are set to reach, refashioning the library into a store of indexed data.

These are some of the lines along which online texts appear to produce difference. The first contrasts the distinct printed publication with the machine-readable text, the second bibliographic information with the URL, and the third the library with the search engine.
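As a rough illustration of these three contrasts, the sketch below builds a toy full-text index in Python. The pages, URLs, and contents are invented for the example, and a real search engine is incomparably more elaborate; the point is only that the index is keyed by URL rather than by any bibliographic record, and that it dissolves the boundaries between the documents it covers.

```python
from collections import defaultdict

# A toy corpus: whatever the crawler was set to reach, keyed by URL
# rather than by any bibliographic record. All URLs and contents are
# made up for illustration.
corpus = {
    "https://example.org/articles/otlet.html": "the dream of a world library",
    "https://example.org/reviews/item42": "a disappointing binding, but a readable text",
    "https://example.org/forum/post?id=7": "has anyone scanned this book yet",
}

# Build an inverted index: each word points to the URLs it occurs in.
# This is the step that dissolves document boundaries - every page
# becomes a portion of one long searchable text.
index = defaultdict(set)
for url, text in corpus.items():
    for word in text.lower().split():
        index[word].add(url)

def search(term: str) -> set[str]:
    """Return the URLs containing the term, regardless of the
    'publication' each one once belonged to."""
    return index.get(term.lower(), set())

print(search("library"))   # {'https://example.org/articles/otlet.html'}
print(search("book"))      # {'https://example.org/forum/post?id=7'}
```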

FS: But isn't it also about the way you provide access, the library as interface? Online, you now leave that to Google.

SVP: Access is no longer a matter of "this institution has this, that institution has something else"; all those institutions can be reached through the same interface. You can search across all those collections, and that too is a piece of the original dream of Otlet and Vander Haeghen, the idea of a world library. For every book there is a user; the library just has to go and find them.

What I find intriguing is that all books have become one book because they are searchable at the same level; that is incredibly exciting. It is a different way of reading that even Otlet could not have imagined. They would go mad if they knew this.

The introduction of full-text search has created an environment in which all machine-readable online documents within reach are effectively treated as one single document. For any text sequence to be locatable, it does not matter in which file format it appears, nor whether its interface is a database-powered website or a mere directory listing. As long as text can be extracted from a document, the document is a container of text sequences, and itself a sequence in a 'book' of the web.
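What "as long as text can be extracted" amounts to in practice can be sketched in a few lines. The handling below is deliberately crude and the two formats are only examples; any pipeline that ends in plain text would do the same job for the index.

```python
import re
from pathlib import Path

def extract_text(path: Path) -> str:
    """Crude extraction: once plain text comes out, the original
    format no longer matters to the full-text index."""
    raw = path.read_text(errors="ignore")
    if path.suffix.lower() in {".html", ".htm"}:
        # Strip markup; a real pipeline would use a proper HTML parser.
        return re.sub(r"<[^>]+>", " ", raw)
    # Treat anything else as plain text for the purposes of this sketch.
    return raw
```

Whether the file arrived as a chapter, a comment, or a sensor log is invisible once it has passed through such a function.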

Even though this is hardly news after almost two decades of Google Search's reign, little seems to have changed with respect to the forms and genres of writing. Loyal to standard forms of publishing, most writing still adheres to the principle of coherence, based on units such as book chapters, journal papers, newspaper articles, etc., that are designed to be read from beginning to end.

Still, the scope of textual forms appearing in search results, and thus the corpus of texts into which they are brought, has radically diversified: it may include discussion-board comments, product reviews, private e-mails, weather information, spam, and other types of content that used to be omitted from library collections. Rather than being published in the traditional sense, all these texts are produced onto digital networks by mere typing, copying, or OCR-ing, or are generated by machines and by sensors tracking movement, temperature, and so on.

Even though portions of these texts may come with human or non-human authors attached, authors have relatively little control over the discourses their writing gets embedded in. This is also where the ambiguity of copyright manifests itself. Crawling bots pre-read the internet, with all its attached devices, according to the agenda of their maintainers, and the decisions about which indexed texts are served in search results, how, and to whom, are made in the code of a library.
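"In the code of a library" can be read quite literally: whatever inclusion rules a crawler's maintainer writes down decide what becomes searchable at all. A minimal, entirely hypothetical sketch of such a policy (the domains are invented; real crawlers combine robots.txt, commercial priorities, legal constraints and much else):

```python
# A toy crawl policy: the maintainer's agenda, expressed as code.
EXCLUDED_DOMAINS = {"rival-publisher.example", "leaked-archive.example"}

def should_index(url: str) -> bool:
    """Decide whether a fetched page enters the corpus at all."""
    domain = url.split("/")[2]
    return domain not in EXCLUDED_DOMAINS

for url in ("https://openaccess.example/paper1",
            "https://leaked-archive.example/doc9"):
    print(url, "->", "indexed" if should_index(url) else "silently dropped")
```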

Libraries in this sense are not restricted to digitised versions of physical public or private libraries as we know them from history. Commercial search engines, intelligence agencies, and virtually all forms of online text collections can be thought of as libraries.

Acquisition policies figure here on the same level as crawling bots, dragnet and surveillance algorithms, and the arbitrary motivations of users, all of which actuate the selection and embedding of texts into structures that regulate their retrievability and, through access control, produce certain kinds of communities or groups of readers. The author's intention of partaking in this or that discourse is confronted by the discourse-conditioning operations of retrieval algorithms. Hence, Google structures discourse through Google Search differently from how the Internet Archive does with its Wayback Machine, and from how GCHQ does with its dragnet programme.
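The discourse-conditioning effect of retrieval can be seen in miniature by ranking one and the same set of matches under two different policies. The figures and URLs below are invented, and neither policy corresponds to what Google, the Internet Archive, or GCHQ actually run; the point is only that the ordering a reader encounters is decided in the ranking code.

```python
from datetime import datetime

# The same three matching pages, each with a term-frequency score and
# a capture timestamp. The figures are invented for illustration.
matches = [
    {"url": "https://example.org/a", "tf": 12, "captured": datetime(2015, 6, 1)},
    {"url": "https://example.org/b", "tf": 3,  "captured": datetime(2016, 1, 9)},
    {"url": "https://example.org/c", "tf": 7,  "captured": datetime(2014, 3, 2)},
]

# Two retrieval policies over identical holdings: one privileges how
# often the term occurs, the other how recently the page was captured.
by_relevance = sorted(matches, key=lambda m: m["tf"], reverse=True)
by_recency   = sorted(matches, key=lambda m: m["captured"], reverse=True)

print([m["url"] for m in by_relevance])  # a, c, b
print([m["url"] for m in by_recency])    # b, a, c
```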

They are all libraries, each containing a single 'book' whose pages are URLs with timestamps and geostamps in the form of IP addresses. Google, GCHQ, JSTOR, Elsevier – each maintains its own searchable corpus of texts.
As books became more easily mass-produced, commercial subscription libraries catering to the better-off parts of society blossomed. This brought to the fore the class aspect of the nascent demand for public access to books.

The decisions about who is to be admitted, to which sections, and under which conditions are informed by a mix of copyright laws, corporate agendas, management hierarchies, and national security issues. The various sets of these conditions at work in a particular library also redefine the notion of publishing and of the publication, and in turn the notion of the public.

Corporate journal repositories exploit publicly funded research by renting it out only to libraries that can afford it; intelligence agencies are set to extract texts from any moving target, basically any networked device, apparently in the public interest and away from the public eye; publicly funded libraries are prevented by outdated copyright laws and bureaucracy from providing digitised content online; search engines create a sense of giving access to the entire public record online, while only a few know what is excluded and how search results are ordered.

While schematic, scaling from the immediately practical through the strategic and tactical to the reflexive registers of knowledge, there are actual – here unnamed – people and practices we imagine we could be learning from.

It is within and against this milieu that libraries such as the Internet Archive, Wikileaks, Aaaaarg, UbuWeb, Monoskop, Memory of the World, Nettime, TheNextLayer and others gain their political agency. Their counter-techniques for negotiating the publicness of publishing include self-archiving, open access, book liberation, leaking, whistle-blowing, open source search algorithms and so on.

Digitisation and posting texts online are interventions in the procedures that make search possible. Operating online collections of texts is as much about organising texts within libraries as it is about placing them within the 'books' of the web.


Originally written 15-16 June 2015 in Prague, Brno and Vienna for a talk given at the Technopolitics seminar in Vienna on 16 June 2015. Revised 29 December 2015 in Bergen.

Last revision: 1 August 2016