XXIX Annual Charleston Conference
5 Nov. 2009

Timothy J. Dickey, Ph.D.
Post-Doctoral Researcher
OCLC Research

Global Publication Profiles: Books as an Expression of Cultural Diversity

“The limits of my language mean the limits of my world.” – Ludwig Wittgenstein

Overview
Introduction to the Project
Preliminary Results:
•Publication profiles
•Language data
•Translation data
Next steps

Introduction: Genesis and Background of the Project

Genesis and Background of the Project
Interested in measures of cultural diversity:
•UNESCO Institute for Statistics
•IFLA Statistics and Evaluation Committee
•ISO Global Library Statistics Project
•Book and Media Studies School, Leiden
•Index Translationum
•UNESCO Division of Cultural Expressions & Creative Industries
•1932 - present

The research I am sharing today begins with an information need: a number of bodies have found themselves jointly interested in "measures of cultural diversity." The idea here is that "book publishing represents a central kind of cultural heritage." The UNESCO Institute for Statistics has been exploring library statistics for worldwide book consumption, and helped to found the European Expert Meeting on Book and Library Statistics. These bodies, as well as the International Federation of Library Associations, are especially interested in any global patterns in the book publishing world as expressions of cultural diversity and heritage.

[The players include: Mike Heaney, Executive Secretary, Oxford Univ. Library Services / Chair of IFLA Committee; Simon Ellis, Head of Culture/Communications, UNESCO Inst. Stat.
(Montreal); Mauro Rossi, UNESCO Division Cultural Expressions (Paris)]

One expression of this interest is the Index Translationum (IT), published by the UNESCO Division of Cultural Expressions. This print and (since 1979) digital resource tracks the translations of a culture's books into other languages.

Genesis and Background of the Project

So, if you want to know which of Shakespeare’s plays have been translated into Czech or Zulu… HOWEVER, this excellent tool is limited both in scope and by its proprietary data formats: it has no ISBN for linking its data to other bibliographic tools! In addition, neither book publishing organizations, nor national libraries as a whole, nor library statistics agencies have been collecting data on book publishing on a GLOBAL scale. Then one day, Mike Heaney (U Oxford) noticed:

Genesis and Background of the Project

…the OCLC WorldMap. This is a prototype product of OCLC Research; Lynn Connaway, Larry Olszewski, Jeremy Browning, and I produced this as an exercise in data visualization. Using data from WorldCat and other print reference sources, one can compare bibliographic information country by country through the world, including TITLES PUBLISHED, HOLDINGS WORLDWIDE OF THOSE TITLES, LIBRARIES IN THE COUNTRY AND THEIR TYPE, NATIONAL EXPENDITURES ON THOSE LIBRARIES, ARCHIVES, MUSEUMS AND OTHER MEMORY INSTITUTIONS, etc. HERE perhaps is the data that can satisfy the information need:

Genesis and Background of the Project
WorldCat™ and Global Book Data:
•150 million records, 1.48 billion holdings
•Strongest in monographic records
•71,000 libraries
•112 countries
•> 50% non-English cataloging
•> 470 languages
[Statistics from http://www.oclc.org/us/en/worldcat/statistics (Nov. 1, 2009).]

As most of you know, the WorldCat database is an increasingly global and increasingly comprehensive source of bibliographic data, and remains strongest in its data on books.
Its member libraries are located in 112 countries, and the data goes beyond that to include works from those countries which are collected in other OCLC member libraries. It covers more than 470 languages, and language is part of the definition UNESCO and OCLC were thinking of when we developed "Books as an Expression of Cultural Diversity." So WorldCat, though limited by its Anglo-centric roots, may be seen as an ideal source for the very data needed to examine the question at hand.

Project Objectives
•Mine data from WorldCat’s monographic records
•Parse the data:
•Country of publication
•Year of publication
•Language use as a “measure of cultural diversity”
•Seek patterns in the data

The basic objectives of the project, then, are to mine WorldCat's "overwhelmingly" monographic records, to parse the data, and to seek patterns in it.

On the importance of language, there is an axiomatic concept in Cognitive Anthropology that (in the words of Benjamin Lee Whorf), “Language shapes the way we think, and determines what we think about.” In other words, the language(s) spoken by a culture help to determine that culture’s perception of the world, and its expression of itself within that world. Whorf: “We dissect nature along lines laid down by our native language. Language is not simply a reporting device for experience but a defining framework for it.” Edward Sapir: “The worlds in which different societies live are different worlds, not merely the same world with different labels attached.” (Also Franz Boas, with somewhat different causality.) (Noam Chomsky: “Even the interpretation and use of words is a process of free creation.”)

Project Objectives
Data limits:
•Textual material, non-serial
•Publication date numeric, < 2010
Data extracted:
•Published titles
•Holdings – indigenous libraries and worldwide
•Languages
•Translation data

Within this data mining, we set the following limits: Taking primarily textual material means records with leader 06=a and leader 07=a or m.
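The data limits above can be sketched in a few lines of Python. This is only an illustration, not the project's actual extraction code: the function names are hypothetical, records are represented as bare leader and 008 strings rather than full MARC records, and reading the publication year from 008 positions 07-10 (Date 1) is an assumption, since the talk does not say which field supplied the date. The sketch also does not attempt the exclusion of theses and dissertations.

```python
from typing import Optional


def publication_year(field_008: str) -> Optional[int]:
    """Return Date 1 (008/07-10) as an int, or None if it is not fully numeric.

    Assumption: the year is taken from the 008 fixed field. Partially
    known dates such as "19uu" or "19xx" yield None and are left out.
    """
    date1 = field_008[7:11]
    if len(date1) == 4 and date1.isascii() and date1.isdigit():
        return int(date1)
    return None


def in_scope(leader: str, field_008: str, cutoff: int = 2010) -> bool:
    """Apply the stated limits: leader 06 = a, leader 07 = a or m, numeric date < cutoff."""
    if len(leader) < 8:
        return False
    if leader[6] != "a":              # leader 06: language (textual) material
        return False
    if leader[7] not in ("a", "m"):   # leader 07: component part or monograph
        return False
    year = publication_year(field_008)
    return year is not None and year < cutoff


# Hypothetical sample values, for illustration only.
book = "00000nam a2200000 a 4500"
serial = "00000nas a2200000 a 4500"
print(in_scope(book, "091105s1998    gw "))    # textual monograph, numeric date
print(in_scope(serial, "091105s1998    gw "))  # serial, excluded
print(in_scope(book, "091105s19uu    gw "))    # non-numeric date, excluded
```

Keeping the date test separate from the leader test makes it easy to report how many records each limit removes.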
We excluded serials, theses and dissertations, but deliberately cast an otherwise wide net. We excluded works with publication dates of, for instance, “19xx,” since we cannot fold them reliably into the rest of the data. [We also included books whose "date of publication" was as early as 1000 A.D., though these raise some questions.]

Titles - in FRBR terms, we are counting manifestations rather than works, based on the assumption that each new edition (of Shakespeare or the Bhagavad-Gita) is a fresh cultural artefact!

Holdings - both overall holdings worldwide of the nation's cultural heritage, and a measure of how "foreign" libraries collect and value that cultural heritage.

Languages - a reflection of cultural diversity in that it encompasses both "official" and "indigenous" languages, and others not necessarily native.

Translations - includes measures of multiple linguistic content in a work.

Project Objectives
“Pre-Test” of the Procedures
•Profiles of book publishing in six countries:
•Bolivia, Chile, Germany, Poland, South Africa, Thailand
•Expansions after technique refinement:
•China, Colombia, Finland, France, India, Indonesia, Italy, Kenya, Nigeria, Russia, Ukraine, Venezuela
•Australia, Belgium, Denmark, Egypt, Greece, Iran, Iraq, Ireland, Japan, Korea(s), Mexico, Netherlands, New Zealand, Norway, Spain, Sweden, Switzerland

Today's report covers the early stages of data extraction. First, we did a “pre-test,” both on the technical procedures and on the scope and comprehensiveness of the WorldCat data, for 6 countries. The countries for the test deliberately highlight non-English works and non-English cataloging, a mix of continents, a mix of developed and developing nations, and a mix of OCLC representation in the database. The set includes South Africa with its more complex set of "national" languages at play.
We are continuing with the two further groups of countries listed on this slide; to date we have mined some 46 million bibliographic records for these countries, and have processed the great majority of the non-US/UK publishing output reflected in WorldCat. [We are still shying away, however, from the data from the former Czechoslovakia!]

Project Limitations
•“As reflected in WorldCat”
•Collection patterns
•Cataloging practice
•Definition of a book
•Definition of a COUNTRY
•Definition of a publishing date

Everything following is subject to the major caveat that the profile is “AS REFLECTED IN WORLDCAT.” This means that it is subject to what libraries – and specifically libraries that participate in WorldCat – have collected. For some countries in our early data, such as Thailand and Bolivia, very few libraries in the country have been OCLC members, and thus the data may reflect more of what Anglo-American libraries have collected of their publications.

Also, several issues of Anglo-American cataloging practice affect the following profiles. Catalogers may have different, or changing, concepts of what they will code as a “book.” Even more difficult for our purposes is the definition of a country. The German data below reflect catalogers’ judgments of several centuries of shifting boundaries that coalesce into what we in 2009 call “Germany.” [The publication profiles of the current Balkan nations may be forever corrupted by the presence of the former Czechoslovakia!] In addition, cataloging practice in the area of date of publication may vary, especially with re-printings of prominent historical works.

Finally, I should say that I am NOT asserting historical causality in these profiles, merely pointing out how often the book publishing data happen to shift in conjunction with a country’s political history. Subject to that caveat, I think many of the profiles are actually quite revealing.
National Publishing Profiles

Book Publishing Profiles: Publishing Output

Here is the basic dataset for the pre-test, with the caveat that these are the records "as reflected in WorldCat," and "as collected by OCLC member libraries." ["Do I believe that Bolivia has only published 58,000 books? Probably not. But…"] This gives an idea of the richness of the data that can be mined from WorldCat, country by country. Germany, not surprisingly, is the richest subset of these data; its records include, pertinently, several large batchloads in recent years from the Deutsche Nationalbibliothek, the Bayerische Staatsbibliothek, and HEBIS, the consortium of the largest university libraries in the Federal Republic.

Book Publishing Profiles: Publishing Output

… and for a sample of other countries’ data… Note for how many countries we will be able to profile a fairly large set of publications throughout their history. Even in the case of a problematic country like a former Soviet republic, catalogers participating in WorldCat have paid attention to specifying publication in Tbilisi.

Book Publishing Profiles: Publishing Output

Here is a graphical example of the data within a subset - Germany in the 20th and 21st centuries. Note the historical dips in book publication from 1914-1919, and the collapse into 1946! [Again, I am not necessarily implying causality, just pointing out the obvious historical moment reflected in the data.] In the 1990s, after unification, it seems that German book publication itself did not wane, but the merging of libraries and university systems within the unified nation led to fewer copies being held.
Book Publishing Profiles: Publishing Output

In this earlier dataset, it is possible to note reflections of the Napoleonic Wars, the tumultuous effects of the 1848 revolutions (which themselves follow a sharp build-up in publication during the revolutionary period), and the Thirty Years’ War (1618-1648). On the other hand, see the gains in publication during the Bismarck era (1861-1870), and back during the Protestant Reformation (1517ff.). This last could also reflect the historical importance of Reformation materials, which libraries would tend to collect and preserve heavily.

Book Publishing Profiles: Publishing Output

Compare that historical picture to this one from France, noting the incredible spike in the years 1789/1790, and different drops in the year of the 1848 revolution and in 1871, the year the Prussians occupied Paris. [Earlier spike? Founding of the House of Bourbon?]

Book Publishing Profiles: Publishing Output

The 20th-century data from Poland, as might be expected, do show a tremendous slump during the Nazi/Soviet occupation, and a surprising slump in worldwide holdings in the late 1970s, followed by a surge in publishing after the fall of Communism. (NB The national holdings for Poland are in every year consistently lower than the total number of titles, an indicator of sparser data from libraries in-country!)

Book Publishing Profiles: Publishing Output

For a country like South Africa, on the other hand, the WWII slump is far less of an issue. The major turning points in this profile coincide with the declaration of the Republic in 1961, perhaps the international pressure in the 1980s against the regime, and the continuing economic difficulties under ANC rule (1994+).

Book Publishing Profiles: Language Data

Turning to LANGUAGE DATA, here is a sample of published languages from three smaller countries.
Note the presence (in WorldCat) of books published in indigenous languages such as Aymara, Quechua, and Guarani, though the majority in each country is published in the dominant language. (English works in each set could represent either English bias within WorldCat, or a dominant strain of English speaking, as in Chile.)

Book Publishing Profiles: Language Data

For the larger European countries, on the other hand, the book publishing seems to be more Eurocentric, with much less emphasis on minority "native" languages. Note, however, the strong presence of Latin (and, I would add, Greek, Hebrew, and languages such as Low German and Middle High and Old High German) in the data – a concrete reflection of libraries' function as historical memory institutions!

Book Publishing Profiles: Language Data

Here, too, the language data can be parsed out by years, with perhaps interesting nodes around the spike of German-language publications in 1913, and a surprising dip through the 1970s. [NB During the presentation to Members’ Council, several EMEA delegates expressed the need to collaborate with every national librarian to interpret a country’s bibliographic data…]

Book Publishing Profiles: Language Data

But a shape such as this one can be even more telling – Germany’s Latin-language publishing first spiked back in 1517 (Reformation), was generally strong (around 1000 titles/year) up to the 19th century, only to suffer dips (perhaps in national approach to religion??) in the 1930s, and after 1968 (Vatican II).
[No idea about the spike in 1972]

Book Publishing Profiles: Language Data

The data from South African book publishing, on the other hand, appropriately reflect the more complex linguistic heritage of that country - two dominant languages of the ex-colonial masters (and founders, no doubt, of much of the publishing industry) - but a very healthy dose of both translations into, and works original to, a variety of indigenous languages.

Book Publishing Profiles: Translation Data

And turning to the translation data, this can be the heart of the kind of cultural indicator UNESCO values. The data for the more culturally diverse country of South Africa present a much more varied tapestry of translation data, including translations among English, German, and Afrikaans and a number of tribal languages.

Book Publishing Profiles: Translation Data

Note the spurt in translations from Zulu and Xhosa once the languages were decriminalized in 1991.

Book Publishing Profiles: Translation Data

Once again, the data for Poland and Germany reflect a more Eurocentric vision, with translations from a variety of major European languages into German and Polish responsible for the majority of the translations. (But again, translations into English from German and Polish also figure prominently in these WorldCat data; note also the Latin and Greek translations…)

Book Publishing Profiles: Translation Data

Similarly, for Poland, note the changing importance of Russian-Polish translations from 1950 to the present.

Book Publishing Profiles: Translation Data

Similarly, the data on translations for Bolivia, Chile, and Thailand give a picture of both the predominant languages spoken in each country (Spanish and Thai), as well as the interaction of English and other languages with these languages.
Book Publishing Profiles: Translation Data

Here is an example of a lesser-populated dataset.

Book Publishing Profiles: Translation Data

For comparison, Italy’s publishing output is better represented in WorldCat. Note the spike in German translations after unification in the 1990s, and the continuing strength of Latin.

Next Steps
•Final refinements to data extraction
•Integrate OCLC Audience Level
•Complete extraction of “foreign” holdings data
•Check results against Index Translationum
•Go global!

Refining the data extraction - 041$a and 041$h, both repeatable, both potentially containing multiple language codes, and both changing their definition in 1981, have been a challenge! We've solved most of the cases…

Audience level will allow for another (OCLC Research-influenced) measure of the book production of a country.

"Foreign" holdings should provide a different take on cultural production - how often do the books published as a reflection of the country's culture travel to other countries, and end up in the collections of other libraries?

And… We plan to collect similar statistics for every country in the world (as reflected in WorldCat). We will also be in discussion as to potential longitudinal studies, re-checking these global data every few years to seek better data as more libraries join WorldCat, and to seek trends of globalization in book collecting.

Questions?
Timothy J. Dickey, Ph.D.
dickeyt@oclc.org
www.oclc.org/research/projects/globalbooks/default.htm

Special thanks to Jeremy Browning, Lynn Silipigni Connaway, Michael Heaney, and Karen Smith-Yoshimura
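As a note on the 041$a / 041$h handling mentioned under Next Steps, here is a minimal Python sketch of how translation pairs might be derived from one 041 field. It is an illustration under stated assumptions, not the method used in the project: the function names are hypothetical, a field is represented as a list of (subfield code, value) tuples, a subfield is assumed to hold either one three-letter code or several codes run together (for example "engfre"), and every original language is paired with every text language, which is a deliberate simplification.

```python
from itertools import product
from typing import Iterable, List, Tuple


def split_codes(value: str) -> List[str]:
    """Split a 041 subfield value into three-letter language codes.

    Handles a single code ("ger") and codes run together ("engfre").
    Trailing characters that do not form a full code are dropped.
    """
    cleaned = value.strip().lower()
    return [cleaned[i:i + 3] for i in range(0, len(cleaned) - 2, 3)]


def translation_pairs(subfields: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (original language, text language) pairs for one 041 field.

    Assumption: $a carries the language(s) of the text and $h the
    language(s) of the original; pairs of identical codes are skipped.
    """
    text_langs: List[str] = []
    orig_langs: List[str] = []
    for code, value in subfields:
        if code == "a":
            text_langs.extend(split_codes(value))
        elif code == "h":
            orig_langs.extend(split_codes(value))
    return [(o, t) for o, t in product(orig_langs, text_langs) if o != t]


# Hypothetical field contents, for illustration only.
print(translation_pairs([("a", "ger"), ("h", "eng")]))     # repeated-subfield style
print(translation_pairs([("a", "engafr"), ("h", "zul")]))  # codes run together
print(translation_pairs([("a", "pol")]))                   # no $h: no pair recorded
```

Pairs produced this way could then be tallied by country and year, and spot-checked against the Index Translationum as planned.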