A Map for Big Data Research in Digital Humanities

Kaplan, Frédéric

doi:10.3389/fdigh.2015.00001

FIELD GRAND CHALLENGE article

Front. Digit. Humanit., 06 May 2015

Volume 2 - 2015 | https://doi.org/10.3389/fdigh.2015.00001

Frédéric Kaplan*

Digital Humanities Laboratory (DHLAB), École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

This article is an attempt to represent Big Data research in digital humanities as a structured research field. A division in three concentric areas of study is presented. Challenges in the first circle – focusing on the processing and interpretations of large cultural datasets – can be organized linearly following the data processing pipeline. Challenges in the second circle – concerning digital culture at large – can be structured around the different relations linking massive datasets, large communities, collective discourses, global actors, and the software medium. Challenges in the third circle – dealing with the experience of big data – can be described within a continuous space of possible interfaces organized around three poles: immersion, abstraction, and language. By identifying research challenges in all these domains, the article illustrates how this initial cartography could be helpful to organize the exploration of the various dimensions of Big Data Digital Humanities research.

Introduction: Big Data Digital Humanities vs. Small Data Digital Humanities

Defining the nature and the boundaries of digital humanities is a long-discussed and unresolved issue (Terras et al. 2013), not only because there is no consensus on this question but also because digital humanities are currently undergoing a profound transformation that calls for a reconsideration of their fundamental concepts (Gold 2012). For years, digital humanities have loosely grouped computational approaches to humanities research problems and critical reflections on the effects of digital technologies on culture and knowledge (Schreibman et al. 2008). Ten years ago, they emerged as a new label, rebranding and enlarging the idea of “humanities computing” (Svensson 2009). Around this new name and under a “big tent,” a progressively larger community of practice thrived (Terras 2011). Any work at the intersection of Computer Science and the Humanities could potentially be part of this welcoming trend. Researchers gathered in national and international meetings and exchanged their views on blogs and mailing lists. If not a well-bounded field, digital humanities were surely a lively conversation.

The welcoming digital humanities label opened doors, connected separate academic silos, and built bridges between information sciences and the various disciplines loosely forming what is called the humanities. However, openness was always associated with a need for introspection, self-reflexive writing, and tentative definitions of boundaries. The “What are digital humanities” articles and monographs became a genre of their own, structured around several narratives of exclusion and inclusion (Rockwell 2011). Digital humanities as a research domain define themselves dynamically in the negotiation of these tensions, as discussed by several digital humanities scholars (Unsworth 2002; Svensson 2009; Rockwell 2011). Table 1 gives a non-exhaustive list of these structuring tensions.

Table 1. Examples of structuring tensions defining digital humanities.

The starting point of this article is a relatively new structuring tension, opposing Big Data Digital Humanists and Small Data Digital Humanists. Research in Big Data Digital Humanities focuses on large or dense cultural datasets, which call for new processing and interpretation methods. The term Big Data itself has disputed origins (Diebold 2012; Lohr 2013). The Oxford English Dictionary defines it as “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.” In that sense, Big Data are “big” when “manual” analysis becomes cumbersome and new study and interpretation methods must be invented. However, the massiveness of Big Data is not tightly linked to a certain number of terabytes. Boyd and Crawford (2011) note that “Big Data is not notable because of its size, but because of its relationality to other data.” Big Data is “fundamentally networked,” and the challenges in processing it are linked with its interconnected nature. In comparison, the Small Data Digital Humanities group together more focused works that do not use massive data processing methods and that explore other interdisciplinary dimensions linking computer science and humanities research. Compared with Big Data, Small Data is small in the sense that it is not only smaller in scale but also well-bounded.

This article intends to draw a map for Big Data digital humanities showing how it can be organized as a structured field. The ambition of this map is to show that Big Data research in digital humanities can be characterized by common methodologies and objects of study, therefore transcending some of the tensions that have structured digital humanities so far. As it focuses only on research that deals with such a “large body of information” (Katz 2005), this map does not cover the digital humanities domain as a whole. Nevertheless, given the growing importance of massive and networked cultural datasets, it is likely that Big Data digital humanities will become a significant part of the whole digital humanities field. In this context, this map may help institutionalize research and education programs with clearer focuses and objectives.

This article presents Big Data research in digital humanities as three concentric circles (Figure 1). The first circle corresponds to research focusing on processing and interpreting big and networked cultural datasets, the first object of study of this field. Most of the methods needed to study these datasets still need to be invented, as they are currently mastered by neither humanists nor computer scientists. However, it is important to consider that data processing and interpretation occur in the larger context of the new digital culture, characterized by collective discourses, large communities, ubiquitous software, and global IT actors. Understanding the relations between these entities could be considered the second object of study for Big Data Digital Humanities. Finally, the human experience of such datasets through various kinds of interfaces corresponds to a third family of challenges, differing in scope and methodology from the other two. These three areas of study can therefore be represented as three concentric circles, illustrating three levels of contextualization and embodiment of cultural data. In the next sections, we briefly discuss each of the circles in more detail.

Figure 1. The three circles illustrate three levels of contextualization and embodiment of big cultural data. The first circle contains research about large cultural databases and the new kind of understanding these databases enable. The second circle corresponds to research about the interdependency between collective discourse, large-scale communities, mediating software, and global IT actors occurring in the context of what can be broadly called “Digital Culture.” The last circle contains research about new digital experiences, the actualization of big cultural datasets in the physical world. The challenges in each of these areas can in turn be mapped using a linear scale (circle 1), a network of relations (circle 2), and a triangular continuous space (circle 3).

Big Cultural Datasets

Massive cultural digital objects include large-scale corpora such as the millions of books scanned by Google and those produced by numerous other digitization initiatives (Jacquesson 2010), the millions of photos and micro-messages shared on social network services (Thusoo et al. 2010), giant geographical information systems like Google Earth (Butler 2006), and the ever-expanding networks of academic papers citing one another (Shibata et al. 2008). These interconnected objects – either digitally born or reconstructed through digitization pipelines – are too big to be read or watched. The traditional 1:1 ratio of a single scholar confronted with one document cannot cope with such abundance. Moreover, their boundaries are sometimes fuzzy, their content is partially unknown, and they are likely to be in continuous expansion. These characteristics make them profoundly different from the corpora traditionally studied by humanities researchers, despite surface resemblances.

The confrontation with these “massive” objects raises fundamental questions. What can really be extracted from these huge datasets, and what interpretations can be drawn from these extractions? Will we learn more by analyzing 10 million books that we cannot read individually or by reading five carefully (Moretti 2005)? What is the role of algorithms in mining, shaping, and representing these large digital objects?

Some of these challenges can be structured following the specific stages of data processing: digitization, transcription, pattern recognition, simulation and inference, preservation, and curation, as shown in Figure 2 and Table 2 below. Each step in the data processing pipeline can be associated with questions that are both technical and epistemological. Consider the processing pipeline of mass book digitization projects. Physical books must be transformed into images (digitization step) that are then transformed into texts (transcription step), in which various patterns can be detected (pattern recognition step, such as text mining or n-gram approaches) or inferred (simulation step), while being preserved and curated for future research (preservation step). This way of presenting the research challenges emphasizes that data are never given, but taken and transformed (Gitelman 2013). The technical complexity of the pipelines involved clearly demonstrates that, at each step of the data processing, choices are made and biases apply. Understanding these technical choices is crucial to developing new interpretive theories.
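The book-digitization pipeline described above can be sketched in a few lines of code. This is a toy illustration under stated assumptions, not a description of any actual digitization project: the function names (`digitize`, `transcribe`, `recognize_patterns`, `run_pipeline`) and the `Record` type are hypothetical, scanning and OCR are replaced by simple string operations, and the simulation, preservation, and curation steps are left out. The point it tries to make concrete is that each stage records a choice that shapes what later stages can see.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Record:
    """A document moving through the pipeline, plus a log of choices made."""
    content: object
    provenance: list = field(default_factory=list)


def digitize(pages):
    # Stand-in for scanning: a real step would produce images, not strings.
    record = Record(content=list(pages))
    record.provenance.append("digitization: pages kept in reading order")
    return record


def transcribe(record):
    # Stand-in for OCR: lowercasing and dropping punctuation is a lossy choice.
    text = " ".join(record.content).lower()
    record.content = "".join(
        ch for ch in text if ch.isalpha() or ch.isspace()
    )
    record.provenance.append("transcription: lowercased, punctuation dropped")
    return record


def recognize_patterns(record, n=2):
    # Count word n-grams (bigrams by default) in the transcribed text.
    tokens = record.content.split()
    record.content = Counter(zip(*(tokens[i:] for i in range(n))))
    record.provenance.append(f"pattern recognition: word {n}-grams counted")
    return record


def run_pipeline(pages):
    # Chain the three illustrated steps in pipeline order.
    return recognize_patterns(transcribe(digitize(pages)))


if __name__ == "__main__":
    result = run_pipeline(["The cat sat.", "The cat ran!"])
    print(result.content.most_common(1))
    for choice in result.provenance:
        print(choice)
```

Keeping the provenance log next to the data is one possible design for making the pipeline's choices inspectable afterwards; here, for instance, the log shows that case and punctuation were discarded before any n-gram was counted.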

Figure 2. Challenges can be structured following the data processing pipeline. At each step, technical challenges are met and choices are made.

Table 2. Challenges in circle 1.

Digital Culture

We have discussed the relationship between data processing pipelines and large cultural datasets. However, data processing and interpretation happen in a larger context, which we may call Digital Culture. The study of this larger context can be considered the second object of study for digital humanities research. One way to structure this domain is to situate the relation between software and data (the focus of the first circle) within a network of relations between new entities, including large-scale communities (MOOC classrooms, Wikipedia contributors, etc.), collective discourses (blogs, data journalism, wiki-style collaborative writing), the ubiquitous software medium (auto-completion algorithms, search engines), and global actors (Google, Facebook, GLAM, universities).

Consider the millions of photos shared every hour on Facebook (Huang et al. 2013). In this case, large-scale communities produce both the massive digital objects and the collective discourses about these objects. They do so through the mediation of algorithms produced by one giant IT company of the web. In turn, collective discourses about the photos shape the emergence and structuring of these communities. In addition, as collective discourses rapidly reach a critical mass (e.g., millions of messages or status updates), they tend to become massive digital objects themselves, to be archived and studied through specific text and data mining approaches. Understanding photo sharing implies understanding the complexity of this network of interactions.
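The photo-sharing example can be written down as a small directed graph, which makes its feedback loop explicit. The sketch below is only one possible reading of the paragraph above: the relation labels and the helper names (`RELATIONS`, `build_graph`, `reachable`) are assumptions introduced for illustration, and the mediation relation is simplified to a single edge.

```python
from collections import defaultdict

# (source entity, relation, target entity), read off the photo-sharing example.
# The labels are illustrative simplifications, not an established vocabulary.
RELATIONS = [
    ("large-scale communities", "produce", "massive digital objects"),
    ("large-scale communities", "produce", "collective discourses"),
    ("global actors", "produce", "software medium"),
    ("software medium", "mediates production by", "large-scale communities"),
    ("collective discourses", "shape", "large-scale communities"),
    ("collective discourses", "become", "massive digital objects"),
]


def build_graph(relations):
    """Group outgoing (relation, target) pairs by source entity."""
    graph = defaultdict(list)
    for source, relation, target in relations:
        graph[source].append((relation, target))
    return dict(graph)


def reachable(graph, start):
    """Return the entities that can be reached from `start` by following relations."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


if __name__ == "__main__":
    graph = build_graph(RELATIONS)
    # Discourses shape communities, which produce discourses: a feedback loop.
    print("collective discourses" in reachable(graph, "collective discourses"))
    print(sorted(reachable(graph, "global actors")))
```

In this reading, collective discourses can reach themselves through the communities they shape, while massive digital objects have no outgoing relation until the processing pipeline of the first circle is added to the graph.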

More generally, research about digital culture can be segmented into subdomains corresponding to groups of relations between some of the entities we have been discussing. This structure, summarized in Table 3 and Figure 3, identifies five domains: the processing domain, the discursive domain, the social shaping domain, the algorithmic mediation domain, and the control domain. This grouping articulates differently the relations of Big Data Digital Humanities with traditional humanities and social sciences disciplines: it does not follow a division into digital history, digital sociology, etc., but proposes a new segmentation of domains.

Table 3. Challenges in circle 2.

Figure 3. One way of mapping research about Digital Culture is to consider the relationships between big cultural datasets, the software medium, collective discourses, large-scale communities, and global actors. Five domains can be identified: the processing domain (already discussed), the discursive domain, the social shaping domain, the algorithmic mediation domain, and the control domain. The study of these domains offers an alternative segmentation of the research area, not linked with traditional disciplines.

Digital Experiences

Big cultural data, and digital culture at large, are experienced in the real world through physical interfaces, websites, and installations. They produce “experiences.” This third circle is an area of study in its own right.

Some interfaces are essentially immersive, in the sense that they try to project the user into full-fledged environments (e.g., 3D virtual worlds). Others provide users with synthetic data representations (e.g., network visualizations). Finally, some interfaces are essentially linguistic, allowing users to browse data via linguistic inputs (e.g., search engines). We can represent the space of possible interfaces as a triangle organized around these three vertices (Figure 4). Conversational agents (e.g., Siri) lie between the immersive and linguistic vertices. Word clouds lie between the abstract and linguistic vertices. GIS interfaces can be sorted from the most abstract (Google Maps, OpenStreetMap) to the most immersive (Google Street View). Augmented reality interfaces combine immersive, abstract, and linguistic dimensions. Each dimension of the interface space is associated with specific challenges, some of which are summarized in Table 4.
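One way to make this triangular space concrete is to describe each interface by three non-negative weights (immersion, abstraction, language) and to place it inside the triangle as the weighted average of the three vertices, that is, with barycentric coordinates. The sketch below assumes an equilateral triangle and uses illustrative weights only: the source says that conversational agents lie "between" two vertices, and treating that as an exact midpoint, like every other number here, is an assumption. The names `VERTICES`, `to_plane`, and `EXAMPLES` are hypothetical.

```python
import math

# Vertices of an equilateral triangle in the plane, one per pole.
VERTICES = {
    "immersion": (0.0, 0.0),
    "abstraction": (1.0, 0.0),
    "language": (0.5, math.sqrt(3) / 2),
}


def to_plane(immersion, abstraction, language):
    """Map three non-negative weights to a point inside the triangle."""
    weights = {
        "immersion": immersion,
        "abstraction": abstraction,
        "language": language,
    }
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    # Normalize so the weights sum to 1, then average the vertices.
    x = sum(w / total * VERTICES[name][0] for name, w in weights.items())
    y = sum(w / total * VERTICES[name][1] for name, w in weights.items())
    return x, y


# Illustrative placements only; the weights are assumptions, not measurements.
EXAMPLES = {
    "3D virtual world": (1, 0, 0),
    "network visualization": (0, 1, 0),
    "search engine": (0, 0, 1),
    "conversational agent": (1, 0, 1),
    "word cloud": (0, 1, 1),
    "augmented reality": (1, 1, 1),
}

if __name__ == "__main__":
    for name, weights in EXAMPLES.items():
        x, y = to_plane(*weights)
        print(f"{name}: ({x:.3f}, {y:.3f})")
```

Because the weights are normalized, only their proportions matter: an interface weighted (2, 0, 2) lands at the same point as one weighted (1, 0, 1).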

Figure 4. Inspired by Scott McCloud’s triangle typology (McCloud 1994), this triangle organizes the different forms of interfaces explored by Digital Humanities researchers and Digital Culture at large along three dimensions: immersive, linguistic, and abstract.

Table 4. Challenges in circle 3.

Conclusion

Big Data research in digital humanities is becoming a well-structured field with specific objects of study. In this article, we identified three concentric areas of study and discussed how challenges in each area could be mapped. We illustrated how challenges focusing on the processing and interpretation of large cultural datasets can be organized linearly following the data processing pipeline, how challenges concerning digital culture at large can be structured around a network of relations between the new entities that emerged with the digital revolution, and finally, how challenges dealing with the experience of digital data can be described using the continuous space of possible interfaces. There are surely other ways of mapping this emerging field, and the suggested structure could certainly be refined and amended. However, we hope that this initial cartography will help pave the road ahead, acting as an invitation to explore further the idea of Big Data Digital Humanities as a structured field.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Antonacopoulos, Apostolos, and Downton, Andy C. 2007. Special issue on the analysis of historical documents. International Journal of Document Analysis and Recognition (IJDAR) 9:75–7. doi: 10.1007/s10032-007-0045-1

Battelle, John. 2005. The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture. New York, NY: Portfolio.