A map tells a story: it gives a version of reality at a specific time and place, and contributes some information to that reality. More often than not, that information is quantitative in nature: it's about the number of people who do this or are that, or the volume of X or the amount of Y. This type of map helps us answer questions about the world, make comparisons between the characteristics of different places, and otherwise situate data within a place. For example, by collecting data on where readers of this post are located, we could make a map of readership and compare domestic and international, east coast and west coast, or urban and rural readers. This map would tell us something about you, the reader, but wouldn't give us any qualitative data or tell us why you are reading this article. To gather that data, we could ask you to fill out a survey or leave a comment.
For the most part, these data types are kept separate, relegated to their respective domains, and used for different purposes. The quantitative tells us something about the world, addressing "how much," "how many," "when," and "where," and allows us to make beautiful thematic maps about people and places. But qualitative data contextualizes those answers by addressing "why" and "how." Qualitative data without quantitative is abstract and theoretical. Quantitative without qualitative tells an incomplete story about the world.
Attempts to use both types usually result in one of two approaches: either (a) anecdotes are used to contextualize quantitative data in an otherwise completely quantitative project, or (b) numbers are offered as evidence for explanations of the why and how. These are excellent rhetorical strategies, but in both cases, there is a primary project that is supplemented with secondary data.
This separation not only appears in research projects, but is amplified when creating visualizations such as maps and graphs. Except in the cases described above, quantitative and qualitative data are typically kept separate - relegated to different types of visualization techniques, and different types of maps.
There are plenty of guides on how to visualize quantitative data, including the work of Tufte, Few, and Cairo. There are fewer guides for visualizing qualitative data, though Henderson and Segal, and Miles and Huberman, both offer suggestions. But these deal with one data type at a time.
Combining the data types without succumbing to awkward strategies to transform one into the other is a more challenging task. Knigge and Cope leverage a grounded theory, feminist approach to spatial visualization that proposes using mixed data from the outset, from project design to analysis to visualization. Drawing connections between ethnographic research methods and the iterative nature of visualization and design work, Knigge and Cope illustrate how qualitative and quantitative data can build on each other.
In our work on Urban Language Ecologies, we use a version of their methods for our Language Topographies project. We will explore how the two data types are essential for that project, and identify some of the reasons that combining them would be inappropriate in the Beyond the Census: Languages of Queens and Beyond the Enclave: Situating Language in Place projects. Finally, we will illustrate how the data types affect each other, and how that facilitates and inhibits their combination.
All of these projects are about people speaking languages in New York City. Language in particular has both quantitative (i.e., how many people speak a given language in a given area) and qualitative (i.e., the linguistic history of a person or group) dimensions. Understanding and working with the quantitative side of language data has some notorious problems. First, what counts as a language versus a dialect is a hotly contested issue with no clear answer beyond "how speakers identify themselves." And since there are not any homogenous speaker communities, that definition leads to as many disagreements as it resolves.
The second is that language questionnaires (such as the census) often assume bilingualism (English + Other Language) rather than multilingualism, which often more accurately reflects individuals' linguistic repertoire. Someone who speaks K'iche' (Guatemala), Spanish, and English will likely identify as someone who "speaks Spanish at home" in a language questionnaire rather than someone who "speaks K'iche' with most family, Spanish or Spanish and English with friends, and almost only English at work." Another example of this is how we talk about language. Someone who was born into a multilingual household and has moved a few times in their life may have a longer answer to the question "What is your native language?" than someone born into a linguistically homogeneous household and community.
These issues can be addressed with qualitative approaches to data, where individual stories can be told. But this also poses practical problems, as it obscures the linguistic diversity of a city. We can say that a city is linguistically diverse, but it is a hard statement to defend without demographic data about its inhabitants.
To reconcile this conflict and give the most accurate picture of the linguistic diversity of New York City, we have combined the quantitative and qualitative datasets where possible. In so doing, we experienced both successes and challenges, detailed here in a move to better understand the contexts in which this is not only appropriate, but necessary.
The Language Topographies project seeks to understand the relationship between spaces where a language is used and maintenance of that language in an urban setting. This project combines (quantitative) census data with (qualitative) information about language practices in various spaces in New York City. Preliminary work has been done to ascertain the value and strength of each place to the community. The remainder of this data will be collected through interviews with community members.
In this case, the quantitative census data can be combined with the qualitative location data points in part because they are about different entities: people and place. Both people and place are multidimensional: people have many different parts of their identities and places have many different types of uses. Because we are thinking about people and places, there is no confusion over measurement and visualization. That is, qualitative strategies can be used for the locations, and quantitative for the people. Using different visual tools highlights the fact that they are different data, and counting different things.
Critically, they share one dimension: language. If we think about each institution as having a collection of features such as [+permanent], [+public], [+Armenian speaking], and each person as having a collection of features such as [+female], [+age 40-50], [+Armenian speaking], we can align these very different data types on the features that they share in order to draw conclusions about the relationship between them.
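To make the idea of aligning on shared features concrete, here is a minimal sketch in Python. It is only an illustration, not the project's actual data model: the site names, respondent names, and feature sets are hypothetical, and each entity is reduced to a plain set of features.

```python
# Hypothetical feature sets for illustration only; these are not
# records from the Language Topographies data.
sites = {
    "community_center": {"permanent", "public", "Armenian speaking"},
    "weekend_market": {"public", "Spanish speaking"},
}
people = {
    "respondent_1": {"female", "age 40-50", "Armenian speaking"},
    "respondent_2": {"male", "age 20-30", "Spanish speaking"},
}


def shared_features(sites, people):
    """Pair every site with every person and keep the features they share."""
    links = {}
    for site_name, site_features in sites.items():
        for person_name, person_features in people.items():
            common = site_features & person_features  # set intersection
            if common:
                links[(site_name, person_name)] = common
    return links


links = shared_features(sites, people)
for (site_name, person_name), common in sorted(links.items()):
    print(site_name, person_name, sorted(common))
```

In this toy example the only features that sites and people have in common are language features, which mirrors the point that language is the one dimension the two kinds of data share.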
Secondly, the relationship between [+Armenian speaking] for both the site and person has a story behind it. That story is essential to this project because it situates the role of the institution in the identity of its speakers, and helps us understand how the speakers value the sites.
The interviews use a grounded theory methodology, which may affect the strength of the sites of language use. Furthermore, as we discover more sites, it may shift the initial hypotheses. In this case, the project is built around the interaction between the qualitative and the quantitative data. That is, the abstract question we are asking is "how do individual sites affect the population, and how do individuals in the population value these sites?"
In this way, we were not seeking to combine different data types; rather, the question we are asking about the world is a question of how the qualitative affects the quantitative. Presenting it in this way invites an iterative approach to the data: as we gather more, the project itself changes and evolves.
This approach was, of course, not without its challenges, and combining qualitative and quantitative may not always work. The Beyond the Enclave, Beyond the Census, and Language Topographies projects can all be described as asking how to tell a story about the diversity of languages in New York City while still representing general trends.
All three projects use qualitative datasets, but only Language Topographies uses quantitative. As the name says, Beyond the Census: Languages of Queens does not include any census data. This project was intended to visualize the stories collected by the Endangered Language Alliance (ELA). The ELA is a small non-profit that has been involved in language documentation and education programs in New York City. They have become the de facto center for knowledge about smaller and less commonly researched languages in New York City. At the beginning of this project, they provided us with a spreadsheet of languages that they knew were spoken in Queens but not captured by the census. This data set is incredibly rich in its anecdotes and as a description of language in Queens, though it makes no claim to be a comprehensive list of languages spoken in Queens.
At many junctures, we wanted to combine this dataset with the census data about languages in Queens. We had hoped that by combining this point data (which sometimes represents an individual, sometimes a family, and sometimes whole communities of speakers) with census data, we could contextualize these languages with respect to larger, more widespread languages in the neighborhood. This presented a variety of challenges that prevented us from combining the two data types.
We came up against four primary problems that derive from measurement, relevance, incompleteness, and categorization.
The first is that when we add quantitative data to a qualitative map, the details that qualitative data highlights are necessarily obscured in a tabulated, quantitative dataset. One link between the languages the ELA recorded and the languages recorded by the census is through their lingua francas. Because many of the ELA languages are very small, they are associated with a larger language such as Spanish or Bahasa Indonesia. This connection is relatively straightforward and uncontroversial in the case of languages such as K'iche' or Mixteco, which are confined to Spanish-speaking countries such as Guatemala and Mexico. It is not possible for a language such as Garifuna, whose speakers originate from both Honduras and Belize (Spanish-speaking and English-speaking, respectively).
The second is that combining general quantitative data with specific qualitative data does not necessarily mean that the two datasets relate to each other. For example, we initially thought we could add a layer illustrating what percent of each neighborhood speaks a language other than English (LOTE) at home to illustrate the linguistic diversity that the individual languages exist in. However, in a New York City context, a measurement of LOTE speakers is in large part a measurement of how many people speak Spanish or Chinese.
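The relevance problem can be illustrated with a small calculation. The sketch below is a hedged example, not the project's analysis: the function name `lote_breakdown`, the neighborhood, and every speaker count are invented for illustration. It computes the LOTE share of a neighborhood and then how much of that share comes from the two dominant languages.

```python
def lote_breakdown(speakers_by_language, dominant=("Spanish", "Chinese")):
    """Return (LOTE share of all speakers, share of LOTE speakers who
    speak one of the dominant languages)."""
    total = sum(speakers_by_language.values())
    lote = total - speakers_by_language.get("English", 0)
    if total == 0 or lote == 0:
        return 0.0, 0.0  # nothing to measure in an empty or English-only area
    dominant_count = sum(speakers_by_language.get(name, 0) for name in dominant)
    return lote / total, dominant_count / lote


# Invented counts for a single hypothetical neighborhood.
counts = {
    "English": 400,
    "Spanish": 350,
    "Chinese": 200,
    "K'iche'": 30,
    "Garifuna": 20,
}
lote_share, dominant_share = lote_breakdown(counts)
print(f"LOTE share: {lote_share:.0%}, of which dominant languages: {dominant_share:.0%}")
```

With numbers shaped like these, a high LOTE percentage says little about the smaller languages: most of the measure is accounted for by Spanish and Chinese, which is why such a layer would not contextualize the individual languages.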
The third problem is one of completeness. Another approach could be to tabulate languages by region, counting how many speakers there are of Asian, African, American, or other very broad groups of languages. The problem here is that the tabulation itself invites skewed groupings. If we set aside the problem of what to do about languages that are spoken across the globe, we still have a problem of scale: while we know about many of the Asian language speakers, we know very little about the African language speakers. Without a complete dataset, this type of tabulation is inherently skewed.
Finally, there is an inherent problem of categorization in that qualitative and quantitative datasets are often categorized differently. In this instance, it emerges from how the census collects data versus how the ELA does. We could have taken the census data and counted how many different languages were represented in each tract to measure the linguistic diversity of each area. Because the census must draw arbitrary boundaries, it also includes nine 'Other' categories. Since there is no way to tell whether 25 'Other Indic Language' speakers all speak the same language or 25 different languages, this type of grouping cannot be compared with a qualitative dataset that focuses on those 'other' languages. In almost all quantitative datasets, there will be some data that simply does not fit into the categories, and this is where qualitative datasets excel, since they can capture the categorical exceptions.
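The ambiguity introduced by the 'Other' categories can be sketched as a range rather than a single count. The example below is illustrative only: the function name `language_count_bounds` and the tract figures are hypothetical (apart from the 25 'Other Indic Language' speakers used in the text), and it assumes that 'Other' categories can be recognized by their label.

```python
def language_count_bounds(tract_counts):
    """Return (lower, upper) bounds on the number of distinct languages
    in a tract, given speaker counts per census category."""
    named = [c for c, n in tract_counts.items()
             if n > 0 and not c.startswith("Other")]
    other = {c: n for c, n in tract_counts.items()
             if n > 0 and c.startswith("Other")}
    # Each non-empty 'Other' category hides at least one language...
    lower = len(named) + len(other)
    # ...and at most one language per speaker counted in it (a loose bound).
    upper = len(named) + sum(other.values())
    return lower, upper


# Hypothetical tract; only the 'Other Indic Language' count comes from the text.
tract = {"Spanish": 1200, "Bengali": 300, "Other Indic Language": 25}
lower, upper = language_count_bounds(tract)
print(f"Between {lower} and {upper} distinct languages")
```

For this tract the count of distinct languages could be anywhere from 3 to 27, a range too wide to compare meaningfully against a dataset that names each of those 'other' languages individually.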
This list of problems is obviously not comprehensive, but begins to address why and how combining quantitative and qualitative datasets into any visualization is a challenge.
Many datasets have both a quantitative and qualitative component to them. Any demographic information describing people in a place categorizes features about those people such as age, gender, occupation, race or ethnicity into pre-defined categories. In creating the categories, artificial boundaries are drawn to quantify human characteristics. These boundaries allow one story to be told about the inhabitants of a city, but the grey areas are ignored. Qualitative datasets suffer from the opposite problem: in retaining the details and drawing our attention to the grey areas and individual experiences, they can make it impossible to draw larger generalizations or speak to trends among the population.
Authors, designers, and visualization specialists often illustrate a quantitative story with qualitative examples, an approach that still keeps the datasets separate. Building on Knigge and Cope's work, the Language Topographies project seeks to integrate the qualitative into the quantitative to better understand how they relate. Through this, we hope to tell a multidimensional story about language in New York City, and highlight features that may otherwise be lost to the data.
Project Team: Laura Kurgan, Michelle McSweeney, Dare Brawley, Tola Oniyangi, and Carsten Rodin