Ten million words!

This week, Crow researchers finalized the addition of 1,174 texts from Northern Arizona University to the Crow corpus. We’re thrilled to say this means we’ve hit two milestones:

Ten million words and ten thousand texts! To be exact, 10,155,120 words and 10,905 texts.

Why is this important? As we’ve previously shared, Crow members are constantly trying to improve the code used to process new files into the Crow dataset, and the addition of the NAU texts was another opportunity to improve our scripts and documentation. With the help of Adriana Picoral, Aleksey Novikov and Larissa Goulart adapted the scripts we use to add demographic headers and de-identify texts so that they handle the original NAU file structure, which Shelley Staples and Randi Reppen created in 2013–2014.

The NAU files were collected from English Composition classes taught between 2009 and 2012 to both L1 and L2 English students. Therefore, with the addition of these files, Crow now has L1 English assignments that can be explored through the Crow interface.

From a corpus linguistic perspective, this also means that Crow now contains a larger set of examples for identifying patterns of learner language use. This is especially important for the study of word combinations, such as collocations and lexical bundles, because these combinations are identified based on how frequently they occur.
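To make the frequency point concrete, here is a minimal sketch of how recurring word sequences can be counted. It is an illustration only, not Crow's own method: the function name, the four-word window, and the threshold are assumptions, and published bundle studies usually also require a sequence to appear across several different texts.

```python
from collections import Counter

def lexical_bundles(texts, n=4, min_freq=2):
    """Return n-word sequences that occur at least min_freq times across texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        # Slide a window of n words across the text and count each sequence.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {" ".join(gram): freq for gram, freq in counts.items() if freq >= min_freq}

# Two invented sentences standing in for corpus texts.
sample = [
    "on the other hand the results were clear",
    "on the other hand we disagree",
]
print(lexical_bundles(sample))  # {'on the other hand': 2}
```

With only a handful of texts, almost nothing clears the threshold. A larger corpus gives more sequences the chance to recur often enough to be counted.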

Of course, this process wasn’t simple: each of the 1,174 texts had to be organized by course, assignment, first language (L1), and other metadata represented through shortcodes in each text’s filename—all part of Crow’s existing corpus design.
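As a rough illustration of what filename shortcodes make possible, the sketch below splits an underscore-delimited filename into a metadata record. The field order, the delimiter, and the example codes are all hypothetical. Crow's actual naming scheme may differ.

```python
from pathlib import Path

# Hypothetical field order; Crow's real filename scheme may differ.
FIELDS = ("course", "assignment", "l1", "institution")

def parse_filename(name):
    """Split an underscore-delimited filename into a metadata dict."""
    parts = Path(name).stem.split("_")
    if len(parts) != len(FIELDS):
        # Fail loudly so a misnamed file is caught before import.
        raise ValueError(f"{name}: expected {len(FIELDS)} shortcodes, got {len(parts)}")
    return dict(zip(FIELDS, parts))

# Invented example filename, not a real Crow file.
print(parse_filename("105_NR_ENG_NAU.txt"))
```

Rejecting filenames with the wrong number of shortcodes is one simple way to catch organizational mistakes early.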

Subsequent steps in the preparation process were streamlined through automation tools the Crow team has developed. These tools bulk convert files to plaintext format and remove non-ASCII characters, assist in de-identifying personal information, and represent metadata in a machine-readable document header format. (These tools are open-source and available, and documenting how to use them is part of our ACLS-supported outreach work.)
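Two of these steps can be sketched in a few lines of Python. This is not the Crow team's code: the function names and the `<Key: value>` header layout are assumptions made for illustration, and de-identification is left out because it needs human review.

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")

def add_header(text, metadata):
    """Prepend metadata as <Key: value> lines (a hypothetical header layout)."""
    header = "\n".join(f"<{key}: {value}>" for key, value in metadata.items())
    return f"{header}\n\n{text}"

# Invented sample text and metadata values.
raw = "My essay about caf\u00e9 culture."
clean = strip_non_ascii(raw)
print(add_header(clean, {"Course": "105", "L1": "ENG"}))
```

Note that simply dropping characters is lossy (here "café" becomes "caf"), so a production pipeline might transliterate characters before removing what remains.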

Integrating the NAU texts alongside those from Purdue and Arizona also allowed us to navigate a common corpus-building challenge when materials are heterogeneously sourced: divergent metadata.

In particular, the NAU texts present information not yet represented in the other institutions’ texts (students’ L1) but simultaneously omit metadata for standardized test scores, college and program information, and gender identification.

Put another way, the Crow dataset is evolving further into a corpus made up of multiple subcorpora.

So we had to take extra care to confirm that differences in the metadata were genuine, rather than a result of miscategorization or human or machine error. We took this opportunity to build better auditing tools: we added a “dry run” of the import into our online database, which reports what new metadata would be added and how many new texts omit metadata:

[Screengrab: a “dry run” of text processing with the Crow corpus processing software. The program runs at the command line and ends with a screen reporting the database changes and the number of texts to be added to the corpus.]

From this report we could easily tick our acceptance criteria checkboxes (“Yes, we expect all 1,174 new texts not to have gender data”; “Yes, we expect a new category of L1 to be added”) before performing any database changes.
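The idea behind such a report can be sketched as follows. This is a simplified illustration under stated assumptions, not the Crow import software: the function name, the record layout (plain dictionaries), and the sample values are hypothetical, and nothing is written to any database.

```python
def dry_run(existing, incoming, fields):
    """Report new metadata values and missing-field counts without writing anything."""
    report = {}
    for field in fields:
        # Values already present in the database for this field.
        known = {rec[field] for rec in existing if rec.get(field)}
        # Values that appear only in the incoming texts.
        new_values = {rec[field] for rec in incoming if rec.get(field)} - known
        # Incoming texts that lack this field entirely (or have it empty).
        missing = sum(1 for rec in incoming if not rec.get(field))
        report[field] = {"new_values": sorted(new_values), "missing": missing}
    return report

# Invented records standing in for existing and incoming texts.
existing = [{"gender": "F", "institution": "Purdue"}]
incoming = [{"l1": "ENG", "institution": "NAU"}, {"l1": "CHI", "institution": "NAU"}]
for field, summary in dry_run(existing, incoming, ["gender", "l1", "institution"]).items():
    print(field, summary)
```

Each line of the output corresponds to one acceptance criterion: a count of incoming texts without gender data, and a list of L1 values that would be new to the database.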

With the up-front work of standardizing the NAU texts to match Crow’s corpus design conventions, the final step of making those texts visible and searchable in our online interface was a (relative) snap. The consistent, machine-readable nature of the corpus records meant everything “just worked”!

Are you interested in using the Crow corpus for your research? Let us know!

Thank you to Larissa Goulart, Aleksey Novikov, Randi Reppen, Shelley Staples, Adriana Picoral, and Mark Fullmer for helping us reach this important milestone, and to Larissa, Shelley, Mark, and Bradley Dilger for this writeup.