Recent corpus development

Building a corpus isn’t just a matter of collecting texts in a directory on a server. Crow team members are continually improving the code we use to process and de-identify contributed texts, building documentation to describe our approaches, and hosting workshops to help team members at our Arizona and Purdue sites become better corpus builders.

During Summer 2019, the Arizona team, led by Adriana Picoral, improved our existing corpus building scripts, and in some cases rewrote them to add headers, organize metadata, and perform de-identification (de-id) using Pandas (a package in Python that allows users to manipulate data). De-identifying texts is necessary to ensure our participants’ privacy, and our process includes both machine- and human-performed removal of names and other potentially identifying information. This identifiable information is replaced in the text by tags such as <name>. Based on these changes, Picoral and Aleksey Novikov have added documentation on running scripts using both Windows and Mac OS platforms.

After each new script was ready, Picoral led a series of workshops helping the Arizona Crow team learn how to run these scripts, with Purdue researchers joining remotely. Most of the participants had not used the command line before, so that was an enriching experience. The process of running these scripts on different computers and platforms also helped us identify and troubleshoot various issues, which in turn, helped us update our documentation.

Through an iterative process of randomly selecting data for manual de-identification and logging issues that Crow researchers discovered as they de-identified texts, different regular expression patterns were added to the de-id scripts to remove as many student and instructor names as possible. Regular expressions are special combinations of wildcards and other characters which perform the sophisticated matching we need to accurately de-identify texts with automated processes. We decided to flatten all the diacritics with the cleaning script because it was easier to work with names that had been standardized to a smaller character set.

During the Fall semester, we have continued to improve our de-identification processes, using an interactive tool, also developed by Picoral. The tool highlights capitalized words in each text, making it easier to spot check for names that were not caught by the de-id script, such as students’ friends or family members whose names are not automatically included in our process. Each file from Fall 2017 and Spring 2018 was manually checked by the Arizona team that included Kevin Sanchez, Jhonatan Muñoz, Picoral, and Novikov. All in all, we processed 1,547 files, spending an average of 1.5 minutes checking each file.

Because we’ve developed it as a team, the de-identification tool is user-centered, allowing Crow researchers to more quickly and effectively find and redact names and other identifiable information. In the Crow corpus, these identifiers are replaced with tokens like <name>, <place>, and the like.

To increase the quality of the de-identification for previously processed data, both the Arizona team, led by Novikov, and the Purdue team, led by Ge Lan, performed additional searches on files that were already a part of the Crow corpus, using regular expressions with potential alterations in the instructors’ names. Running an additional script removed all the standalone lines which contained just the <name> tags and no other text. These files were updated in the corpus during our October 2019 interface update.

Screenshot of Crow de-identification tool showing a segment of text with de-identification targets highlighted and options for tokens to replace them visible.
De-identification tool developed by Crow researcher Adriana Picoral

Our next steps include replicating the processes for metadata processing and adding headers to text files at Purdue, which has a slightly different metadata acquisition process compared to Arizona given differences in the demographic data recorded by both institutions. We will also continue improving the interactive de-identification tool, so that it can eventually be released to a broader audience. Sharing our work in this manner not only helps other corpus builders, but gives us other sources of feedback which can help us keep building better tools for writing research.