The Crow corpus contains thousands of undergraduate writing assignments written by real students. When these students agreed to have their writing become part of our platform, they did so on the understanding that all of their identifying information would be removed before their writing became public.
To ensure that all traces of the writer’s identity (as well as the identities of their teachers, classmates, etc.) have been removed, we go through each text and replace names, places, course names, positions, and more with placeholder tokens. To make this process less mind-numbing, our wonderful developers have devised a series of tools that automate parts of it.
First, Crow developers wrote a script that automatically deletes the header of each document, since the header almost always contains the student’s name, their teacher’s name, and other identifying information.
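To give a rough idea of what header removal can look like, here is a minimal Python sketch. It is not the actual Crow script: the function name `strip_header` is hypothetical, and it assumes the header is simply the first block of lines, ending at the first blank line. Real student documents may need a more careful rule.

```python
def strip_header(text: str) -> str:
    """Drop the first block of lines, up to and including the first blank line.

    Assumption (not from the Crow tooling): the header is the opening block
    of the document and is separated from the body by a blank line.
    """
    lines = text.lstrip("\n").splitlines()
    for i, line in enumerate(lines):
        if not line.strip():
            # Found the blank line that ends the header; keep what follows,
            # skipping any extra blank lines before the body starts.
            rest = lines[i + 1:]
            while rest and not rest[0].strip():
                rest.pop(0)
            return "\n".join(rest)
    # No blank line at all: return the text unchanged rather than delete it.
    return text


sample = "Jane Doe\nProfessor Smith\nENGL 101\n\nMy essay begins here.\nSecond line."
print(strip_header(sample))
```

Returning the text unchanged when no blank line is found is a deliberately cautious choice in this sketch: a reviewer can still remove a header by hand, but a deleted essay cannot be recovered.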
Next, a second tool automatically highlights capitalized words within the document because they are likely to be identifying proper nouns. This allows the reviewer to spot them quickly and determine whether they need to be replaced. While it’s essential to get rid of identifying information, it’s also important to keep as much detail as possible and avoid detracting from the writer’s original message and intention. This can sometimes be a balancing act for the reviewer.
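The highlighting step can be sketched in a few lines of Python as well. Again, this is only an illustration under our own assumptions, not the Crow tool itself: the function name `highlight_capitalized` and the `[[ ]]` markers are hypothetical, and the pattern only covers unaccented Latin letters.

```python
import re

# A word that starts with an uppercase letter (assumption: ASCII letters only).
CAPITALIZED = re.compile(r"\b[A-Z][A-Za-z]*\b")


def highlight_capitalized(text: str, left: str = "[[", right: str = "]]") -> str:
    """Wrap every capitalized word in markers so a reviewer can spot it."""
    return CAPITALIZED.sub(lambda m: f"{left}{m.group(0)}{right}", text)


print(highlight_capitalized("My teacher at Purdue is kind."))
# prints: [[My]] teacher at [[Purdue]] is kind.
```

Note that this sketch also flags ordinary sentence-initial words such as "My". That is intentional here: the tool only draws attention to candidates, and the reviewer still decides what actually needs a placeholder.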
Cut to 2022, and nearly four years of undergraduate writing (over 4,500 documents!) have piled up un-de-identified. At Purdue, we 2022 undergraduate researchers now have more than our fair share of undergrad writing to read and de-identify. We’ve been steadily working through the files with guidance from others on the Crow team. Even with the tools we’ve made, de-identification remains labor-intensive, so we’re grateful for the grant support that has made our current work possible.
In our next post, we’ll detail how our team has been improving the de-identification tool to make it more efficient and better at retaining writers’ intentions.