De-identifying texts for PSLW
Finals week had just begun here at Purdue when the Crow team gathered in Heavilon Hall to kick off our summer projects. We met for some early-morning sweets and some much-needed coffee to get our brains working before diving into our work. The team was assigned various tasks and then dispersed. After touching base with one another to make sure everyone was on the same page, we dedicated the bulk of our time to de-identifying previously collected data.
Crow is built on the Purdue Corpus of Second Language Writing (PSLW), a collection of student-produced documents from the ENGL 106i courses here at Purdue. Before these documents can be uploaded into the corpus, they must be de-identified, so we split up into groups and we each tackled a set of documents. We reviewed each document and redacted any information that could lead to the identification of the writer, including names, locations such as hometowns or dorm halls, specific course names, and specific professor names. Rather than just deleting the identifying word or words, we replaced each one with its category in angle brackets. For instance, a name such as “Jordan” is replaced with “<name>”. This prevents the confusion that missing words might cause.
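Purely as an illustration of the replacement convention described above (not a description of the team's actual workflow), the substitution step could be sketched in Python roughly as follows. The function name `redact`, the `flagged` dictionary, and the sample sentence are all hypothetical; the sketch assumes a reviewer has already identified the strings to redact and labeled each with a category.

```python
import re


def redact(text, identifiers):
    """Replace each flagged string with its category in angle brackets.

    `identifiers` maps an identifying string (as it appears in the text)
    to a category label, e.g. {"Jordan": "name"}. Hypothetical helper.
    """
    if not identifiers:
        return text
    # Try longer phrases first so a full name wins over a first name alone.
    phrases = sorted(identifiers, key=len, reverse=True)
    # Match whole words or phrases only, in a single pass over the text,
    # so that inserted tags are never re-scanned and altered.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(p) for p in phrases) + r")(?!\w)"
    )
    return pattern.sub(lambda m: "<" + identifiers[m.group(0)] + ">", text)


# Hypothetical reviewer-supplied list of identifying strings.
flagged = {"Jordan": "name", "Example Hall": "location"}
print(redact("Jordan walked back to Example Hall.", flagged))
```

Keeping the category inside the tag, rather than deleting the word, means the sentence still reads naturally and a later reader can tell what kind of information was removed. Matching is case-sensitive here, so a human pass would still be needed to catch misspellings and nicknames.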
De-identifying the documents, though tedious and mind-numbing, is an important step in our process. At this point, we want to look for themes that span multiple documents, not focus on individual documents. Because of that, the specific, identifying details that writers may have included in their assignments are irrelevant to our analysis. We also want to make sure that knowing who wrote a document does not bias how we read it.
Even though we have a lot left on our to-do list, we are excited to dive in and get to work on our summer projects, and we are looking forward to the progress Crow will make during the upcoming months! We’ll be presenting at Computers & Writing 2016, and we have a lot of prototyping and design work planned. Time to get some more coffee!