Corpus and Repository of Writing

On Saturday, November 2nd, 2019, the Crow team swooped into beautiful Flagstaff to attend the AZTESOL 2019 conference. Dr. Shelley Staples first introduced the Crow project in a presentation on "Bridging L2 Writing and Academic Vocabulary Through Corpus-based Activities," followed by a workshop led by Aleksey Novikov, Ali Yaylali, and Emily Palese. Although they weren't able to attend, Dr. Adriana Picoral and Hannah Gill also shaped the workshop with valuable input. Dr. Nicole Schmidt and David Marsh were also on hand to assist participants.

Ali Yaylali (right) and Aleksey Novikov (center) lead the DDL workshop

During the workshop, attendees were introduced to Data-Driven Learning (DDL), an inductive approach to language learning in which corpus data provides language learners with authentic examples of language use and grammatical forms from other learners. After being guided through an example interactive DDL activity using scrollable concordance lines developed in the Crow lab, attendees were invited to make their own activities using data from the Crow corpus.

Example of the scrollable concordance lines for the Crow corpus used for the interactive DDL activity

After sharing their interactive DDL activity ideas, workshop participants were asked to share their valuable feedback with us by filling out a feedback survey. Many thanks to everyone who attended!

Later in the day, Crow’s own Dr. Nicole Schmidt also presented her research on corpus-based pedagogy. In her presentation, “Lessons Learned: Reflecting on an Online Corpus-Based Pedagogy Workshop Series,” Nicole reflected on her experiences conducting a seven-week online corpus pedagogy workshop series for University of Arizona writing instructors. A number of these instructors chose to work with Crow to create activities for their first-year writing classrooms. The teachers especially appreciated that the Crow texts were representative of the texts that they assigned to their own students. Nicole also recently successfully defended her doctoral dissertation. Congratulations!

Dr. Nicole Schmidt: Lessons Learned: Reflecting on an Online Corpus-Based Pedagogy Workshop

On the way back to their Tucson nest, the Crowbirds made a brief stop in Sedona to rest their wings and appreciate the beautiful and mystical landscape.

 From left to right, Emily Palese, Dr. Shelley Staples, Aleksey Novikov, and David Marsh

Image: Concordance lines embedded from Crow’s corpus, used in a Literacy Narrative activity

As the spring season brings about renewal, Crow is excited to share our efforts in working closely with instructors to develop pedagogical materials. Dr. Shelley Staples’ CUES (Center for University Education Scholarship) project began in Fall 2019 as a series of focus groups designed to create a conversation around what sorts of materials second language writing instructors would find useful, and how Crow could incorporate samples of student writing from our corpus into those materials.

Image: Frequency information on transition phrases in Literacy Narratives from the Crow corpus

We then used that feedback to create materials using data from our corpus. In later focus groups, our team (Nina Conrad, Emily Palese, Aleks Novikov, Kevin Sanchez, and Alantis Houpt) presented instructors with the materials we designed for their feedback and eventual implementation in their classes. Working alongside the instructors gave us insight into their needs and allowed them to have input on the types of activities they would be using. The second language writing instructors were given a variety of activities related to major assignments, which they could choose from accordingly and incorporate into their instruction. Researcher Nina Conrad attended classes to observe as instructors implemented the Crow-based materials in their classrooms.

Image: List of most frequent words used in Genre Analysis papers

Now that instructors have implemented two sets of the materials Crow has crafted, we are surveying students and instructors to ask for their feedback. We have been learning a lot from observing how participating instructors incorporate corpus-based materials into their existing curriculum. Moving forward, we plan to incorporate the feedback we receive from students and instructors into our future focus groups, where we will continue to test and improve the corpus-based pedagogical materials we create. Some of the materials we have created can be found on our Crow for Teachers site, alongside workshops from past conferences such as AZTESOL.

The Crow team would like to congratulate Dr. Adriana Picoral, who recently defended her dissertation, L3 Portuguese by Spanish-English bilinguals: Copula construction use and acquisition in corpus data.

Dr. Adriana Picoral and her committee. From left: Mike Hammond, Shelley Staples, Adriana Picoral, Ana Carvalho, and Peter Ecke.

Building a corpus isn’t just a matter of collecting texts in a directory on a server. Crow team members are continually improving the code we use to process and de-identify contributed texts, building documentation to describe our approaches, and hosting workshops to help team members at our Arizona and Purdue sites become better corpus builders.

During Summer 2019, the Arizona team, led by Adriana Picoral, improved our existing corpus building scripts, and in some cases rewrote them to add headers, organize metadata, and perform de-identification (de-id) using pandas (a Python package for manipulating tabular data). De-identifying texts is necessary to ensure our participants’ privacy, and our process includes both machine- and human-performed removal of names and other potentially identifying information. This identifiable information is replaced in the text by tags such as <name>. Based on these changes, Picoral and Aleksey Novikov have added documentation on running the scripts on both Windows and macOS.
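As a rough illustration of the machine-performed step, the sketch below shows how known names from a metadata table might be swapped for <name> tags. The file names, column names, and sample text here are hypothetical, not Crow's actual data format.

```python
import re

import pandas as pd

# Hypothetical metadata table; the column names are illustrative only.
metadata = pd.DataFrame({
    "filename": ["essay_001.txt"],
    "student_name": ["Jane Doe"],
})

def deidentify(text, names):
    """Replace each known name (and each part of it) with a <name> tag."""
    for full_name in names:
        for part in full_name.split():
            # \b word boundaries keep us from matching inside longer words.
            text = re.sub(r"\b" + re.escape(part) + r"\b", "<name>", text)
    return text

sample = "My name is Jane Doe and I study at the university."
cleaned = deidentify(sample, metadata["student_name"])
print(cleaned)  # My name is <name> <name> and I study at the university.
```

The real Crow scripts also handle headers and metadata organization; this sketch covers only the tag-replacement idea.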

After each new script was ready, Picoral led a series of workshops helping the Arizona Crow team learn to run these scripts, with Purdue researchers joining remotely. Most participants had not used the command line before, so the workshops also served as an enriching introduction to command-line tools. Running these scripts on different computers and platforms also helped us identify and troubleshoot various issues, which, in turn, helped us update our documentation.

Through an iterative process of randomly selecting data for manual de-identification and logging issues that Crow researchers discovered as they de-identified texts, different regular expression patterns were added to the de-id scripts to remove as many student and instructor names as possible. Regular expressions are special combinations of wildcards and other characters which perform the sophisticated matching we need to accurately de-identify texts with automated processes. We decided to flatten all the diacritics with the cleaning script because it was easier to work with names that had been standardized to a smaller character set.
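Diacritic flattening of the kind described above can be done with Python's standard library: decompose each accented character, then drop the combining marks. This is a minimal sketch of the idea, not Crow's actual cleaning script.

```python
import unicodedata

def flatten_diacritics(text):
    """Standardize accented characters to plain ASCII letters,
    e.g. so a name spelled with and without accents matches one pattern."""
    # NFKD splits characters like "ñ" into "n" + a combining tilde;
    # dropping the combining marks leaves the base letters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

flatten_diacritics("Muñoz")  # -> "Munoz"
```

With names standardized this way, one regular expression pattern can match a name regardless of how its diacritics were typed.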

During the Fall semester, we have continued to improve our de-identification processes using an interactive tool, also developed by Picoral. The tool highlights capitalized words in each text, making it easier to spot-check for names not caught by the de-id script, such as the names of students’ friends or family members, which are not automatically included in our process. Each file from Fall 2017 and Spring 2018 was manually checked by an Arizona team that included Kevin Sanchez, Jhonatan Muñoz, Picoral, and Novikov. All in all, we processed 1,547 files, spending an average of 1.5 minutes checking each file.
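The highlighting heuristic could look something like the sketch below: flag capitalized words that do not begin a sentence, since those are likelier to be names the automated pass missed. This is an assumed simplification of the tool's logic, not its actual code.

```python
import re

def candidate_names(text):
    """Flag capitalized words that don't start a sentence --
    likely names missed by the automated de-id pass."""
    candidates = []
    for match in re.finditer(r"\b[A-Z][a-z]+\b", text):
        preceding = text[:match.start()].rstrip()
        # Skip words at the start of the text or right after sentence-ending
        # punctuation, since those are capitalized for ordinary reasons.
        if preceding and not preceding.endswith((".", "!", "?")):
            candidates.append(match.group())
    return candidates

candidate_names("Last summer I visited Maria. She showed me around Tucson.")
# -> ["Maria", "Tucson"]
```

A human checker would still review each flagged word, since place names and other proper nouns are highlighted alongside personal names.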

Because we’ve developed it as a team, the de-identification tool is user-centered, allowing Crow researchers to more quickly and effectively find and redact names and other identifiable information. In the Crow corpus, these identifiers are replaced with tokens like <name>, <place>, and the like.

To improve the quality of de-identification for previously processed data, both the Arizona team, led by Novikov, and the Purdue team, led by Ge Lan, performed additional searches on files already in the Crow corpus, using regular expressions that covered potential variations of instructors’ names. An additional script then removed all standalone lines containing only <name> tags and no other text. These files were updated in the corpus during our October 2019 interface update.
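The standalone-line cleanup described above amounts to dropping any line whose content is nothing but <name> tags and whitespace. A minimal sketch of that idea (assuming plain-text files; this is not the actual Crow script):

```python
import re

def drop_name_only_lines(text):
    """Remove lines containing only <name> tags and whitespace,
    e.g. leftover signature or header lines after de-identification."""
    kept = []
    for line in text.splitlines():
        # One or more <name> tags, possibly separated by whitespace,
        # and nothing else on the line.
        if re.fullmatch(r"(\s*<name>\s*)+", line):
            continue
        kept.append(line)
    return "\n".join(kept)

essay = "My essay begins here.\n<name> <name>\nAnd continues here."
drop_name_only_lines(essay)
# -> "My essay begins here.\nAnd continues here."
```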

Screenshot of Crow de-identification tool showing a segment of text with de-identification targets highlighted and options for tokens to replace them visible.
De-identification tool developed by Crow researcher Adriana Picoral

Our next steps include replicating the metadata processing and header-addition steps at Purdue, which has a slightly different metadata acquisition process than Arizona given differences in the demographic data recorded by the two institutions. We will also continue improving the interactive de-identification tool so that it can eventually be released to a broader audience. Sharing our work in this manner not only helps other corpus builders, but gives us additional sources of feedback that help us keep building better tools for writing research.

Over the summer, researchers Ashley Velázquez, Michelle McMullin, and Bradley Dilger worked on a series of projects that grew out of Crow’s Humanities Without Walls (HWW) grant. As Crow wraps up our work with HWW, we have been able to reflect on the benefits and resources developed through our HWW Grad Lab Practicum (GLP).

The HWW grant has given Crow space to re-contextualize internal documents and processes in a format better suited to outward-facing audiences. In addition, HWW closeout has allowed Crowbird researchers to pursue a series of projects and develop publication strategies for distributing accessible, approachable information.

These projects grew out of internal documents adjacent to the HWW grant, but are now being developed into standalone, outward-facing documentation other research teams can consult.

Best practices

Table of contents of the updated document detailing Crow best practices.
Crow best practices table of contents.

A review and update of internal documentation produced a refreshed set of Crow best practices, an initiative led by Velázquez. The development of Crow best practices dates back to 2016, but they needed updating and streamlining to reflect the new directions of Crow research and collaboration. We’ve learned that regularly reconsidering our best practices offers many benefits. Much of the existing best practice material was spread across various documents and locations; streamlining it into a single document creates an easily referenced, accessible guide for onboarding new Crow collaborators that also serves as a reference for ongoing collaboration. Reviewing our existing best practices and reflecting on our development processes has also helped us realize that writing a Crow code of ethics will be useful as we build a community of Crow researchers and users.

Mentoring and Resources for Graduate Students

Arising from her experiences as an American Association of University Women (AAUW) fellow and the GLP’s focus on professional development and grant writing, Velázquez assembled resources to support Crow researchers in writing and applying for grants and fellowships. For now, this document is focused on supporting Crow fellowship writers (particularly GLP team members), though material from it will be published for a general audience in the near future, helping future AAUW fellows benefit from Velázquez’s experiences.

White paper: Focus on the Practicum lab and a toolkit for building this model for other researchers

As a result of funding and research opportunities provided by HWW, Velázquez, McMullin, and Dilger are authoring a white paper focusing on iterative, collaborative work in the context of Crow’s sustainable, interdisciplinary research. This white paper will introduce elements of constructive distributed work and the model of the GLP to other researchers, sharing our approaches and continuing the conversation started with our 2019 Computers & Writing roundtable presentation.

On Friday, October 4th, the Arizona Crowbirds opened their nest for the public to visit. During the Open Lab students, instructors, and staff were invited to explore the lab and interact with the University of Arizona’s research team. Visitors also had the opportunity to interact with Crow’s online interface, diving into the Corpus and Repository through a hands-on experience. They learned more about the different research and outreach projects the lab is involved in for both Crow and MACAWS (Multilingual Academic Corpus of Assignments–Writing and Speech), our cousin corpus.

Gathering of people standing in a small office. At left, Shelley Staples demonstrates the Crow interface.
Shelley Staples (far left), David Marsh (center), and Jhonatan Henao Muñoz (far right) chatting with open house attendees.

Thank you to everyone who attended our Open House! The Arizona Crowbirds look forward to continuing to share their progress and hard work with the community. To access the online interface, visit

Jhonatan Henao Muñoz guides an attendee through the Crow corpus web interface.
Jhonatan Henao Muñoz (right) offering a guided tour of the Crow web interface.

This blog post was written by Emily Palese.

Planning: Spring 2019

In between processing students’ texts for the corpus, Hadi Banat, Hannah Gill, Dr. Shelley Staples and Emily Palese met regularly during Spring 2019 to strategize about expanding Crow’s repository. At the time, the repository had 68 pedagogical materials from Purdue University, but none from the University of Arizona and no direct connections between students’ corpus texts and the repository materials. 

Hadi led our team’s exploration of how repository materials had been processed previously, including challenges faced, solutions found, and questions that remained unresolved. With this context, we used Padlet to brainstorm how we might classify new materials and what information we’d like to collect from instructors when they share their pedagogical materials.

A section of our collaborative Padlet mindmap

Once we had a solid outline, we met with instructors and administrators from the University of Arizona’s Writing Program to pitch our ideas. Finally, with their feedback, we were able to design an online intake form with categories that would be helpful for Crow as well as instructors, administrators, and researchers.

Pilot & Processing Materials: Summer 2019

To pilot the online intake survey, we asked eight UA instructors and administrators to let us observe and record their experiences as they uploaded their materials. This feedback helped us make some important immediate fixes and also prompted new approaches and modifications to the form. Another benefit of piloting the intake form was that we received additional materials to begin processing and adding to the repository.

Before processing any UA repository materials, Hannah and Emily first reflected on their experiences processing corpus texts and discussed the documents that had helped them navigate and manage those processing tasks. With those experiences in mind, they decided to begin two documents for their repository work: a processing guide and a corresponding task tracker.

Processing Guide: “How to Prepare Files for the Repository”

To create the processing guide, Hannah and Emily first added steps from Crow’s existing corpus guide (“How to Prepare Files for ASLW”) that would apply to repository processing. Using those steps as a backbone, they began processing a set of materials from one instructor together, taking careful notes of new steps they took and key decisions they made. 

At the end of each week, they revisited these notes and discussed any lingering questions with Dr. Staples. They then added in additional explanations, details, and examples so that the process could easily and consistently be followed by other Crowbirds in the future. 

The result was a nine-page processing guide with 12 discrete processing steps.

Task Tracker: “Repository Processing Checklist”

When they worked as a team processing corpus texts in Spring 2019, Hannah, Jhonatan, and Emily used a spreadsheet to track their progress and record notes about steps if needed. This was particularly helpful on days when they worked independently on tasks; the tracker helped keep all of the team members up-to-date on the team’s progress and aware of any issues that came up. 

With this in mind, Hannah and Emily created a similar task tracker for their repository processing work. The tracking checklist was developed alongside the processing guide so that it would have exactly the same steps. With identical steps, someone using the tracking checklist could refer to the processing guide if they had questions about how to complete a particular step. Once a step is completed, the Crowbird who finished the step initials the box, leaves a comment if needed, and then moves to the next step. 

Below is a screenshot of what part of the tracking checklist looks like.

Developing the processing guide and the corresponding checklist was an iterative process that often involved revisiting previously completed tasks to refine our approach. Eventually, though, the guide and checklist became clear, consistent, and sufficiently detailed to support processing a variety of pedagogical materials.


By the end of the summer, Hannah, Emily, and Dr. Staples successfully processed 236 new pedagogical materials from the University of Arizona and added them to Crow’s online interface. 

For the first time, Crow now has direct links between students’ texts in the corpus and pedagogical materials from their classes in the repository. This linkage presents exciting new opportunities for instructors, researchers, and administrators to begin large-scale investigations of the connections between students’ drafts and corresponding instructional materials!

The team is growing at the University of Arizona, as Crowbirds welcome visiting scholar David Marsh to the flock. He is an associate professor of English at the National Institute of Technology, Wakayama College, Japan, and is currently in Arizona for one year as a CESL/SLAT Visiting Scholar. David’s research interests are related to second language teaching and corpus analysis of technical/engineering English.

In his free time, David likes to play with his son and cook.

Welcome, David, we look forward to working more with you in the future!

Bradley Dilger and Hadi Banat attended the Council of Writing Program Administrators’ annual conference in Baltimore, Maryland, and conducted a workshop to introduce the Crow platform and its various uses to the CWPA audience. Participants explored multiple features of the Crow platform and reflected on its potential uses for their own research and writing programs. After Dr. Dilger introduced the Crow project, design practices, and the technical aspects of building and maintaining the interface, graduate dissertation fellow Hadi Banat discussed Crow’s adopted methods to collect corpus texts and repository pedagogical materials. Both Dilger and Banat led a guided tour of our web interface, provided ample time for hands-on exploration, and assisted workshop participants by answering queries during our extensive individual work time. Finally, participants reported on their experience interacting with the interface, provided feedback on their interface experience, and reflected on ways to utilize this resource in their own institutional contexts.

Banat describing our approach to collaboration

During our conversations with CWPA workshop participants, we discussed the following:

  • Multiple Word Handling feature (Contains any word or Contains all words) in corpus search and possible additional interface features 
  • Our GitHub tools related to processing and de-identifying student texts
  • Pedagogical material de-identification, ownership, and labor concerns
  • Usability of the repository materials and corpus texts for graduate student practicums
  • Coding multimodal digital projects and related repository backend work 
  • Open source platform, user permissions, and access to data 
  • Open source platform and access criteria pertaining to various user profiles

After our conversations, we invited workshop participants to share more feedback with us by filling out a survey. Following the lead of user experience and usability practitioners, we see outreach workshops and user feedback as instrumental to the continued development of our interface. Thanks to our ACLS extension grant, we were able to offer gift cards to participants who filled out the survey, another part of our outreach work to build a network of potential Crow contributors and researchers.

Dilger responding to workshop participant questions

In addition to the time we spent in sessions and networking with other scholars and peers, we did not forget to enjoy the scenic inner harbor of Baltimore and the city’s multicultural cuisines. Dilger went for sunrise runs before breakfast and conference talks, and Banat enjoyed sunset walks after long days at the conference. (At CWPA, breakfast starts at 6:45am!) 

Baltimore’s inner harbor at sunset

At the end of the conference, CWPA organizers took us on a trip to the American Visionary Art Museum where we enjoyed snacks, desserts, and beer before we took a tour of the museum and admired its unique pieces. Through this social event, we also met new friends and had entertaining conversations outside the realm of academia.

Pieces from the American Visionary Art Museum

We hope to attend CWPA 2020 in Reno, Nevada and share our work with the writing program administration community again. 

Our Crow mascot enjoying the conference

Hadi Banat designed and delivered two Crow outreach workshops: one at Computers and Writing 2019 in East Lansing, Michigan, and one at the Council of Writing Program Administrators (CWPA) 2019 conference in Baltimore, Maryland. With the Transculturation team, he also presented the project’s pilot results at CWPA.

With Emily Palese and Shelley Staples, Banat sent a proposal about Crow’s repository development for the TESOL Conference taking place in Colorado in March 2020. He mentored undergraduate interns in the Transculturation Lab and coded dissertation data. His article “English in Lebanon: Policy Making, Education, and User Agency” was one of nine articles published in the special Middle East North African (MENA) issue of the journal World Englishes.

Bradley Dilger attended C&W 2019 in East Lansing, presenting with Hadi Banat and Michelle McMullin about Crow’s HWW grant and sharing the Crow interface with new audiences with Emily Jones. He presented with Banat at CWPA 2019 in Baltimore as well.

Wendy Gao has been developing four new test forms for the Purdue Language and Culture Exchange Program (PLaCE) since Summer 2018. After rounds of revisions and reorganization, all the new test forms are available for use; the new ACE-in test will be administered by PLaCE to more than 600 incoming undergraduate students at Purdue this fall. 

Gao co-authored a book chapter with April Ginther, “L2 Speaking: Theory and Research,” submitting their latest draft in May. Her proposal to the MWaLT (Midwest Association of Language Testers) Conference 2019, “Concept Mapping for Guiding Rater Training in an ESL Elicited Imitation Assessment,” was accepted. It is a study of rater disagreement in evaluating the listen-and-repeat sentence items for ACE-in.

Ge Lan collaborated with two Purdue SLS students to publish an article in System, “Does L2 writing proficiency influence noun phrase complexity? A case analysis of argumentative essays written by Chinese students in a first-year composition course.” The data came from the PSLW corpus (an earlier version of the Crow corpus). 

Jhonatan Henao-Muñoz attended the ATISA Summer School in Translation and Interpreting and participated in a corpus reading group with Aleksey Novikov. He also participated in an internship on Digital Humanities and Podcasting from the National Humanities Center at San Diego State University. 

Emily Palese and Hannah Morgan Gill piloted and improved the repository intake form that they developed with Banat last spring. They also collected and processed over 200 materials from UA instructors for the repository. Emily created a Processing Guide to help future Crowbirds follow our steps.

Adriana Picoral was selected as an Arizona Data Science Ambassador. She had two papers accepted to NWAV48 (New Ways of Analyzing Variation), the flagship sociolinguistics conference. Adriana also dedicated a lot of time to Crow, processing files for both ASLW and MACAWS and improving our processing methods. She’s most excited about working with Mark Fullmer and other Crowbirds to develop a tool that will improve Crow’s integration of corpus and repository.

Samantha Pate Rappuhn, a former Purdue undergraduate researcher, was hired as Grants and Academic Projects Coordinator for Ivy Tech Community College, working on the Kokomo, Indiana campus.

Ji-young Shin completed a summer internship during which she participated in multiple research projects. Her book review on response processes validity is now in print (Language Testing). A book chapter on stance features (Crow-data-based research presented at TaLC) is in review. She also received a small external grant (a Language Learning Dissertation Grant). Lastly, she was invited to review for a couple of peer-reviewed journals, making her debut as an official reviewer. 

Shelley Staples was accepted as a Center for University Education Scholarship (CUES) Fellow, with $50,000 in funding to support the development and evaluation of corpus-based pedagogical materials for the SLW classes at UA in Summer 2019. She had three journal articles accepted: one with Ge Lan for the Journal of Second Language Writing on complexity in SLW, one for Register Studies on a multidimensional analysis of healthcare communication, and one for Language Testing on a multidimensional analysis of L2 written assessment. A fourth paper is now in press for a Routledge book on triangulating research methods (corpus linguistics and assessment).

She presented on multimodality in Crow with Jeroen Gevers at the Computers and Writing conference, and gave three papers at the Corpus Linguistics conference in Cardiff, Wales: one on the MD analysis of healthcare communication (the Register Studies article mentioned above), one on an MD analysis of spoken oral assessment (see the book chapter above), and a third on L2 writing complexity in the British Academic Written English corpus (a follow-up to a 2016 Written Communication article).

Shelley also began her duties as Associate Director of Second Language Writing in the UA Writing Program and, with Chris Tardy, created versions of two L2 writing courses for the microcampus in Lima, Peru.

David Stucker worked as an intern for an industrial welding application firm in Greenfield, Indiana. He developed documentation for the operation and installation of an automated spark plug ground electrode welding system under contract from Federal-Mogul in France. He also developed a set of standardized manual templates and wrote a company style guide.

Ali Yaylali was hired as an RA for the education extension of a USDA-funded environmental science project (SBAR) and started working with a secondary science teacher. He will design literacy and writing lessons and activities, working particularly with English Language Learners (ELLs) in science classrooms. This is an initial step toward solidifying his plans to focus on scientific writing by L2 writers, secondary school genres, and the language of science. Ali also helped the Arizona Crowbirds write a proposal for the AZTESOL Conference.