Corpus and Repository of Writing

Congratulations to the graduate and undergraduate Crowbirds who are earning degrees this academic year!

  • Hadi Riad Banat, PhD, English (Second Language Studies), Purdue
  • Bruna Somner Farias, PhD, Second Language Acquisition and Teaching, Arizona
  • Jie Wendy Gao, PhD, English (Second Language Studies), Purdue
  • Emily Jones, BA, Professional Writing, Purdue
  • Ge Lan, PhD, English (Second Language Studies), Purdue
  • Adriana Picoral, PhD, Second Language Acquisition and Teaching, Arizona
  • Nicole Schmidt, PhD, Second Language Acquisition and Teaching, Arizona
  • David Stucker, BA, Professional Writing, Purdue
  • Yiqiu Echo Yan, BA, Retail Management, Purdue

Next week, we’ll share more news about the accomplishments of everyone on our Crow team this past year. Well done, graduates!

We’re joining the Council on Undergraduate Research to celebrate #VirtualURW2020 by spotlighting some of the wonderful contributions undergraduate researchers have made to the Crow team.

Undergraduates have been active Crow researchers since our project began at Purdue in 2015. Some examples:

See our Twitter feed to learn more about the work Echo, Ryan, Kevin, David, Alantis, and Hannah have been doing for the Crow team. Thank you, undergraduate researchers!

Congratulations to Dr. Jie (Wendy) Gao (高洁), who recently defended her dissertation, “Linguistic Profiles of High Proficiency Mandarin and Hindi Second Language Speakers of English.” Dr. Gao’s committee was Purdue professors Dr. April Ginther, Dr. Elaine Francis, and Dr. Tony Silva, with Dr. Xun Yan from the University of Illinois Urbana-Champaign.

Dr. Wendy Jie Gao

At Purdue, Dr. Gao has been studying second language studies, language testing and corpus studies. She has been a member of Crow since she first met Dr. Staples in 2015, at the inception of the Crow program. In that time, she’s gotten through an entire PhD program—so needless to say, she’s made some big contributions to Crow throughout her tenure.

Dr. Gao’s linguistics background runs deep. She studied at Shandong University for her undergraduate education, majoring in Translation and Interpretation with a minor in French. She continued her studies with a master’s degree from Tsinghua University in Beijing, studying applied linguistics. At Purdue, her PhD studies and research have been in the areas of second language studies, language testing, and corpus studies. Between her time in mainland China, a summer semester in Taiwan, continued studies of French, and her experiences in the United States, Dr. Gao has experienced a multitude of different language environments. These diverse experiences have guided her thinking in her research positions.

Here in West Lafayette, she has held multiple positions as a research assistant. Dr. Gao began as a research assistant with the Oral English Proficiency Program (OEPP), working as a testing office assistant. Her job has been to evaluate the oral proficiency level of graduate students before they begin teaching in the classroom. She also served as a research assistant for the Purdue Language and Cultural Exchange (PLACE). These two experiences primed her for the Crow research environment.

Rounding out her experiences outside of Crow, Dr. Gao has seen linguistics through the eyes of the instructor. In her master’s program, she was a teaching assistant, while here at Purdue, she has been an instructor for English 106, 106INTL, and 108 classes. Being an instructor can be more demanding and involved than rating papers as a research assistant, but Dr. Gao loves it nonetheless. While most of her teaching experience has been in writing, she hopes she’ll be able to teach in her true passion: linguistics.

In her time with Crow, Dr. Gao has truly run the gamut in her contributions. She was first drawn to Crow because of its overlap with her previous work in her master’s program, where she developed an automatic scoring system for writing. In her tenure with Crow, she has focused on a few major projects and topics. She first focused on repository building, working to unify the corpus into a uniform format and standardize handling of pedagogical materials.

She has also spent significant time becoming proficient with statistical programming tools, including SAS, SPSS, and Python. Dr. Gao has also written pedagogical papers for Crow and presented at conferences across the world, everywhere from Purdue to Birmingham, United Kingdom. And to top it all off, she has engrossed herself in grant writing. Right now, Dr. Gao spends her time in Crow serving as an external reader and providing feedback to the different groups within Crow.

Besides looking forward to postgraduate work, Dr. Gao shares her knowledge and experience with the newer Crowbirds by mentoring undergraduates in Crow at Purdue. She has worked with Echo Yan to host a workshop on Crow at a Purdue digital humanities symposium. Dr. Gao has also been working with David Stucker to find past winners of grants. This is a part of a crucial Crow mission to have graduate team members understand how to benefit from working with undergraduates.

Dr. Gao’s dissertation centers on an analysis of over 400 speech samples of higher-level second language speakers of English from across East Asia. This work will go towards developing a more holistic analysis of English proficiency, incorporating vocabulary, speed, and, importantly, accentedness into the metric. Like fellow Crowbird Dr. Ge Lan, Dr. Gao plans to return to her native China soon, where she hopes to take a position in applied linguistics. A perfect scenario for her would be to study and teach linguistics back in China. We look forward to her continued work with our project!

Update, August 16: Dr. Gao has accepted a position in the Department of English at Fudan University in Shanghai. Congratulations!

We’re very glad to announce Crow researcher Dr. Ge Lan has successfully defended his dissertation, Noun Phrase Complexity, Academic Level, and First Language Background in Academic Writing. Dr. Lan’s committee was chaired by April Ginther and included Elaine Francis and Crowbirds Shelley Staples and Bradley Dilger.

Dr. Ge Lan outside the Wilmeth Active Learning Center (WALC) at Purdue University.

On Saturday, November 2nd, 2019, the Crow team swooped into beautiful Flagstaff to attend the AZTESOL 2019 conference. Presenting on the topic “Bridging L2 Writing and Academic Vocabulary Through Corpus-based Activities”, Dr. Shelley Staples first introduced the Crow project, followed by a workshop led by Aleksey Novikov, Ali Yaylali, and Emily Palese. Although they weren’t able to attend, Dr. Adriana Picoral and Hannah Gill also shaped the workshop with their valuable input. Dr. Nicole Schmidt and David Marsh were also on hand to assist participants.

Ali Yaylali (right) and Aleksey Novikov (center) lead the DDL workshop

During the workshop, attendees were introduced to Data-Driven Learning (DDL), an inductive approach to language learning in which corpus data is used to provide language learners with authentic examples of language use and grammatical forms from other learners. After being guided through an example interactive DDL activity using scrollable concordance lines developed in the Crow lab, attendees were invited to make their own activities using data from the Crow corpus.

Example of the scrollable concordance lines for the Crow corpus used for the interactive DDL activity

After sharing their interactive DDL activity ideas, workshop participants were asked to share their valuable feedback with us by filling out a survey feedback form. Many thanks to everyone who attended!

Later in the day, Crow’s own Dr. Nicole Schmidt also presented her research on corpus-based pedagogy. In her presentation, “Lessons Learned: Reflecting on an Online Corpus-Based Pedagogy Workshop Series,” Nicole reflected on her experiences conducting a seven-week online corpus pedagogy workshop series for University of Arizona writing instructors. A number of these instructors chose to work with Crow to create activities for their first-year writing classrooms. The teachers especially appreciated that the Crow texts were representative of the texts that they assigned to their own students. Nicole has also recently defended her doctoral dissertation. Congratulations!

Dr. Nicole Schmidt: Lessons Learned: Reflecting on an Online Corpus-Based Pedagogy Workshop

On the way back to their Tucson nest, the Crowbirds made a brief stop in Sedona to rest their wings and appreciate the beautiful and mystical landscape.

From left to right, Emily Palese, Dr. Shelley Staples, Aleksey Novikov, and David Marsh

Image: Concordance lines embedded from Crow’s corpus, used in a Literacy Narrative activity

As the spring season brings about renewal, Crow is excited to share our efforts in working closely with instructors to develop pedagogical materials. Dr. Shelley Staples’ CUES (Center for University Educational Scholarship) project began in Fall 2019 as a series of focus groups designed to create a conversation around what sorts of materials second language writing instructors would find useful, and how Crow could incorporate samples of student writing from our corpus into those materials.

Image: Frequency information on transition phrases in Literacy Narratives from the Crow corpus

We then used that feedback to create materials using data from our corpus. In later focus groups, our team (Nina Conrad, Emily Palese, Aleks Novikov, Kevin Sanchez, and Alantis Houpt) presented instructors with the materials we designed for their feedback and eventual implementation in their classes. Working alongside the instructors gave us insight into their needs and allowed them to have input on the types of activities they would be using. The second language writing instructors were given a variety of activities related to major assignments, which they could choose from accordingly and incorporate into their instruction. Researcher Nina Conrad attended classes to observe as instructors implemented the Crow-based materials in their classrooms.

Image: List of most frequent words used in Genre Analysis papers

Now that instructors have implemented two sets of the materials Crow has crafted, we are surveying students and instructors to ask for their feedback. We have been learning a lot from observing how participating instructors incorporate corpus-based materials into their existing curriculum. Moving forward, we plan on incorporating the feedback we receive from students and instructors into our future focus groups, during which we will continue to test and improve on the corpus-based pedagogical materials we create. Some of the materials we have created can be found on our Crow for Teachers site, alongside workshops from past conferences such as AZTESOL.

The Crow team would like to congratulate Dr. Adriana Picoral, who recently defended her dissertation, L3 Portuguese by Spanish-English bilinguals: Copula construction use and acquisition in corpus data.

Dr. Adriana Picoral and her committee. From left: Mike Hammond, Shelley Staples, Adriana Picoral, Ana Carvalho, and Peter Ecke.

Building a corpus isn’t just a matter of collecting texts in a directory on a server. Crow team members are continually improving the code we use to process and de-identify contributed texts, building documentation to describe our approaches, and hosting workshops to help team members at our Arizona and Purdue sites become better corpus builders.

During Summer 2019, the Arizona team, led by Adriana Picoral, improved our existing corpus building scripts, and in some cases rewrote them to add headers, organize metadata, and perform de-identification (de-id) using Pandas (a package in Python that allows users to manipulate data). De-identifying texts is necessary to ensure our participants’ privacy, and our process includes both machine- and human-performed removal of names and other potentially identifying information. This identifiable information is replaced in the text by tags such as <name>. Based on these changes, Picoral and Aleksey Novikov have added documentation on running scripts using both Windows and Mac OS platforms.
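As a rough illustration of the machine-performed step described above, the sketch below replaces a list of known names with the <name> tag. It is not Crow’s actual script: the function name redact_names, the sample sentence, and the names in it are hypothetical, and the sketch uses only Python’s standard re module rather than Pandas.

```python
import re

def redact_names(text, names, tag="<name>"):
    """Replace each known name in `text` with `tag` (whole words, any case)."""
    # Try longer names first so "Jane Doe" is matched before "Jane".
    ordered = sorted((n for n in names if n), key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(tag, text)

# Hypothetical roster and sentence, for illustration only.
sample = "My instructor, Jane Doe, asked Jane to revise the draft."
print(redact_names(sample, ["Jane Doe", "Jane"]))
# My instructor, <name>, asked <name> to revise the draft.
```

A roster-based pass like this only catches names that are already known, which is one reason a human check remains part of the process.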

After each new script was ready, Picoral led a series of workshops helping the Arizona Crow team learn how to run these scripts, with Purdue researchers joining remotely. Most of the participants had not used the command line before, so that was an enriching experience. The process of running these scripts on different computers and platforms also helped us identify and troubleshoot various issues, which, in turn, helped us update our documentation.

Through an iterative process of randomly selecting data for manual de-identification and logging issues that Crow researchers discovered as they de-identified texts, different regular expression patterns were added to the de-id scripts to remove as many student and instructor names as possible. Regular expressions are special combinations of wildcards and other characters which perform the sophisticated matching we need to accurately de-identify texts with automated processes. We decided to flatten all the diacritics with the cleaning script because it was easier to work with names that had been standardized to a smaller character set.
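One common way to flatten diacritics in Python is Unicode decomposition, sketched below. We are assuming this approach for illustration; the Crow cleaning script may do it differently, and the function name flatten_diacritics and the sample names are hypothetical.

```python
import unicodedata

def flatten_diacritics(text):
    """Strip accents so that, for example, "ñ" becomes "n"."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(flatten_diacritics("José Núñez"))  # Jose Nunez
```

With names reduced to a smaller character set, a single pattern such as "Nunez" can match both the accented and unaccented spellings. Letters that have no decomposed form, such as "ø", pass through this sketch unchanged.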

During the Fall semester, we have continued to improve our de-identification processes, using an interactive tool, also developed by Picoral. The tool highlights capitalized words in each text, making it easier to spot-check for names that were not caught by the de-id script, such as those of students’ friends or family members, which are not automatically included in our process. Each file from Fall 2017 and Spring 2018 was manually checked by the Arizona team that included Kevin Sanchez, Jhonatan Muñoz, Picoral, and Novikov. All in all, we processed 1,547 files, spending an average of 1.5 minutes checking each file.
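The highlighting idea can be sketched in a few lines. This is a simplified stand-in, not the interactive tool itself: the function name mark_capitalized, the bracket markers, and the sample sentence are hypothetical, and the pattern assumes diacritics have already been flattened.

```python
import re

# A capital letter followed by one or more lowercase letters, as a whole word.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")

def mark_capitalized(text, left="[[", right="]]"):
    """Wrap each capitalized word in markers so a reviewer can spot it."""
    return CAPITALIZED.sub(lambda m: f"{left}{m.group(0)}{right}", text)

# Hypothetical sentence, for illustration only.
print(mark_capitalized("Last summer my cousin Lina visited <name> in Tucson."))
# [[Last]] summer my cousin [[Lina]] visited <name> in [[Tucson]].
```

The sketch deliberately over-flags: sentence-initial words and place names are marked along with personal names, and the human reviewer decides which ones to redact.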

Because we’ve developed it as a team, the de-identification tool is user-centered, allowing Crow researchers to more quickly and effectively find and redact names and other identifiable information. In the Crow corpus, these identifiers are replaced with tokens like <name>, <place>, and the like.

To increase the quality of the de-identification for previously processed data, both the Arizona team, led by Novikov, and the Purdue team, led by Ge Lan, performed additional searches on files that were already a part of the Crow corpus, using regular expressions with potential alterations in the instructors’ names. Running an additional script removed all the standalone lines which contained just the <name> tags and no other text. These files were updated in the corpus during our October 2019 interface update.
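The cleanup of standalone <name> lines might look something like the sketch below. The function name drop_name_only_lines and the sample text are hypothetical, and the actual Crow script may handle more cases than this.

```python
import re

# A line made up of one or more <name> tags and optional whitespace only.
NAME_ONLY = re.compile(r"^\s*(?:<name>\s*)+$")

def drop_name_only_lines(text):
    """Remove lines that contain just <name> tags and no other text."""
    lines = text.splitlines(keepends=True)
    return "".join(line for line in lines if not NAME_ONLY.match(line))

# Hypothetical de-identified file content, for illustration only.
sample = "<name>\n<name> <name>\nDear <name>, thank you.\nMy essay begins here.\n"
print(drop_name_only_lines(sample), end="")
# Dear <name>, thank you.
# My essay begins here.
```

Lines where a <name> tag appears alongside other text are left alone, as are blank lines.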

Screenshot of Crow de-identification tool showing a segment of text with de-identification targets highlighted and options for tokens to replace them visible.
De-identification tool developed by Crow researcher Adriana Picoral

Our next steps include replicating the processes for metadata processing and adding headers to text files at Purdue, which has a slightly different metadata acquisition process compared to Arizona given differences in the demographic data recorded by both institutions. We will also continue improving the interactive de-identification tool, so that it can eventually be released to a broader audience. Sharing our work in this manner not only helps other corpus builders, but gives us other sources of feedback which can help us keep building better tools for writing research.

Over the summer, researchers Ashley Velázquez, Michelle McMullin, and Bradley Dilger worked on a series of projects that grew out of Crow’s winning Humanities Without Walls (HWW) grant. As Crow is wrapping up our work with HWW, we have been able to reflect on the benefits and resources developed through our HWW Grad Lab Practicum. 

The HWW grant has given Crow space to re-contextualize internal documents and processes in a format better suited to outward-facing audiences. In addition, HWW closeout has allowed Crowbird researchers to pursue a series of projects and develop publication strategies for the distribution of accessible and approachable information.

These projects grew out of internal documents adjacent to the HWW grant, but are now being developed into standalone, outward-facing documentation other research teams can consult.

Best practices

Table of contents of the updated document detailing Crow best practices.
Crow best practices table of contents.

A series of updated Crow best practices grew out of our review and update of internal documentation, an initiative led by Velázquez. The development of a set of Crow best practices dates back to 2016, but there was a need to update and streamline them based on the new directions of Crow research and collaboration. We’ve learned that regular reconsideration of our best practices offers many benefits. Much of the existing best practice material was spread across various documents and locations. Streamlining it into a single document provides an accessible, easily referenced guide for onboarding new Crow collaborators, and a reference for ongoing collaboration. Review of our existing best practices, and reflection on our development processes, has also helped us realize that writing a Crow code of ethics will be useful as we build a community of Crow researchers and users.

Mentoring and Resources for Graduate Students

Drawing on her experiences as an American Association of University Women (AAUW) fellow and the GLP’s focus on professional development and grant writing, Velázquez assembled resources to support Crow researchers in writing and applying for grants and fellowships. For now, this document is still focused on supporting Crow fellowship writers, particularly GLP team members, though material from it will be published to a general audience in the near future, helping future AAUW fellows benefit from Velázquez’s experiences.

White paper: Focus on the Practicum lab and a toolkit for building this model for other researchers

As a result of funding and research opportunities provided by HWW, Velázquez, McMullin, and Dilger are authoring a white paper focusing on iterative, collaborative work in the context of Crow’s sustainable, interdisciplinary research. This white paper will introduce elements of constructive distributed work and the model of the GLP to other researchers, sharing our approaches and continuing the conversation started with our 2019 Computers & Writing roundtable presentation.

On Friday, October 4th, the Arizona Crowbirds opened their nest for the public to visit. During the Open Lab, students, instructors, and staff were invited to explore the lab and interact with the University of Arizona’s research team. Visitors also had the opportunity to interact with Crow’s online interface, diving into the Corpus and Repository through a hands-on experience. They learned more about the different research and outreach projects the lab is involved in for both Crow and MACAWS (Multilingual Academic Corpus of Assignments–Writing and Speech), our cousin corpus.

Gathering of people standing in a small office. At left, Shelley Staples demonstrates the Crow interface.
Shelley Staples (far left), David Marsh (center), and Jhonatan Henao Muñoz (far right) chatting with open house attendees.

Thank you to everyone who attended our Open House! The Arizona Crowbirds look forward to continuing to share their progress and hard work with the community. To access the online interface, visit

Jhonatan Henao Muñoz guides an attendee through the Crow corpus web interface.
Jhonatan Henao Muñoz (right) offering a guided tour of the Crow web interface.