Corpus and Repository of Writing

The Trump administration has resumed its cruel, racist attacks on international students by establishing rules which make it nearly impossible for them to study in the United States should precautions related to COVID-19 result in more online instruction. My fellow Crow PIs and I are contacting our legislators to ask that they intervene. 

Each of us is sharing the letter below, and we’ve encouraged Crow researchers to do the same. If you’d like to borrow from our letter to write your own #StudentBan letter or op-ed, be our guest. We suggest you review the helpful suggestions regarding effective lobbying created by Chris Marsicano; they guided our work here, and we thank him for sharing.

Our letter to lawmakers

Dear Senators and Representatives,

I am one of the leaders of Crow, the Corpus & Repository of Writing, an inter-institutional research team that studies writing using computer-based tools. We collect student writing, process it, and create searchable databases that enable data-driven research. Learn more at Today we write regarding the rules the Student and Exchange Visitor Program (SEVP) plans to publish for Fall 2020 (link below). 

We are sure you realize the many contributions that international students make to higher education: tremendous intellectual engagement, diversification of communities, and tuition revenue. More concerning, however, is the open hostility these proposed rules demonstrate toward international students who are already affected personally and professionally by the consequences of COVID-19. The increased uncertainty imposed by these rules, when travel, funding, and educational plans are already precarious, is unacceptable.

Our research project depends heavily on international students. They contribute much of the writing that we study, and many of our researchers are international graduate students. The proposed rules ignore the terrible consequences of the COVID-19 outbreak for students. Insisting on a stable face-to-face model for higher education creates a situation where students and faculty may have to compromise public health and personal safety in the name of regulatory compliance. Institutions will be discouraged from moving courses online even if common sense demands it. Given the terrible problems faced by our communities in Arizona, Texas, Florida, and other states, this is simply unconscionable. 

Crow research has already been slowed by the COVID-19 outbreak. These rules threaten to bring it to a complete halt. We ask that you pressure SEVP to modify these rules to offer all higher educational institutions the flexibility they need to meaningfully include international students in courses and research, whether online, face-to-face, or hybrid, for all of the coming academic year.

Please ensure more stability for our universities and our students. Thank you for your time.

Dr. Bradley Dilger, Purdue University
Dr. Shelley Staples, University of Arizona
Dr. Randi Reppen, Northern Arizona University
Dr. Ashley Velázquez, University of Washington
Dr. Michelle McMullin, North Carolina State University

This week, Crow researchers finalized the addition of 1,174 texts from Northern Arizona University to the Crow corpus. We’re thrilled to say this means we’ve hit two milestones:

Ten million words and ten thousand texts! To be exact, 10,155,120 words and 10,905 texts.

Why is this important? As we’ve previously shared, Crow members are constantly trying to improve the code used to process new files into the Crow dataset, and the addition of the NAU texts was another opportunity to improve our scripts and documentation. With the help of Adriana Picoral, Aleksey Novikov and Larissa Goulart adapted the scripts we use to add demographic headers and de-identify texts so that they work with the original NAU file structure, created by Shelley Staples and Randi Reppen in 2013–2014.

The NAU files were collected from English Composition classes taught between 2009 and 2012 to both L1 and L2 English students. Therefore, with the addition of these files, Crow now has L1 English assignments that can be explored through the Crow interface.

From a corpus linguistic perspective, this also means that Crow now contains a larger set of examples for identifying patterns of learner language use. This is especially important for the study of word combinations, such as collocations and lexical bundles, because these combinations are identified based on frequency.
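To make the frequency point concrete, here is a minimal Python sketch of how recurrent word sequences might be counted. It is an illustration only, not Crow's own code: the function name `find_bundles`, the thresholds, and the toy texts are all assumptions. A sequence counts as a bundle candidate when it occurs at least `min_freq` times and appears in at least `min_texts` different texts.

```python
from collections import Counter
import re


def find_bundles(texts, n=3, min_freq=3, min_texts=2):
    """Return n-word sequences that meet frequency and dispersion cutoffs.

    Hypothetical helper for illustration; the thresholds are arbitrary.
    """
    freq = Counter()        # total occurrences of each n-gram
    dispersion = Counter()  # number of texts each n-gram appears in
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        freq.update(ngrams)
        dispersion.update(set(ngrams))
    return {
        gram: count
        for gram, count in freq.items()
        if count >= min_freq and dispersion[gram] >= min_texts
    }


# Invented sample sentences, not drawn from the Crow corpus.
toy_texts = [
    "On the other hand, the results show a clear pattern.",
    "The essay is long. On the other hand, it is clear.",
    "On the other hand, we disagree with the claim.",
]
bundles = find_bundles(toy_texts, n=4, min_freq=3, min_texts=3)
print(bundles)
```

Because a sequence has to clear both cutoffs, a small corpus yields few candidates; more texts give more sequences the chance to recur often enough to be counted.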

Of course, this process wasn’t simple: each of the 1,174 texts had to be organized by course, assignment, first language (L1), and other metadata represented through shortcodes in each text’s filename—all part of Crow’s existing corpus design.

Subsequent steps in the preparation process were streamlined through automation tools the Crow team has developed. These tools bulk convert files to plaintext format and remove non-ASCII characters, assist in de-identifying personal information, and represent metadata in a machine-readable document header format. (These tools are open-source and available, and documenting how to use them is part of our ACLS-supported outreach work.)
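As a rough illustration of two of these steps, the Python sketch below strips non-ASCII characters and prepends a machine-readable header. It is not the Crow team's actual tooling: the function names, the `<Key: value>` header layout, and the metadata fields are assumptions made for this example.

```python
import unicodedata


def to_ascii(text):
    """Reduce accented letters to base letters, then drop other non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def add_header(body, metadata):
    """Prepend one '<Key: value>' line per field (hypothetical layout)."""
    header = "\n".join(f"<{key}: {value}>" for key, value in metadata.items())
    return f"{header}\n<End Header>\n\n{body}"


# Invented sample text and metadata, for illustration only.
raw = "My résumé says “hello” to café owners."
clean = to_ascii(raw)
document = add_header(clean, {"Institution": "NAU", "L1": "English"})
print(document)
```

Note that characters with no ASCII equivalent, such as the curly quotation marks here, are simply removed by this approach; a production tool might map them to straight quotes instead.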

Integrating the NAU texts alongside those from Purdue and Arizona also allowed us to navigate a common corpus-building challenge when materials are heterogeneously sourced: divergent metadata.

In particular, the NAU texts present information not yet represented in the other institutions’ texts (students’ L1) but simultaneously omit metadata for standardized test scores, college and program information, and gender identification.

Put another way, the Crow dataset is further evolving into a corpus consisting of multiple subcorpora.

So we had to take extra care that differences in the metadata were correct, rather than a result of miscategorization or human/machine error. We thus took this opportunity to build better auditing tools: we added a process for doing a “dry run” of the import of the texts into our online database which would report what new metadata would be added, as well as how many new texts were omitting metadata:

Screengrab clip of “dry run” for text processing with Crow corpus processing software. Shows computer program running at command line, ending in screen that reports database changes and the number of texts to be added to the corpus. 

From this report we could easily tick our acceptance criteria checkboxes (“Yes, we expect all 1,174 new texts not to have gender data”; “Yes, we expect a new category of L1 to be added”) before performing any database changes.
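The dry-run idea can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the Crow import software: the function `dry_run_report`, the dictionary-based records, and the field names are assumptions. The function compares incoming records with what the database already holds and reports the differences without writing anything.

```python
from collections import Counter


def dry_run_report(existing_values, new_records, expected_fields):
    """Summarize what an import would change, without modifying anything.

    existing_values: dict mapping field name -> set of values already stored
    new_records: list of dicts, one per incoming text
    expected_fields: field names every record is checked for
    """
    new_values = {}
    missing = Counter()
    for record in new_records:
        for field in expected_fields:
            value = record.get(field)
            if value is None:
                missing[field] += 1
            elif value not in existing_values.get(field, set()):
                new_values.setdefault(field, set()).add(value)
    return {
        "texts_to_add": len(new_records),
        "new_values": new_values,
        "missing": dict(missing),
    }


# Invented records, for illustration only.
existing = {"institution": {"Purdue", "Arizona"}, "gender": {"F", "M"}}
incoming = [
    {"institution": "NAU", "l1": "English"},
    {"institution": "NAU", "l1": "Chinese"},
]
report = dry_run_report(existing, incoming, ["institution", "l1", "gender"])
print(report)
```

A reviewer can compare a report like this against expectations (a new field appearing, a known field missing from every new text) and only then run the real import.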

With the up-front work of standardizing the NAU texts to match Crow’s corpus design conventions, the final step of making those texts visible and searchable in our online interface was a (relative) snap. The consistent, machine-readable nature of the corpus records meant everything “just worked”!

Are you interested in using the Crow corpus for your research? Let us know!

Thank you to Larissa Goulart, Aleksey Novikov, Randi Reppen, Shelley Staples, Adriana Picoral and Mark Fullmer for helping us reach this important milestone, and Larissa, Shelley, Mark, and Bradley Dilger for this writeup.

Congratulations to Dr. Hadi Riad Banat, who defended his dissertation, “Assessing intercultural competence in writing programs through linked courses.” Dr. Banat’s committee was Purdue professors Dr. Tony Silva, Dr. April Ginther, Dr. Margie Berns, and Dr. Bradley Dilger.

Dr. Banat begins an appointment as assistant professor at the University of Massachusetts Boston in Fall 2020.

From upper right: Dr. Hadi Riad Banat, with committee Dilger, Berns, Silva, and Ginther.

Congratulations to the graduate and undergraduate Crowbirds who are earning degrees this academic year!

  • Hadi Riad Banat, PhD, English (Second Language Studies), Purdue
  • Bruna Sommer-Farias, PhD, Second Language Acquisition and Teaching, Arizona
  • Jie Wendy Gao, PhD, English (Second Language Studies), Purdue
  • Emily Jones, BA, Professional Writing, Purdue
  • Ge Lan, PhD, English (Second Language Studies), Purdue
  • Adriana Picoral, PhD, Second Language Acquisition and Teaching, Arizona
  • Nicole Schmidt, PhD, Second Language Acquisition and Teaching, Arizona
  • David Stucker, BA, Professional Writing, Purdue
  • Yiqiu Echo Yan, BA, Retail Management, Purdue

Next week, we’ll share more news about the accomplishments of everyone on our Crow team this past year. Well done, graduates!

We’re joining the Council of Undergraduate Research to celebrate #VirtualURW2020 by spotlighting some of the wonderful contributions undergraduate researchers have made to the Crow team.

Undergraduates have been active Crow researchers since our project began at Purdue in 2015. Some examples:

See our Twitter feed to learn more about the work Echo, Ryan, Kevin, David, Alantis, and Hannah have been doing for the Crow team. Thank you, undergraduate researchers!

Congratulations to Dr. Wendy Gao, who defended her dissertation, “Linguistic Profiles of High Proficiency Mandarin and Hindi Second Language Speakers of English.” Dr. Gao’s committee was Purdue professors Dr. April Ginther, Dr. Elaine Francis, and Dr. Tony Silva, with Dr. Xun Yan from the University of Illinois Urbana-Champaign.

Dr. Wendy Jie Gao

At Purdue, Dr. Gao has focused on second language studies, language testing, and corpus studies. She has been a member of Crow since she first met Dr. Staples in 2015, at the inception of the Crow program. In that time, she has completed an entire PhD program, so needless to say, she has made some big contributions to Crow throughout her tenure.

Dr. Gao’s linguistics background runs deep. She studied at Shandong University for her undergraduate education, majoring in Translation and Interpretation with a minor in French. She continued her studies with a master’s degree in applied linguistics from Tsinghua University in Beijing. At Purdue, her PhD studies and research have been in the areas of second language studies, language testing, and corpus studies. Between her time in mainland China, a summer semester in Taiwan, continued studies of French, and her experiences in the United States, Dr. Gao has experienced a multitude of different language environments. These diverse experiences have guided her thinking in her research positions.

Here in West Lafayette, she has held multiple positions as a research assistant. Dr. Gao began as a research assistant with the Oral English Proficiency Program (OEPP), working as a testing office assistant. Her job has been to evaluate the oral proficiency level of graduate students before they begin teaching in the classroom. She also served as a research assistant for the Purdue Language and Cultural Exchange (PLACE). These two experiences primed her for the Crow research environment.

Rounding out her experiences outside of Crow, Dr. Gao has seen linguistics through the eyes of the instructor. In her master’s program, she was a teaching assistant, while here at Purdue, she has been an instructor for English 106, 106INTL, and 108 classes. Being an instructor can be more demanding and involved than rating papers as a research assistant, but Dr. Gao loves it nonetheless. While most of her teaching experience has been in writing, she hopes she’ll be able to teach her true passion: linguistics.

In her time with Crow, Dr. Gao has truly run the gamut with her contributions. She was first drawn to Crow because of its overlap with her previous work in her master’s program, where she developed an automatic scoring system for writing. In her tenure with Crow, she has focused on a few major projects and topics. She first focused on repository building, working to unify the corpus into a uniform format and standardize handling of pedagogical materials.

She has also spent significant time becoming proficient at statistical programming with SAS, SPSS, and Python. Dr. Gao has also written pedagogical papers for Crow and presented at conferences across the world, everywhere from Purdue to Birmingham, United Kingdom. And to top it all off, she has immersed herself in grant writing. Right now, Dr. Gao spends her time in Crow serving as an external reader and providing feedback to the different groups within Crow.

Besides looking forward to postgraduate work, Dr. Gao shares her knowledge and experience with the newer Crowbirds by mentoring undergraduates in Crow at Purdue. She has worked with Echo Yan to host a workshop on Crow at a Purdue digital humanities symposium. Dr. Gao has also been working with David Stucker to find past winners of grants. This is a part of a crucial Crow mission to have graduate team members understand how to benefit from working with undergraduates.

Dr. Gao’s dissertation centers on an analysis of over 400 speech samples of higher-level second language speakers of English from across East Asia. This work will go towards developing a more holistic analysis of English proficiency, incorporating vocabulary, speed, and, importantly, accentedness into the metric. Like fellow Crowbird Dr. Ge Lan, Dr. Gao plans to return to her native China soon, where she hopes to take a position in applied linguistics. A perfect scenario for her would be to study and teach linguistics back in China. We look forward to her continued work with our project!

We’re very glad to announce Crow researcher Dr. Ge Lan has successfully defended his dissertation, “Noun Phrase Complexity, Academic Level, and First Language Background in Academic Writing.” Dr. Lan’s committee was chaired by April Ginther and included Elaine Francis and Crowbirds Shelley Staples and Bradley Dilger.

Dr. Ge Lan outside the Wilmeth Active Learning Center (WALC) at Purdue University.

On Saturday, November 2nd, 2019, the Crow team swooped into beautiful Flagstaff to attend the AZTESOL 2019 conference. Presenting on the topic “Bridging L2 Writing and Academic Vocabulary Through Corpus-based Activities,” Dr. Shelley Staples first introduced the Crow project, followed by a workshop led by Aleksey Novikov, Ali Yaylali, and Emily Palese. Although they weren’t able to attend, Dr. Adriana Picoral and Hannah Gill also shaped the workshop with valuable input. Dr. Nicole Schmidt and David Marsh were also on hand to assist participants.

Ali Yaylali (right) and Aleksey Novikov (center) lead the DDL workshop

During the workshop, attendees were introduced to Data-Driven Learning (DDL), an inductive approach to language learning in which corpus data is used to provide language learners with authentic examples of language use and grammatical forms from other learners. After being guided through an example interactive DDL activity using scrollable concordance lines developed in the Crow lab, attendees were invited to make their own activities using data from the Crow corpus.

Example of the scrollable concordance lines for the Crow corpus used for the interactive DDL activity
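For readers unfamiliar with concordance lines, the Python sketch below produces a simple keyword-in-context (KWIC) display. It is a generic illustration, not the code behind the Crow interface; the function name `kwic`, the window size, and the sample sentences are assumptions.

```python
import re


def kwic(texts, keyword, window=4):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for text in texts:
        tokens = re.findall(r"\w+", text)
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append((left, token, right))
    return lines


# Invented sentences, not drawn from the Crow corpus.
samples = [
    "The author argues that however difficult the task, students improve.",
    "However, the second draft shows clearer transitions.",
]
hits = kwic(samples, "however", window=3)
for left, word, right in hits:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>25}  {word}  {right}")
```

Aligning the keyword in a central column is what lets learners scan down the lines and notice patterns in the words on either side.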

After sharing their interactive DDL activity ideas, workshop participants were asked to give us feedback by filling out a survey. Many thanks to everyone who attended!

Later in the day, Crow’s own Dr. Nicole Schmidt also presented her research on corpus-based pedagogy. In her presentation, “Lessons Learned: Reflecting on an Online Corpus-Based Pedagogy Workshop Series,” Nicole reflected on her experiences conducting a seven-week online corpus pedagogy workshop series for University of Arizona writing instructors. A number of these instructors chose to work with Crow to create activities for their first-year writing classrooms. The teachers especially appreciated that the Crow texts were representative of the texts that they assigned to their own students. Nicole also recently defended her doctoral dissertation. Congratulations!

Dr. Nicole Schmidt: Lessons Learned: Reflecting on an Online Corpus-Based Pedagogy Workshop

On the way back to their Tucson nest, the Crowbirds made a brief stop in Sedona to rest their wings and appreciate the beautiful and mystical landscape.

From left to right, Emily Palese, Dr. Shelley Staples, Aleksey Novikov, and David Marsh

Image: Concordance lines embedded from Crow’s corpus, used in a Literacy Narrative activity

As the spring season brings about renewal, Crow is excited to share our efforts in working closely with instructors to develop pedagogical materials. Dr. Shelley Staples’ CUES (Center for University Educational Scholarship) project began in Fall 2019 as a series of focus groups designed to create a conversation around what sorts of materials second language writing instructors would find useful, and how Crow could incorporate samples of student writing from our corpus into those materials.

Image: Frequency information on transition phrases in Literacy Narratives from the Crow corpus

We then used that feedback to create materials using data from our corpus. In later focus groups, our team (Nina Conrad, Emily Palese, Aleks Novikov, Kevin Sanchez, and Alantis Houpt) presented instructors with the materials we designed for their feedback and eventual implementation in their classes. Working alongside the instructors gave us insight into their needs and allowed them to have input on the types of activities they would be using. The second language writing instructors were given a variety of activities related to major assignments, which they could choose from accordingly and incorporate into their instruction. Researcher Nina Conrad attended classes to observe as instructors implemented the Crow-based materials in their classrooms.

Image: List of most frequent words used in Genre Analysis papers

Now that instructors have implemented two sets of the materials Crow has crafted, we are surveying students and instructors to ask for their feedback. We have been learning a lot from observing how participating instructors incorporate corpus-based materials into their existing curriculum. Moving forward, we plan to incorporate the feedback we receive from students and instructors into our future focus groups, during which we will continue to test and improve the corpus-based pedagogical materials we create. Some of the materials we have created can be found on our Crow for Teachers site, alongside workshops from past conferences such as AZTESOL.

The Crow team would like to congratulate Dr. Adriana Picoral, who recently defended her dissertation, “L3 Portuguese by Spanish-English bilinguals: Copula construction use and acquisition in corpus data.”

Dr. Adriana Picoral and her committee. From left: Mike Hammond, Shelley Staples, Adriana Picoral, Ana Carvalho, and Peter Ecke.