Corpus and Repository of Writing

Building a corpus isn’t just a matter of collecting texts in a directory on a server. Crow team members are continually improving the code we use to process and de-identify contributed texts, building documentation to describe our approaches, and hosting workshops to help team members at our Arizona and Purdue sites become better corpus builders.

During Summer 2019, the Arizona team, led by Adriana Picoral, improved our existing corpus building scripts, and in some cases rewrote them, to add headers, organize metadata, and perform de-identification (de-id) using Pandas (a Python package for manipulating data). De-identifying texts is necessary to protect our participants’ privacy, and our process includes both machine- and human-performed removal of names and other potentially identifying information. This identifiable information is replaced in the text by tags such as <name>. Based on these changes, Picoral and Aleksey Novikov have added documentation on running the scripts on both Windows and Mac OS platforms.
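The header and metadata work can be sketched along these lines with Pandas. Note that the column names and header format below are illustrative assumptions for the sketch, not Crow’s actual schema:

```python
import pandas as pd

# Illustrative metadata table; the real scripts read institutional
# spreadsheets, and Crow's actual columns differ.
metadata = pd.DataFrame({
    "filename": ["essay_001.txt"],
    "institution": ["Arizona"],
    "course": ["ENGL 107"],
    "assignment": ["Literacy Narrative"],
})

def build_header(row):
    """Format one text's metadata as header tags to prepend to the file."""
    return (
        f"<Institution: {row['institution']}>\n"
        f"<Course: {row['course']}>\n"
        f"<Assignment: {row['assignment']}>\n"
    )

# Look up the metadata row for a given file and render its header.
row = metadata.set_index("filename").loc["essay_001.txt"]
header = build_header(row)
print(header)
```

In a batch run, the same lookup would loop over every text file in a directory, prepending each rendered header before the de-id pass.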

After each new script was ready, Picoral led a series of workshops helping the Arizona Crow team learn how to run these scripts, with Purdue researchers joining remotely. Most participants had not used the command line before, which made the workshops an enriching learning experience. Running the scripts on different computers and platforms also helped us identify and troubleshoot various issues, which, in turn, helped us update our documentation.

Through an iterative process of randomly selecting data for manual de-identification and logging the issues Crow researchers discovered as they de-identified texts, we added regular expression patterns to the de-id scripts to remove as many student and instructor names as possible. Regular expressions are special combinations of wildcards and other characters that perform the sophisticated matching we need to accurately de-identify texts with automated processes. We also decided to flatten all diacritics with the cleaning script, because names standardized to a smaller character set are easier to match.
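The combination of diacritic flattening and name matching can be sketched as follows. This is a simplified illustration with a toy name list, not the actual Crow de-id script:

```python
import re
import unicodedata

def flatten_diacritics(text):
    # Decompose accented characters and drop the combining marks,
    # so "Muñoz" becomes "Munoz" before name matching.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Toy name list for illustration; the real scripts draw on
# course and enrollment metadata.
KNOWN_NAMES = ["Munoz", "Picoral"]

def deidentify(text):
    text = flatten_diacritics(text)
    # One alternation pattern with word boundaries, so partial matches
    # inside other words are left alone.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, KNOWN_NAMES)) + r")\b")
    return pattern.sub("<name>", text)

print(deidentify("Dear Professor Muñoz, thanks!"))
```

Flattening first means a single pattern catches both “Muñoz” and “Munoz,” which is exactly the kind of alteration the iterative log-and-patch process surfaced.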

During the Fall semester, we have continued to improve our de-identification processes, using an interactive tool, also developed by Picoral. The tool highlights capitalized words in each text, making it easier to spot check for names that were not caught by the de-id script, such as students’ friends or family members whose names are not automatically included in our process. Each file from Fall 2017 and Spring 2018 was manually checked by the Arizona team that included Kevin Sanchez, Jhonatan Muñoz, Picoral, and Novikov. All in all, we processed 1,547 files, spending an average of 1.5 minutes checking each file.
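The capitalized-word heuristic at the heart of that spot-checking workflow can be sketched like this. This is a simplified stand-in, not Picoral’s actual implementation:

```python
import re

def candidate_names(text):
    """Flag capitalized words that are not sentence-initial; these are
    the spots a human checker should inspect for missed names."""
    flagged = []
    # Naive sentence split on terminal punctuation; good enough for a sketch.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = sentence.split()
        for word in words[1:]:  # skip the sentence-initial word
            bare = word.strip(".,;:!?\"'")
            if re.fullmatch(r"[A-Z][a-z]+", bare):
                flagged.append(bare)
    return flagged

print(candidate_names("My friend Sara visited me. She was kind."))
```

Skipping sentence-initial words keeps ordinary capitalization from flooding the checker with false positives, while names of friends and family members mid-sentence still get highlighted.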

Because we’ve developed it as a team, the de-identification tool is user-centered, allowing Crow researchers to more quickly and effectively find and redact names and other identifiable information. In the Crow corpus, these identifiers are replaced with tokens like <name>, <place>, and the like.

To increase the quality of de-identification for previously processed data, both the Arizona team, led by Novikov, and the Purdue team, led by Ge Lan, performed additional searches on files already in the Crow corpus, using regular expressions that allow for potential alterations of instructors’ names. An additional script then removed all standalone lines that contained just <name> tags and no other text. These files were updated in the corpus during our October 2019 interface update.
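That cleanup step can be sketched as follows; the real script’s exact pattern and tag set are assumptions here:

```python
import re

# A line counts as "tag-only" when it contains nothing but
# de-identification tags such as <name> and surrounding whitespace.
TAG_ONLY = re.compile(r"^\s*(?:<[a-z]+>\s*)+$")

def drop_tag_only_lines(text):
    """Remove lines left empty of content after de-identification."""
    kept = [line for line in text.splitlines()
            if not TAG_ONLY.match(line)]
    return "\n".join(kept)

sample = "<name> <name>\nThis draft cites <name> twice.\n"
print(drop_tag_only_lines(sample))
```

Lines that still mix tags with real text are kept, since only fully redacted lines (for example, a signature line) carry no information worth preserving.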

Screenshot of Crow de-identification tool showing a segment of text with de-identification targets highlighted and options for tokens to replace them visible.
De-identification tool developed by Crow researcher Adriana Picoral

Our next steps include replicating the processes for metadata processing and adding headers to text files at Purdue, which has a slightly different metadata acquisition process compared to Arizona given differences in the demographic data recorded by both institutions. We will also continue improving the interactive de-identification tool, so that it can eventually be released to a broader audience. Sharing our work in this manner not only helps other corpus builders, but gives us other sources of feedback which can help us keep building better tools for writing research.

Over the summer, researchers Ashley Velázquez, Michelle McMullin, and Bradley Dilger worked on a series of projects that grew out of Crow’s successful Humanities Without Walls (HWW) grant. As Crow wraps up our work with HWW, we have been able to reflect on the benefits and resources developed through our HWW Grad Lab Practicum.

The HWW grant has given Crow space to re-contextualize internal documents and processes in a format better suited to outward-facing audiences. In addition, the HWW closeout has allowed Crowbird researchers to pursue a series of projects and develop publication strategies for distributing accessible, approachable information.

These projects grew out of internal documents adjacent to the HWW grant, but are now being developed into standalone, outward-facing documentation other research teams can consult.

Best practices

Table of contents of the updated document detailing Crow best practices.
Crow best practices table of contents.

Out of this review and update of internal documentation came a revised set of Crow best practices, an initiative led by Velázquez. The development of Crow best practices dates back to 2016, but they needed updating and streamlining to reflect the new directions of Crow research and collaboration. We’ve learned that regularly reconsidering our best practices offers many benefits. Much of the existing best practice material was spread across various documents and locations; consolidating it into a single document creates an easily referenced, accessible guide for onboarding new Crow collaborators, which also serves as a reference for ongoing collaboration. Reviewing our existing best practices and reflecting on our development processes has also helped us realize that writing a Crow code of ethics will be useful as we build a community of Crow researchers and users.

Mentoring and Resources for Graduate Students

Arising from her experiences as an American Association of University Women (AAUW) fellow and the GLP’s focus on professional development and grant writing, Velázquez assembled resources to support Crow researchers in writing and applying for grants and fellowships. For now, this document is still focused on supporting Crow fellowship writers—particularly GLP team members—though material from it will be published for a general audience in the near future, helping future AAUW fellows benefit from Velázquez’s experiences.

White paper: Focus on the Practicum lab and a toolkit for building this model for other researchers

As a result of funding and research opportunities provided by HWW, Velázquez, McMullin, and Dilger are authoring a white paper focusing on iterative, collaborative work in the context of Crow’s sustainable, interdisciplinary research. This white paper will introduce elements of constructive distributed work and the model of the GLP to other researchers, sharing our approaches and continuing the conversation started with our 2019 Computers & Writing roundtable presentation.

On Friday, October 4th, the Arizona Crowbirds opened their nest for the public to visit. During the Open Lab, students, instructors, and staff were invited to explore the lab and interact with the University of Arizona’s research team. Visitors also had the opportunity to try Crow’s online interface, diving into the Corpus and Repository through a hands-on experience. They learned more about the different research and outreach projects the lab is involved in for both Crow and MACAWS (Multilingual Academic Corpus of Assignments–Writing and Speech), our cousin corpus.

Gathering of people standing in a small office. At left, Shelley Staples demonstrates the Crow interface.
Shelley Staples (far left), David Marsh (center), and Jhonatan Henao Muñoz (far right) chatting with open house attendees.

Thank you to everyone who attended our Open House! The Arizona Crowbirds look forward to continuing to share their progress and hard work with the community. To access the online interface, visit https://crow.corporaproject.org/.

Jhonatan Henao Muñoz guides an attendee through the Crow corpus web interface.
Jhonatan Henao Muñoz (right) offering a guided tour of the Crow web interface.

This blog post was written by Emily Palese.

Planning: Spring 2019

In between processing students’ texts for the corpus, Hadi Banat, Hannah Gill, Dr. Shelley Staples, and Emily Palese met regularly during Spring 2019 to strategize about expanding Crow’s repository. At the time, the repository had 68 pedagogical materials from Purdue University, but none from the University of Arizona and no direct connections between students’ corpus texts and the repository materials.

Hadi led our team’s exploration of how repository materials had been processed previously, including the challenges earlier processors faced, the solutions they found, and the questions that remained unresolved. With this context, we used Padlet to brainstorm how we might classify new materials and what information we’d like to collect from instructors when they share their pedagogical materials.

A section of our collaborative Padlet mindmap

Once we had a solid outline, we met with instructors and administrators from the University of Arizona’s Writing Program to pitch our ideas. Finally, with their feedback, we were able to design an online intake form with categories that would be helpful for Crow as well as instructors, administrators, and researchers.

Pilot & Processing Materials: Summer 2019

To pilot the online intake survey, we asked 8 UA instructors and administrators to let us observe and record their experiences as they uploaded their materials. This feedback helped us make some important immediate fixes and also helped us consider new approaches and modifications to the form. Another benefit of piloting the intake form is that we received additional materials that we could begin processing and adding to the repository.

Before processing any UA repository materials, Hannah and Emily first reflected on their experiences processing corpus texts and discussed the documents that had helped them navigate and manage those processing tasks. With those experiences in mind, they decided to begin two documents for their repository work: a processing guide and a corresponding task tracker.

Processing Guide: “How to Prepare Files for the Repository”

To create the processing guide, Hannah and Emily first added steps from Crow’s existing corpus guide (“How to Prepare Files for ASLW”) that would apply to repository processing. Using those steps as a backbone, they began processing a set of materials from one instructor together, taking careful notes of new steps they took and key decisions they made. 

At the end of each week, they revisited these notes and discussed any lingering questions with Dr. Staples. They then added in additional explanations, details, and examples so that the process could easily and consistently be followed by other Crowbirds in the future. 

The result was a 9-page processing guide with 12 discrete processing steps.

Task Tracker: “Repository Processing Checklist”

When they worked as a team processing corpus texts in Spring 2019, Hannah, Jhonatan, and Emily used a spreadsheet to track their progress and record notes about steps if needed. This was particularly helpful on days when they worked independently on tasks; the tracker helped keep all of the team members up-to-date on the team’s progress and aware of any issues that came up. 

With this in mind, Hannah and Emily created a similar task tracker for their repository processing work. The tracking checklist was developed alongside the processing guide so that it would have exactly the same steps. With identical steps, someone using the tracking checklist could refer to the processing guide if they had questions about how to complete a particular step. Once a step is completed, the Crowbird who finished the step initials the box, leaves a comment if needed, and then moves to the next step. 

Below is a screenshot of what part of the tracking checklist looks like.

Developing the processing guide and the corresponding checklist was an iterative process that often involved revisiting previously completed tasks to refine our approach. Eventually, though, the guide and checklist became clear, consistent, and sufficiently detailed to support processing a variety of pedagogical materials.

Success!

By the end of the summer, Hannah, Emily, and Dr. Staples successfully processed 236 new pedagogical materials from the University of Arizona and added them to Crow’s online interface. 

For the first time, Crow now has direct links between students’ texts in the corpus and pedagogical materials from their classes in the repository. This linkage presents exciting new opportunities for instructors, researchers, and administrators to begin large-scale investigations of the connections between students’ drafts and corresponding instructional materials!

The team is growing at the University of Arizona, as Crowbirds welcome visiting scholar David Marsh to the flock. He is an associate professor of English at the National Institute of Technology, Wakayama College, Japan, and is currently in Arizona for one year as a CESL/SLAT Visiting Scholar. David’s research interests are related to second language teaching and corpus analysis of technical/engineering English.

In his free time, David likes to play with his son and cook.

Welcome, David, we look forward to working more with you in the future!

Bradley Dilger and Hadi Banat attended the Council of Writing Program Administrators’ annual conference in Baltimore, Maryland, and conducted a workshop to introduce the Crow platform and its various uses to the CWPA audience. Participants explored multiple features of the Crow platform and reflected on its potential uses for their own research and writing programs. After Dr. Dilger introduced the Crow project, design practices, and the technical aspects of building and maintaining the interface, graduate dissertation fellow Hadi Banat discussed Crow’s adopted methods to collect corpus texts and repository pedagogical materials. Both Dilger and Banat led a guided tour of our web interface, provided ample time for hands-on exploration, and assisted workshop participants by answering queries during our extensive individual work time. Finally, participants reported on their experience interacting with the interface, provided feedback on their interface experience, and reflected on ways to utilize this resource in their own institutional contexts.

Banat describing our approach to collaboration

During our conversations with CWPA workshop participants, we discussed the following:

  • Multiple Word Handling feature (Contains any word or Contains all words) in corpus search and possible additional interface features 
  • Our GitHub tools related to processing and de-identifying student texts
  • Pedagogical material de-identification, ownership, and labor concerns
  • Usability of the repository materials and corpus texts for graduate student practicums
  • Coding multimodal digital projects and related repository backend work 
  • Open source platform, user permissions, and access to data 
  • Open source platform and access criteria pertaining to various user profiles
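The Multiple Word Handling feature mentioned above boils down to any-vs-all matching over a query’s word list. A minimal sketch of the idea (the real Crow interface queries a database, and this function is purely illustrative):

```python
def matches(text, words, mode="any"):
    """Return whether `text` satisfies a multi-word query.

    mode="any" mirrors "Contains any word"; mode="all" mirrors
    "Contains all words". Tokenization here is naive whitespace
    splitting, just to show the logic.
    """
    tokens = set(text.lower().split())
    hits = [w.lower() in tokens for w in words]
    return any(hits) if mode == "any" else all(hits)

essay = "However the results were mixed"
print(matches(essay, ["however", "therefore"], mode="any"))  # True
print(matches(essay, ["however", "therefore"], mode="all"))  # False
```

The any/all toggle matters for corpus research: “any” casts a wide net for a family of transition words, while “all” narrows results to texts that combine specific features.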

After our conversations, we invited workshop participants to share more feedback with us by filling out a survey feedback form. Inspired by user experience and usability practitioners, we treat outreach workshops and user feedback as instrumental to the continued development of our interface. Thanks to our ACLS extension grant, we were able to offer gift cards to participants who filled out the survey, another part of our outreach work to build a network of potential Crow contributors and researchers.

Dilger responding to workshop participant questions

In addition to the time we spent in sessions and networking with other scholars and peers, we made time to enjoy Baltimore’s scenic Inner Harbor and the city’s multicultural cuisines. Dilger went for sunrise runs before breakfast and conference talks, and Banat enjoyed sunset walks after long days at the conference. (At CWPA, breakfast starts at 6:45am!)

Baltimore’s inner harbor at sunset

At the end of the conference, CWPA organizers took us on a trip to the American Visionary Art Museum where we enjoyed snacks, desserts, and beer before we took a tour of the museum and admired its unique pieces. Through this social event, we also met new friends and had entertaining conversations outside the realm of academia.

Pieces from the American Visionary Art Museum

We hope to attend CWPA 2020 in Reno, Nevada and share our work with the writing program administration community again. 

Our Crow mascot enjoying the conference

Hadi Banat designed and delivered two Crow outreach workshops: one at Computers and Writing 2019 in East Lansing, Michigan, and one at the Council of Writing Program Administrators (CWPA) 2019 conference in Baltimore, Maryland. With the Transculturation team, he also presented the project’s pilot results at CWPA.

With Emily Palese and Shelley Staples, Banat submitted a proposal about Crow’s repository development for the TESOL Conference, to take place in Colorado in March 2020. He mentored undergraduate interns in the Transculturation Lab and coded dissertation data. His article “English in Lebanon: Policy Making, Education, and User Agency” was one of nine articles published in the special Middle East North African (MENA) issue of the journal World Englishes.

Bradley Dilger attended C&W 2019 in East Lansing, presenting with Hadi Banat and Michelle McMullin about Crow’s HWW grant and sharing the Crow interface with new audiences with Emily Jones. He presented with Banat at CWPA 2019 in Baltimore as well.

Wendy Gao has been developing four new test forms for the Purdue Language and Cultural Exchange program (PLaCE) since Summer 2018. After rounds of revisions and reorganization, all the new test forms are available for use; the new ACE-in test will be administered by PLaCE to more than 600 incoming undergraduate students at Purdue this fall.

Gao co-authored a book chapter with April Ginther, “L2 Speaking: Theory and Research,” submitting their latest draft in May. Her proposal to the Midwest Association of Language Testers (MWaLT) Conference 2019—“Concept Mapping for Guiding Rater Training in an ESL Elicited Imitation Assessment”—was accepted. The study examines rater disagreement in evaluating the listen-and-repeat sentence items for the ACE-in.

Ge Lan collaborated with two Purdue SLS students to publish an article in System, “Does L2 writing proficiency influence noun phrase complexity? A case analysis of argumentative essays written by Chinese students in a first-year composition course.” The data came from the PSLW corpus (an earlier version of the Crow corpus).

Jhonatan Henao Muñoz attended the ATISA Summer School in Translation and Interpreting and participated in a corpus reading group with Aleksey Novikov. He also participated in the National Humanities Center’s internship on Digital Humanities and Podcasting at San Diego State University.

Emily Palese and Hannah Morgan Gill piloted and improved the repository intake form that they developed with Banat last spring. They also collected and processed over 200 materials from UA instructors for the repository. Emily created a Processing Guide to help future Crowbirds follow our steps.

Adriana Picoral was selected as an Arizona Data Science Ambassador. She had two papers accepted to NWAV48 (New Ways of Analyzing Variation), the flagship sociolinguistics conference. Adriana also dedicated a lot of time to Crow, processing files for both ASLW and MACAWS and improving our processing methods. She’s most excited about working with Mark Fullmer and other Crowbirds to develop a tool that will improve Crow’s integration of corpus and repository.

Samantha Pate Rappuhn, a former Purdue undergraduate researcher, was hired as Grants and Academic Projects Coordinator for Ivy Tech Community College, working on the Kokomo, Indiana campus.

Ji-young Shin completed a summer internship in which she participated in multiple research projects. Her book review on response processes validity is now in print (Language Testing). A book chapter on stance features (Crow-data-based research presented at TaLC) is in review. She received a small external grant (a Language Learning Dissertation Grant). Lastly, she was invited to review for a couple of peer-reviewed journals, making her debut as an official reviewer.

Shelley Staples was accepted as a Center for Undergraduate Education Scholarship (CUES) Fellow, with $50,000 in funding to support the development and evaluation of corpus-based pedagogical materials for SLW classes at UA in Summer 2019. She had three journal articles accepted: one with Ge Lan for the Journal of Second Language Writing on complexity in SLW, one for Register Studies presenting a multidimensional analysis of healthcare communication, and one for Language Testing presenting a multidimensional analysis of L2 written assessment. A fourth paper is now in press for a Routledge book on triangulating research methods (corpus linguistics and assessment).

She presented on multimodality in Crow with Jeroen Gevers at the Computers and Writing conference, and gave three papers at the Corpus Linguistics conference in Cardiff, Wales: one on the multidimensional analysis of healthcare communication (the Register Studies article mentioned above), one on a multidimensional analysis of spoken oral assessment (see the book chapter above), and a third on L2 writing complexity in the British Academic Written English corpus (a follow-up to a 2016 Written Communication article).

Shelley also began her duties as Associate Director of Second Language Writing in the UA Writing Program and, with Chris Tardy, created versions of two L2 writing courses for the microcampus in Lima, Peru.

David Stucker worked as an intern for an industrial welding application firm in Greenfield, Indiana. He developed documentation for the operation and installation of an automated spark plug ground electrode welding system under contract from Federal-Mogul in France. He also developed a set of standardized manual templates and wrote a company style guide.

Ali Yayali was hired as an RA for the education extension of a USDA-funded environmental science project (SBAR) and started working with a secondary science teacher. He will design literacy and writing lessons and activities, working particularly with English Language Learners (ELLs) in science classrooms. This is an initial step toward solidifying his plans to focus on scientific writing by L2 writers, secondary school genres, and the language of science. Ali also helped the Arizona Crowbirds write a proposal for the AZTESOL Conference.

From June 20–22, our Crowbirds flocked to East Lansing for this year’s Computers & Writing conference hosted at Michigan State University by a team including Crow PI Bill Hart-Davidson.

Shelley Staples and graduate student Jeroen Gevers, both from the University of Arizona, presented on multimodal and multilingual composing in FYW courses by using data from Crow corpora. Dr. Staples and Gevers discussed a multimodal multilingual remediation project in ENGL 108, the last L2 writing course in the Foundations Writing sequence at UA. They shared their methods for coding multimodal assignments, which include the use of images, text, emojis, and more, and voiced the challenges they encountered in standardizing codes. They ended with a discussion, seeking recommendations for alternative practices that require less time and less intensive labor.

Bradley Dilger, Mark Fullmer, Emily Jones, Hadi Banat, and Michelle McMullin conducted a workshop to introduce the Crow platform and its various uses to the C&W audience. Participants explored multiple features of the platform and reflected on its potential uses for their own research and writing courses. After undergraduate researcher Jones introduced the Crow project and design practices, our brilliant developer Fullmer discussed the nitty-gritty technical aspects of building and maintaining the interface. Afterwards, Dr. Dilger and Banat led a guided tour of our web interface. Dr. McMullin assisted by answering queries during our extensive individual work time. Finally, participants reported on their experience interacting with the interface and reflected on ways to utilize this resource in their own institutional contexts.

Emily Jones introduces the Crow platform, with Mark Fullmer in the background via videoconference.

Dilger, Banat, and McMullin collaborated with the “Building Healthcare Collectives Team” on a roundtable which focused on research projects funded by Humanities Without Walls, and the outcomes of utilizing digital spaces and tools to build infrastructures necessary for successful collaboration among researchers and across institutions. Dr. Dilger discussed the models Crow PIs use for team building, and how Crow leaders developed collaborative writing practices, balanced individuals’ needs, and maximized professional development and team productivity. Dr. Dilger called for action, commenting on the responsibility of faculty to mentor graduate students on the skills they need to build research agendas, enter the job market, and pursue their prospective careers.

Dr. McMullin discussed the need to make teams a site for research, by interrogating practices within a collaborative community. Relying on her Crow experiences, she presented recommendations and practical tips that teams can use to create digital infrastructures and develop best practices which honor both accountability and flexibility.

Banat, Crow’s rising fifth year PhD candidate and a 2019–2020 Purdue Research Foundation Fellow, focused on performing interdisciplinarity through the transfer of research, team building, collaboration, and grant writing practices from Crow to the Transculturation in FYW research project. He highlighted the value of involvement in research teams for knowledge construction and expertise development. In his lightning talk, he outlined Crow’s grant writing strategy in detail, inviting the audience to use the same guidelines and practices at their own institutions. He emphasized the value of mentoring that research participation provides, drawing comparisons between the Humanities Lab Practicum which was a common part of our HWW projects, and the engineering research lab model. Despite the fact that this was one of the conference’s final sessions, the roundtable ended with lively conversation surrounding best practices for grant writing and team building.

As at every conference, the Crow team found time to make new friends and socialize with scholars from other institutions who are pursuing brilliant projects. Crow conference experiences are holistic and comprehensive, as we use this opportunity to reflect on our experiences and learn from them.

Afterglow at the Hart-Davidson compound

The dormitory accommodation was a unique experience, as our Crowbirds are used to staying in nearby hotels. The communal living made conversations with scholars, colleagues, and peers easier and smoother. We also enjoyed after-conference socials at East Lansing breweries, where we discussed types of beer, future Crow projects, and prospective career plans for Crow’s graduating students. At the end of the conference, co-host Bill Hart-Davidson invited us and other attendees to his house for snacks, laughs, and lively conversation. The real fun started when a group of conference presenters enthusiastically formed a band and played some (loud) jams. Before heading back to West Lafayette, we enjoyed a delicious vegan brunch at People’s Kitchen and reflected on our third (and hopefully not last!) time presenting at the C&W conference.

We’re happy to demonstrate the Crow system at CWPA 2019!

Thanks to our grant funding, we can offer attendees who complete this feedback form a $25 gift card! Fill out the form, then get in touch with Bradley Dilger before the end of the conference. (We’re in Baltimore until Sunday.)

Follow along as we demonstrate the Crow system, then offer everyone time to explore it on their own devices: Handout as Google Doc

We also invite you to try this demonstration version of our repository intake form — the way we collect texts from participating instructors.

Thank you for your interest! We welcome your questions.

This summer, we have the opportunity to continue sharing our interface at various conferences. We are excited to lead a mini-workshop (Session G, Sat 6/22, 2:00p, Riverside Room) at this year’s Computers & Writing conference, where we will discuss the technical and ethical processes for building our database and provide users time to explore our interface. Attending? Workshop materials are at the bottom of this page.

Exploring a web-based archive of writing and assignments

Our team has developed the first web-based archive that links a repository of pedagogical materials with a corpus of student texts written in response to those assignments in first-year composition courses. This workshop will allow participants to explore the features of our platform for their own research and writing courses. A guided tour of our web interface will be followed with extensive individual work time supported by researchers. Participants will learn to explore linguistic and rhetorical features of student writing, develop classroom activities or research plans, and explore other uses.

Shelley Staples attending a workshop at C&W 2016, sketching out the Crow platform’s connections between texts

Takeaways

After our workshop, participants will be able to:

  1. Use our platform to explore linguistic and rhetorical features of student writing;
  2. Develop classroom activities or research plans based on the corpus and repository data available through our platform;
  3. Discuss how information from our platform could be further developed for research and inform language teaching;
  4. Explore opportunities for managing data for programmatic use, such as assessment or professional development.

In addition to these main goals, participants will gain a general understanding of the data processing and development work required to sustain data-driven, web-based software like our platform. Interested? Keep reading for a full description of our workshop.