Corpus and Repository of Writing

Building a corpus isn’t just a matter of collecting texts in a directory on a server. Crow team members are continually improving the code we use to process and de-identify contributed texts, building documentation to describe our approaches, and hosting workshops to help team members at our Arizona and Purdue sites become better corpus builders.

During Summer 2019, the Arizona team, led by Adriana Picoral, improved our existing corpus building scripts, and in some cases rewrote them to add headers, organize metadata, and perform de-identification (de-id) using pandas, a Python package for manipulating tabular data. De-identifying texts is necessary to protect our participants’ privacy, and our process includes both machine- and human-performed removal of names and other potentially identifying information, which is replaced in the text by tags such as <name>. Based on these changes, Picoral and Aleksey Novikov have added documentation on running the scripts on both Windows and macOS.
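To give a concrete (if simplified) picture of that step, here is a minimal sketch of how name replacement might look using pandas and regular expressions. The file names, column names, and texts below are invented for illustration; our actual scripts handle many more cases.

```python
# Minimal, hypothetical sketch of the de-identification step: replace each
# participant's known name with a <name> tag, using pandas to keep texts and
# metadata together. Column names and data here are invented for illustration.
import re
import pandas as pd

df = pd.DataFrame({
    "filename": ["essay_001.txt", "essay_002.txt"],
    "student_name": ["Jane Doe", "Carlos Alvarez"],
    "text": [
        "My name is Jane Doe and this is my draft.",
        "Carlos Alvarez wrote this reflection.",
    ],
})

def deidentify(text, names):
    """Replace each known name (and each part of it) with a <name> tag."""
    for full_name in names:
        for part in [full_name] + full_name.split():
            text = re.sub(rf"\b{re.escape(part)}\b", "<name>", text)
    return text

df["deidentified"] = df.apply(
    lambda row: deidentify(row["text"], [row["student_name"]]), axis=1
)
print(df["deidentified"].tolist())
# -> ['My name is <name> and this is my draft.', '<name> wrote this reflection.']
```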

After each new script was ready, Picoral led a series of workshops to help the Arizona Crow team learn to run the scripts, with Purdue researchers joining remotely. Most participants had not used the command line before, which made the workshops an enriching learning experience. Running the scripts on different computers and platforms also helped us identify and troubleshoot a variety of issues, which in turn helped us update our documentation.

Through an iterative process of randomly selecting data for manual de-identification and logging the issues Crow researchers discovered as they worked, we added regular expression patterns to the de-id scripts to remove as many student and instructor names as possible. Regular expressions are special combinations of wildcards and other characters that perform the sophisticated matching we need to de-identify texts accurately with automated processes. We also decided to have the cleaning script flatten all diacritics, because it is easier to match names that have been standardized to a smaller character set.
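As an illustration only, and not our production code, the diacritic flattening can be pictured with Python’s standard unicodedata module, after which a single (hypothetical) pattern can match common variants of a name:

```python
# Illustrative sketch of diacritic flattening with Python's standard library:
# decompose accented characters and drop the combining marks so that, e.g.,
# "Muñoz" and "Munoz" can be matched by the same pattern. The name pattern
# below is hypothetical.
import re
import unicodedata

def flatten_diacritics(text):
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

instructor_pattern = re.compile(r"\b(?:Dr\.\s+)?Munoz\b", re.IGNORECASE)

sample = "Please email Dr. Muñoz with any questions."
flattened = flatten_diacritics(sample)
print(instructor_pattern.sub("<name>", flattened))
# -> "Please email <name> with any questions."
```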

During the Fall semester, we have continued to improve our de-identification processes using an interactive tool, also developed by Picoral. The tool highlights capitalized words in each text, making it easier to spot-check for names the de-id script did not catch, such as the names of students’ friends or family members, which are not automatically included in our process. Each file from Fall 2017 and Spring 2018 was manually checked by an Arizona team that included Kevin Sanchez, Jhonatan Muñoz, Picoral, and Novikov. All in all, we processed 1,547 files, spending an average of 1.5 minutes checking each file.

Because we’ve developed it as a team, the de-identification tool is user-centered, allowing Crow researchers to find and redact names and other identifiable information more quickly and effectively. In the Crow corpus, these identifiers are replaced with tokens such as <name> and <place>.
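A rough sketch of the spot-checking idea, not the tool’s actual implementation, would be to flag capitalized words that do not begin a sentence, since those are the most likely candidates for missed names:

```python
# Rough illustration of the spot-checking idea behind the tool (not its actual
# implementation): flag capitalized words that do not begin a sentence so a
# human reviewer can decide whether each one is a missed identifier.
import re

CAP_WORD = re.compile(r"\b[A-Z][a-z]+\b")

def flag_candidates(text):
    """Return capitalized words that do not start the text or a sentence."""
    flagged = []
    for match in CAP_WORD.finditer(text):
        before = text[:match.start()].rstrip()
        if not before or before.endswith((".", "!", "?")):
            continue  # sentence-initial capitals are usually not names
        flagged.append(match.group())
    return flagged

sample = "In this essay I interviewed <name> and her brother Miguel in Tucson."
print(flag_candidates(sample))  # -> ['Miguel', 'Tucson']
```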

To improve the quality of de-identification for previously processed data, both the Arizona team, led by Novikov, and the Purdue team, led by Ge Lan, performed additional searches on files already in the Crow corpus, using regular expressions that account for possible variations of instructors’ names. An additional script then removed all standalone lines that contained only <name> tags and no other text. These files were updated in the corpus during our October 2019 interface update.
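That standalone-line cleanup can be pictured as a simple filter. The sketch below is hypothetical rather than the script we ran:

```python
# Hypothetical sketch of the cleanup pass described above: drop lines whose
# only content is <name> tags (plus whitespace or stray punctuation), since
# they carry no student text after de-identification.
import re

ONLY_TAGS = re.compile(r"^\s*(?:<name>[\s,;.]*)+$")

def drop_tag_only_lines(lines):
    """Keep only lines that contain something besides <name> tags."""
    return [line for line in lines if not ONLY_TAGS.match(line)]

sample = ["Dear <name>,\n", "<name> <name>\n", "Thanks for your feedback.\n"]
print(drop_tag_only_lines(sample))
# -> ['Dear <name>,\n', 'Thanks for your feedback.\n']
```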

Screenshot of Crow de-identification tool showing a segment of text with de-identification targets highlighted and options for tokens to replace them visible.
De-identification tool developed by Crow researcher Adriana Picoral

Our next steps include replicating our metadata processing and header-adding steps at Purdue, which has a slightly different metadata acquisition process than Arizona, given differences in the demographic data the two institutions record. We will also continue improving the interactive de-identification tool so that it can eventually be released to a broader audience. Sharing our work in this way not only helps other corpus builders but also gives us additional sources of feedback, which helps us keep building better tools for writing research.

Over the summer, researchers Ashley Velázquez, Michelle McMullin, and Bradley Dilger worked on a series of projects that grew out of the Humanities Without Walls (HWW) grant Crow was awarded. As Crow wraps up its work with HWW, we have been able to reflect on the benefits and resources developed through our HWW Grad Lab Practicum (GLP).

The HWW grant has given Crow space to re-contextualize internal documents and processes in a format better suited to outward-facing audiences. In addition, HWW closeout has allowed Crowbird researchers to pursue a series of projects and develop publication strategies for distributing accessible, approachable information.

These projects grew out of internal documents adjacent to the HWW grant, but are now being developed into standalone, outward-facing documentation other research teams can consult.

Best practices

Table of contents of the updated document detailing Crow best practices.
Crow best practices table of contents.

Our review and updates of internal documentation produced a set of updated Crow best practices, an initiative led by Velázquez. The development of Crow best practices dates back to 2016, but they needed to be updated and streamlined to reflect the new directions of Crow research and collaboration. We’ve learned that regularly revisiting our best practices offers many benefits. Much of the existing best practice material was spread across various documents and locations; consolidating it into a single document provides an easily referenced, accessible guide for onboarding new Crow collaborators, as well as a reference for ongoing collaboration. Reviewing our existing best practices and reflecting on our development processes also helped us realize that writing a Crow code of ethics will be useful as we build a community of Crow researchers and users.

Mentoring and Resources for Graduate Students

Arising from her experiences as an American Association of University Women (AAUW) fellow and the GLP’s focus on professional development and grant writing, Velázquez assembled resources to support Crow researchers in writing and applying for grants and fellowships. For now, this document is focused on supporting Crow fellowship writers, particularly GLP team members, though material from it will be published for a general audience in the near future, helping future AAUW fellows benefit from Velázquez’s experiences.

White paper: Focus on the Practicum lab and a toolkit for building this model for other researchers

As a result of funding and research opportunities provided by HWW, Velázquez, McMullin, and Dilger are authoring a white paper focusing on iterative, collaborative work in the context of Crow’s sustainable, interdisciplinary research. This white paper will introduce elements of constructive distributed work and the model of the GLP to other researchers, sharing our approaches and continuing the conversation started with our 2019 Computers & Writing roundtable presentation.

On Friday, October 4th, the Arizona Crowbirds opened their nest for the public to visit. During the Open Lab, students, instructors, and staff were invited to explore the lab and interact with the University of Arizona’s research team. Visitors also had the opportunity to try Crow’s online interface, diving into the Corpus and Repository through a hands-on experience. They learned more about the different research and outreach projects the lab is involved in for both Crow and MACAWS (Multilingual Academic Corpus of Assignments–Writing and Speech), our cousin corpus.

Gathering of people standing in a small office. At left, Shelley Staples demonstrates the Crow interface.
Shelley Staples (far left), David Marsh (center), and Jhonatan Henao Muñoz (far right) chatting with open house attendees.

Thank you to everyone who attended our Open House! The Arizona Crowbirds look forward to continuing to share their progress and hard work with the community. To access the online interface, visit https://crow.corporaproject.org/.

Jhonatan Henao Muñoz guides an attendee through the Crow corpus web interface.
Jhonatan Henao Muñoz (right) offering a guided tour of the Crow web interface.

This blog post was written by Emily Palese.

Planning: Spring 2019

In between processing students’ texts for the corpus, Hadi Banat, Hannah Gill, Dr. Shelley Staples and Emily Palese met regularly during Spring 2019 to strategize about expanding Crow’s repository. At the time, the repository had 68 pedagogical materials from Purdue University, but none from the University of Arizona and no direct connections between students’ corpus texts and the repository materials. 

Hadi led our team’s exploration of how repository materials had been processed previously, including the challenges the team faced, the solutions they found, and the questions that remained unresolved. With this context, we used Padlet to brainstorm how we might classify new materials and what information we’d like to collect from instructors when they share their pedagogical materials.

A section of our collaborative Padlet mindmap

Once we had a solid outline, we met with instructors and administrators from the University of Arizona’s Writing Program to pitch our ideas. Finally, with their feedback, we were able to design an online intake form with categories that would be helpful for Crow as well as instructors, administrators, and researchers.

Pilot & Processing Materials: Summer 2019

To pilot the online intake survey, we asked 8 UA instructors and administrators to let us observe and record their experiences as they uploaded their materials. This feedback helped us make some important immediate fixes and also helped us consider new approaches and modifications to the form. Another benefit of piloting the intake form is that we received additional materials that we could begin processing and adding to the repository.

Before processing any UA repository materials, Hannah and Emily first reflected on their experiences processing corpus texts and discussed the documents that had helped them navigate and manage those processing tasks. With those experiences in mind, they decided to begin two documents for their repository work: a processing guide and a corresponding task tracker.

Processing Guide: “How to Prepare Files for the Repository”

To create the processing guide, Hannah and Emily first added steps from Crow’s existing corpus guide (“How to Prepare Files for ASLW”) that would apply to repository processing. Using those steps as a backbone, they began processing a set of materials from one instructor together, taking careful notes of new steps they took and key decisions they made. 

At the end of each week, they revisited these notes and discussed any lingering questions with Dr. Staples. They then added in additional explanations, details, and examples so that the process could easily and consistently be followed by other Crowbirds in the future. 

The result was a 9-page processing guide with 12 discrete processing steps:

Task Tracker: “Repository Processing Checklist”

When they worked as a team processing corpus texts in Spring 2019, Hannah, Jhonatan, and Emily used a spreadsheet to track their progress and record notes about steps if needed. This was particularly helpful on days when they worked independently on tasks; the tracker helped keep all of the team members up-to-date on the team’s progress and aware of any issues that came up. 

With this in mind, Hannah and Emily created a similar task tracker for their repository processing work. The tracking checklist was developed alongside the processing guide so that it would have exactly the same steps. With identical steps, someone using the tracking checklist could refer to the processing guide if they had questions about how to complete a particular step. Once a step is completed, the Crowbird who finished the step initials the box, leaves a comment if needed, and then moves to the next step. 

Below is a screenshot of what part of the tracking checklist looks like.

Developing the processing guide and the corresponding checklist was an iterative process that often involved revisiting previously completed tasks to refine our approach. Eventually, though, the guide and checklist became clear, consistent, and sufficiently detailed to support processing a variety of pedagogical materials.

Success!

By the end of the summer, Hannah, Emily, and Dr. Staples successfully processed 236 new pedagogical materials from the University of Arizona and added them to Crow’s online interface. 

For the first time, Crow now has direct links between students’ texts in the corpus and pedagogical materials from their classes in the repository. This linkage presents exciting new opportunities for instructors, researchers, and administrators to begin large-scale investigations of the connections between students’ drafts and corresponding instructional materials!

The team is growing at the University of Arizona, as Crowbirds welcome visiting scholar David Marsh to the flock. He is an associate professor of English at the National Institute of Technology, Wakayama College, Japan, and is currently in Arizona for one year as a CESL/SLAT Visiting Scholar. David’s research interests are related to second language teaching and corpus analysis of technical/engineering English.

In his free time, David likes to play with his son and cook.

Welcome, David, we look forward to working more with you in the future!

Bradley Dilger and Hadi Banat attended the Council of Writing Program Administrators’ annual conference in Baltimore, Maryland, where they conducted a workshop introducing the Crow platform and its various uses to the CWPA audience. Participants explored multiple features of the Crow platform and reflected on its potential uses for their own research and writing programs. After Dr. Dilger introduced the Crow project, our design practices, and the technical aspects of building and maintaining the interface, graduate dissertation fellow Hadi Banat discussed the methods Crow has adopted to collect corpus texts and pedagogical materials for the repository. Dilger and Banat then led a guided tour of our web interface, provided ample time for hands-on exploration, and assisted workshop participants by answering questions during our extensive individual work time. Finally, participants reported on their experience interacting with the interface, provided feedback, and reflected on ways to use this resource in their own institutional contexts.

Banat describing our approach to collaboration

During our conversations with CWPA workshop participants, we discussed the following:

  • The Multiple Word Handling feature (Contains any word or Contains all words) in corpus search and possible additional interface features 
  • Our GitHub tools related to processing and de-identifying student texts
  • Pedagogical material de-identification, ownership, and labor concerns
  • Usability of the repository materials and corpus texts for graduate student practicums
  • Coding multimodal digital projects and related repository backend work 
  • Open source platform, user permissions, and access to data 
  • Open source platform and access criteria pertaining to various user profiles

After our conversations, we invited workshop participants to share more feedback with us by filling out a survey. Inspired by user experience and usability practitioners, we treat outreach workshops and user feedback as instrumental to the continued development of our interface. Thanks to our ACLS extension grant, we were able to offer gift cards to participants who completed the survey, another part of our outreach work to build a network of potential Crow contributors and researchers.

Dilger responding to workshop participant questions

In addition to the time we spent in sessions and networking with other scholars and peers, we did not forget to enjoy Baltimore’s scenic inner harbor and the city’s multicultural cuisines. Dilger went for sunrise runs before breakfast and conference talks, and Banat enjoyed sunset walks after long days at the conference. (At CWPA, breakfast starts at 6:45am!) 

Baltimore’s inner harbor at sunset

At the end of the conference, CWPA organizers took us on a trip to the American Visionary Art Museum where we enjoyed snacks, desserts, and beer before we took a tour of the museum and admired its unique pieces. Through this social event, we also met new friends and had entertaining conversations outside the realm of academia.

Pieces from the American Visionary Art Museum

We hope to attend CWPA 2020 in Reno, Nevada and share our work with the writing program administration community again. 

Our Crow mascot enjoying the conference

From June 20–22, our Crowbirds flocked to East Lansing for this year’s Computers & Writing conference hosted at Michigan State University by a team including Crow PI Bill Hart-Davidson.

Shelley Staples and graduate student Jeroen Gevers, both from the University of Arizona, presented on multimodal and multilingual composing in FYW courses, using data from Crow corpora. Dr. Staples and Gevers discussed a multimodal, multilingual remediation project in ENGL 108, the last L2 writing course in the Foundations Writing sequence at UA. They shared their methods for coding multimodal assignments, which include the use of images, text, emojis, and more, and described the challenges they encountered in standardizing codes. They ended with a discussion, seeking recommendations for alternative practices that require less time and less intensive labor.

Bradley Dilger, Mark Fullmer, Emily Jones, Hadi Banat, and Michelle McMullin conducted a workshop to introduce the Crow platform and its various uses to the C&W audience. Participants explored multiple features of the platform and reflected on its potential uses for their own research and writing courses. After undergraduate researcher Jones introduced the Crow project and design practices, our brilliant developer Fullmer discussed the nitty-gritty technical aspects of building and maintaining the interface. Afterwards, Dr. Dilger and Banat led a guided tour of our web interface. Dr. McMullin assisted by answering queries during our extensive individual work time. Finally, participants reported on their experience interacting with the interface and reflected on ways to utilize this resource in their own institutional contexts.

Emily Jones introduces the Crow platform, with Mark Fullmer in the background via videoconference.

Dilger, Banat, and McMullin collaborated with the “Building Healthcare Collectives Team” on a roundtable focused on research projects funded by Humanities Without Walls and on the outcomes of using digital spaces and tools to build the infrastructure necessary for successful collaboration among researchers and across institutions. Dr. Dilger discussed the models Crow PIs use for team building and how Crow leaders developed collaborative writing practices, balanced individuals’ needs, and maximized professional development and team productivity. Dr. Dilger also called for action, commenting on the responsibility of faculty to mentor graduate students in the skills they need to build research agendas, enter the job market, and pursue their prospective careers.

Dr. McMullin discussed the need to make teams a site for research, by interrogating practices within a collaborative community. Relying on her Crow experiences, she presented recommendations and practical tips that teams can use to create digital infrastructures and develop best practices which honor both accountability and flexibility.

Banat, Crow’s rising fifth-year PhD candidate and a 2019–2020 Purdue Research Foundation Fellow, focused on performing interdisciplinarity through the transfer of research, team-building, collaboration, and grant writing practices from Crow to the Transculturation in FYW research project. He highlighted the value of involvement in research teams for knowledge construction and expertise development. In his lightning talk, he outlined Crow’s grant writing strategy in detail, inviting the audience to use the same guidelines and practices at their own institutions. He emphasized the mentoring value that research participation provides, drawing comparisons between the Humanities Lab Practicum, a common element of our HWW projects, and the engineering research lab model. Despite this being one of the conference’s final sessions, the roundtable ended with lively conversation about best practices for grant writing and team building.

As at every conference, the Crow team found time to make new friends and socialize with scholars from other institutions who are pursuing brilliant projects. Crow conference experiences are holistic and comprehensive, as we use this opportunity to reflect on our experiences and learn from them.

Afterglow at the Hart-Davidson compound

The dormitory accommodation was a unique experience, as our Crowbirds are used to staying in nearby hotels. The communal living made conversations with scholars, colleagues, and peers easier and smoother. We also enjoyed after-conference socials at East Lansing breweries, where we discussed types of beer, future Crow projects, and prospective career plans for Crow’s graduating students. At the end of the conference, co-host Bill Hart-Davidson invited us and other attendees to his house for snacks, laughs, and lively conversation. The real fun started when a group of conference presenters enthusiastically formed a band and played some (loud) jams. Before heading back to West Lafayette, we enjoyed a delicious vegan brunch at People’s Kitchen and reflected on our third (and hopefully not last!) time presenting at the C&W conference.

We’re happy to demonstrate the Crow system at CWPA 2019!

Thanks to our grant funding, we can offer attendees who complete this feedback form a $25 gift card! Fill out the form, then get in touch with Bradley Dilger before the end of the conference. (We’re in Baltimore until Sunday.)

Follow along as we demonstrate the Crow system, then offer everyone time to explore it on their own devices: Handout as Google Doc

We also invite you to try this demonstration version of our repository intake form — the way we collect texts from participating instructors.

Thank you for your interest! We welcome your questions.

This summer, we have the opportunity to continue sharing our interface at various conferences. We are excited to lead a mini-workshop (Session G, Sat 6/22, 2:00p, Riverside Room) at this year’s Computers & Writing conference, where we will discuss the technical and ethical processes for building our database and provide users time to explore our interface. Attending? Workshop materials are at the bottom of this page.

Exploring a web-based archive of writing and assignments

Our team has developed the first web-based archive that links a repository of pedagogical materials with a corpus of student texts written in response to those assignments in first-year composition courses. This workshop will allow participants to explore the features of our platform for their own research and writing courses. A guided tour of our web interface will be followed with extensive individual work time supported by researchers. Participants will learn to explore linguistic and rhetorical features of student writing, develop classroom activities or research plans, and explore other uses.

Shelley Staples attending a workshop at C&W 2016, sketching out the Crow platform’s connections between texts

Takeaways

After our workshop, participants will be able to:

  1. Use our platform to explore linguistic and rhetorical features of student writing;
  2. Develop classroom activities or research plans based on the corpus and repository data available through our platform;
  3. Discuss how information from our platform could be further developed for research and used to inform language teaching;
  4. Explore opportunities for managing data for programmatic use, such as assessment or professional development.

In addition to these main goals, participants will gain a general understanding of the data processing and development required to sustain data-driven, web-based software like our platform. Interested? Keep reading for a full description of our workshop.


We are very excited to announce that the Crow team has been awarded an American Council of Learned Societies (ACLS) Digital Extension grant in the amount of $150,000. Congratulations to our Crow team, and in particular to Shelley Staples, Ashley Velázquez, Hadi Banat, Bradley Dilger, Ali Yaylali, Aleksey Novikov, and Adriana Picoral. These Crowbirds contributed extensively to developing our application. We also wish to thank those at the University of Arizona who supported our grant writing and submission: Kim Patton (Research, Discovery, & Innovation), Beth E. Stahmer (Social and Behavioral Science Research Institute), and Jane Zavisca (Associate Dean for Research, College of Social and Behavioral Sciences).

ACLS Digital Extension grants support digital research projects in humanities and the humanistic social sciences. According to John Paul Christy, director of public programs at ACLS, “This year’s awardees share a commitment to the kinds of community building – across disciplines, institutions, languages and cultures – that strengthen the enterprise of the digital humanities.” The Crow team is thrilled to be one of the first writing research projects funded by ACLS (if not the first one).

Our project, “Expanding the Corpus and Repository of Writing: An Archive of Multilingual Writing in English,” will run for three semesters, from July 2019 until December 2020. Key personnel on the grant include Staples (PI) and Dilger (Co-PI), research assistants at Arizona (Novikov, Picoral, Yaylali) and Purdue (Lan, Gao), as well as undergraduate research assistants at Purdue. We will also continue to work with our amazing developer, Mark Fullmer.

This grant will allow our team to advance research in several areas. First, it will help us expand our data collection of multilingual writers to a new population of heritage Spanish writers at the University of Arizona (a newly designated Hispanic-Serving Institution). Second, we will be able to automate some of our intertextuality research by creating a new computational tool. Our final goal for this project is to offer extensive outreach to researchers, teacher-researchers, and developers. To reach this goal, we plan to conduct multiple training workshops for teachers and researchers on using the Crow platform, train teacher-researchers to add their own texts to the platform, and train developers to use the API for their own projects. ACLS support will enable us to offer support and incentives to these educators.

Visualization of extension of Crow supported by grant: reaching new audiences in new ways.

Thanks again to everyone who was involved, in different capacities, in the various steps of the grant application. We remain grateful to our current funders, the Humanities Without Walls Consortium, and to our institutions: Purdue University, the University of Arizona, and Michigan State University. We are very happy to continue expanding Crow with their ongoing support.