Corpus and Repository of Writing

We were in Rochester, NY, for Computers & Writing 2016. We attended the computational rhetorics workshop facilitated by Ryan Omizo and Bill Hart-Davidson, and presented in session D2, “Boundary Work: Designing a Composition Archive for Research and Mentoring Across Disciplines.” The session took place Friday, 5/20, from 4:30 to 5:45pm, in Nursing 102.

We described our approach to developing Crow in five short talks:

  • Shelley Staples introduced our team and shared our project goals.
  • For those C&W attendees not familiar with corpus linguistics, Terrence Wang offered an introduction.
  • Ashley Velázquez, reading for Lindsey Macdonald, outlined some of the pedagogical rationale for Crow and described some possibilities.
  • Michelle McMullin described how our approach to infrastructure draws on scholarship in professional communication.
  • Finally, Bradley Dilger concluded our panel by saying more about our approach to sustainable collaboration.

Here’s our session handout and slide deck. Thanks to those who attended!

We have more to say about the conference in another post.


Finals week had just begun here at Purdue when the Crow team gathered in Heavilon Hall to kick off our summer projects. We met for some early-morning sweets and some much-needed coffee to get our brains working before diving into our work. Team members were assigned various tasks and dispersed. After touching base with one another to ensure that everyone was on the same page, we dedicated the bulk of our time to de-identifying previously collected data.

Crow team members de-identifying texts. Though some of us probably could have used a bit more coffee.

Crow is built on the Purdue Corpus of Second Language Writing (PSLW), which is a collection of student-produced documents from the ENGL 106i courses here at Purdue. Before these documents can be uploaded into the corpus, they must be de-identified. So we split up into groups, and each group tackled a set of documents. We reviewed each document and redacted any information that could lead to the identification of the writer, including names, locations such as hometowns or residence halls, specific course names, and specific professor names. Rather than just deleting the identifying word or words, we replaced each one with the category of the redacted information in angle brackets. For instance, a name such as “Jordan” is replaced with “<name>”. This prevents any confusion that missing words may cause.
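To illustrate the replacement convention described above, here is a minimal Python sketch. It is not the Crow team's actual tooling, since the review described in this post was done by hand. The function name `deidentify`, the term-to-category mapping, and the sample sentence (including the made-up "Example Hall") are all hypothetical and exist only for illustration.

```python
import re


def deidentify(text, identifiers):
    """Replace each known identifying term with a <category> placeholder.

    `identifiers` maps an identifying term to its category label. This is a
    hypothetical structure for this sketch, not a format used by Crow.
    """
    # Handle longer terms first so a multi-word term is replaced before
    # any shorter term that happens to appear inside it.
    for term in sorted(identifiers, key=len, reverse=True):
        placeholder = "<" + identifiers[term] + ">"
        # \b keeps "Jordan" from matching inside a longer word such as "Jordanian".
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        # A function replacement inserts the placeholder literally.
        text = pattern.sub(lambda match: placeholder, text)
    return text


# Hypothetical terms a reviewer might have flagged in one document.
identifiers = {
    "Jordan": "name",
    "Example Hall": "location",
}

sample = "Jordan moved into Example Hall last fall."
print(deidentify(sample, identifiers))
# prints: <name> moved into <location> last fall.
```

A lookup like this can only replace terms someone has already flagged, so it would complement a human read-through rather than replace it.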

De-identifying the documents, though tedious and mind-numbing, is an important step in our process. At this point, we want to look for themes that spread across multiple documents, not focus on individual documents. For that purpose, the specific, identifying details that writers may have included in their assignments are irrelevant. We also want to avoid introducing bias based on preexisting knowledge of who wrote any of the documents we are examining.

Even though we have a lot left on our to-do list, we are excited to dive in and get to work on our summer projects, and we are looking forward to the progress Crow will make during the upcoming months! We’ll be presenting at Computers & Writing 2016, and we have a lot of prototyping and design work planned. Time to get some more coffee!


We’ll be presenting the following panel at TALC in Giessen, Germany, in July 2016.

Developing a First Year Composition L2 Writing Corpus and Repository

A number of student academic writing corpora (e.g., ICLE, MICUSP, BAWE) have been developed in the past few decades, showing the interest in and importance of representing this domain of language use. These corpora have been used for countless research studies, as illustrated by the extensive bibliography on the CECL and LCA websites.

Our project, the Purdue Second Language Writing corpus (PSLW), builds on this base but aims to represent the writing produced by first year international students in the U.S. in composition courses. Such courses are provided at virtually every university in the U.S., but to date no large-scale corpus projects representing this writing have been completed. Our corpus currently includes 4,012 texts (3,472,260 words) representing 5 different genres (literacy narrative, proposal, annotated bibliography, interview report, and argumentative essay), and we are currently processing a comparable number of texts to be available by Summer 2016. The corpus contains three drafts of each assignment. The samples are annotated with writers’ TOEFL scores, nationality, and gender, among other characteristics.

Importantly, the corpus is part of a larger interdisciplinary project called CROW (Corpus and Repository of Writing), a collaboration among students and faculty from both applied/corpus linguistics and composition studies. Two main features of this larger project are the development of an online interface where scholars can eventually submit their own texts, and the inclusion of pedagogical artifacts that accompany the production of the texts, including syllabi, assignment sheets, pre-writing readings, and schema building activities. Providing these additional materials sheds light on how the texts in the corpus are developed and shaped by these instructor-designed texts. We believe that such efforts are an important way to advance corpus linguistic and language teaching research.

Our presentation will focus on two strands: the methodology for developing this new kind of corpus project, and research that has been conducted using our corpus. In terms of methodology, we will briefly cover our corpus compilation process, but focus more on the interdisciplinary practices used to guide the development of the online platform and the integration of corpus texts and artifacts. We will discuss several best practices from usability design, including 1) the development of persona scenarios (e.g., novice international graduate student instructor) and 2) environmental scans of corpus and repository websites (e.g., MICUSP, COCA, and Pedagogy Toolkit).

A number of research projects have been conducted using the PSLW corpus. We will report on the findings of one of these studies, which investigated the use of reporting verbs in students’ literature reviews. Using a framework drawing on the work of Francis, Hunston, and Manning (1996), Charles (2006), and Friginal (2013), the study showed that although L2 writers in the corpus used many verbs in the semantic categories of argue and show, mostly for textual attribution, they also employed more think verbs than advanced L1 student writers, particularly for making general statements or expressing their own opinions. After discussing our research findings, we will end the presentation by offering implications of our project for corpus development and research in general.


Swatek, A., Banat, H., & Staples, S. (2016, July). Developing First Year Composition L2 Writing Corpus: Research, Pedagogy and Teacher Training. Presentation at the 12th Teaching and Language Corpora Conference. Giessen, Germany.


Charles, M. (2006). Phraseological patterns in reporting clauses used in citation: A corpus-based study of theses in two disciplines. English for Specific Purposes, 25(3), 310–331. doi:10.1016/j.esp.2005.05.003

Francis, G., Hunston, S., & Manning, E. (Eds.). (1996). Collins COBUILD Grammar Patterns 1: Verbs. Amsterdam: John Benjamins Publishing Company.

Friginal, E. (2013). Developing research report writing skills using corpora. English for Specific Purposes, 32(4), 208–220. doi:10.1016/j.esp.2013.06.001



In March 2017, three conferences that Crow researchers are very interested in will be held consecutively in the Pacific Northwest. (Four if you count ATTW!) We’re excited about the opportunity to attend, present (we hope), and participate in workshops and in other ways. Earlier this week, we submitted two proposals for CCCC 2017. We’ve included summaries below.

Hope to see you in Portland and Seattle!

Cultivating Writing Research via Corpus and Computational Collaboration

Bill Hart-Davidson & Ryan Omizo will join Shelley Staples and Lindsey Macdonald for this panel. Here’s the opening statement:

In March 2017, CCCC will be joined in Portland by AAAL, the conference of the American Association for Applied Linguistics. We take this opportunity to highlight the value of collaboration between researchers who will be attending one, but likely not both, of these conferences, and unfortunately, crossing paths in few ways. The corpus linguistics methods common in applied linguistics can bring quantitative elements to empirical research in rhetoric and composition, including attention to demographic issues and diverse genres. Rhetorical research, conversely, offers corpus researchers valuable insights into extra-textual features and contextual influences. This panel explores possibilities for collaborative writing research by demonstrating the value of this interdisciplinary work. We offer an overview of the benefits of corpus and computational methods, then present case studies of two projects which integrate computational methods and corpus linguistics with rhetoric and composition. We conclude with a brief panel discussion of takeaways for interdisciplinary collaboration, then invite conversation.

Promoting RAD Writing Research through Inter-Institutional Collaboration

Michelle McMullin, Terrence Wang, and Bradley Dilger proposed this session. Here are some excerpts from the proposal:

Empirical research in composition and rhetoric has become more common. Diverse research projects investigate all areas of the field, including writing transfer, undergraduate writing majors, and the literacies of working class and underrepresented minorities. But scholar-teachers at all levels still struggle to implement lessons from published research at their own institutions, and to explain the relevance of research to administrators…. In this presentation, we describe how research designed as inter-institutional from its inception has embedded attention to diverse research outcomes, the development of sustainable infrastructures, and the lifecycle model of scalable user-centered development. Our project brings the methods of corpus linguistics to rhetoric and composition, and vice-versa, creating a web-based archive for research and professional development. By embedding an interdisciplinary approach to collaboration from the start, we have developed a project that considers the strengths and contributions of each partner for an effective collaboration model that best serves the needs of all stakeholders.


At the end of our first academic year, the Crowbirds got together at Bradley’s house for a picnic, barbeque, and conversation. Madelyn and Amelia decorated, everyone brought wonderful food, and we had a great time — as you can see!


We are very proud of the progress we’ve made this year. A lot of our team members are traveling, and we’ll miss them. We look forward to a productive summer.
