Corpus and Repository of Writing

Terrence, Michelle, Shelley, and Bradley had an excellent week at Computers & Writing 2016 in Rochester. We got to explore the interdisciplinary nature of the Crow project through the workshop we attended, our presentation, and several other interesting panels. Lots of good thinking about the relationship between corpus linguistics, pedagogy, mentoring, and building a sustainable archive.

Our conference began with the Ride2CW celebration at the Tap and Mallet — great food, good beer, and smart conversations already starting. The next morning, Bradley and Bill Hart-Davidson rode along the Erie Canal, which was just two miles from host St. John Fisher College. Yay, Ride2CW!

The four of us attended Ryan Omizo and Hart-Davidson’s workshop on computational rhetoric, where we could start imagining what data representations might look like for Crow. We developed some great questions about our data structure and the multiple users for whom we are designing.

We attended a variety of sessions which were interesting and relevant to our project. In A5, we heard Naomi Silver and others from the Sweetland Center for Writing talk about their collaborative processes. We liked seeing what Erin Trauth, Joe Moxley, and Norbert Elliot were doing with MyReviewers data on an NSF-funded project, and we’ll definitely be following up with them. We’re hoping to make it to Writing Analytics, Data Mining and Student Success in January 2017.

Session G3, which featured Ben Miller, Jason Palmeri, and Ben McCorkle, offered an in-depth look at two projects: Palmeri and McCorkle’s ongoing investigation of English Journal, which goes back 100 years, and Miller’s work with rhetcomp dissertations. Excellent as presented and in the Twitter backchannel.

Our talk was session D2. We were pleased by the attendance and the conversation which followed. Michelle built a Storify which features Nick Carbone’s live-tweeting (thanks, Nick!) and some of the questions, too:

  • Hart-Davidson asked what our minimum value proposition will be: what will provide short term results as we build Crow from the ground up? We agreed it’s PSLW, which is already helping us publish results in journals and at conferences.
  • Elliot suggested working with N-grams, or strings of words that may perform certain rhetorical functions (e.g., according to the; the first article).
  • Cheryl Ball asked to hear more about our “deidentification parties” and our methods for digital collaboration. Yay, Basecamp!
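Elliot’s N-gram suggestion is easy to try out. Here is a minimal sketch of counting N-grams across a handful of texts, using only the Python standard library. The function names, the simple tokenizer, and the two sample sentences are our own illustrative assumptions, not part of any existing Crow tooling:

```python
from collections import Counter
import re


def ngrams(text, n):
    """Return the n-grams in text as tuples of lowercase words."""
    # Very simple tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n, k=5):
    """Count n-grams across several texts and return the k most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text, n))
    return counts.most_common(k)


# Made-up sample sentences, not drawn from the corpus.
sample = [
    "According to the first article, writing improves with feedback.",
    "According to the author, the first article is persuasive.",
]

for gram, count in top_ngrams(sample, 3, k=3):
    print(" ".join(gram), count)
```

On a real corpus, a proper tokenizer and a frequency threshold would matter much more than they do in this toy version.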

From the repeated names here, we realized there aren’t too many people working in computational rhetorics, nerd data crunching, or whatever you want to call our corner of the field. That’s probably the reason we heard, in the panels we attended, at least as many references to scholars in digital humanities but outside rhetoric and composition. There are just not enough voices inside the field. We’re particularly happy to note that Crow will add a few more women to the mix.

Driving back, we debriefed and finalized our summer plans. Shelley, Terrence, and Michelle worked in Basecamp and Google Docs while Bradley drove, and it took almost seven hours for the four of us to talk through our conference experiences. With that work done, and about two hours of driving left, we started getting a little chirpy. Then we saw on Twitter that some conference-goers were still in the airport. And we realized there were strong positives to driving!

Next year, the conference will be June 1–4 at the University of Findlay in Ohio, less than four hours away. So we’ll probably have a Crow team there again. If Bradley trains enough, it’s only a two-day bike ride…


“The Design and Research Potential of Crow for Language Research and Teaching”

by Sherri Craig and Jie (Wendy) Gao

The 2016 Purdue Languages and Cultures Conference (PLCC) was the first time the School of Languages and Cultures partnered with the Second Language Studies graduate program to host an interdisciplinary three-day conference. This unique structure offered a perfect opportunity for Crow to have its inaugural presentation, titled “The Design and Research Potential of Crow for Language Research and Teaching” and presented by Sherri Craig and Jie (Wendy) Gao.
Listed in the conference program as part of a corpus linguistics panel, the presentation focused on answering a few questions: What is a corpus? What is Crow? What are the previous research projects and future research opportunities related to Crow? How is the Crow project progressing?

At the date of the presentation, March 6, 2016, Crow was still in its beginning stages. Therefore, much of the presentation offered a preliminary introduction to the whole project and reported the work the team had completed so far, including the environmental scans and the persona and scenario design work. During the PLCC presentation, Sherri and Wendy described Crow’s ties to three previous projects rooted in Purdue’s Second Language Studies and Rhetoric and Composition programs: COIN, PSLW, and the 2014-15 ICaP Assessment. Each of these previous projects contained elements of Crow’s new goals. COIN, a now-defunct program, attempted to gather pedagogical materials for an online repository. PSLW is an active corpus of texts from second language writers containing over 3.4 million words. And the 2014-15 ICaP Assessment, led by Dr. Jennifer Bay, began to evaluate the pedagogical needs of writing instructors by gathering student texts and teaching materials. Building on the strengths of these earlier projects, Crow was designed to bring the interests of the SLS program and RC program together to develop an online repository and corpus for a broader audience.

After the overview and related projects, Sherri and Wendy discussed the environmental scans performed on MICUSP and Sketch Engine, then the four personas that inspire the user design.

Overall, the PLCC presentation went off without a hitch. During Q&A, the audience members were very interested in how to make better use of corpora in the future. One listener even asked if they could use Crow for their own work and courses. Others asked quite a lot of technical questions about the design of the future site and the development of the project and corpus. With the help of Dr. Staples and Dr. Dilger in the audience, all the questions were answered and excitement for Crow spread. Everyone in attendance, including Sherri and Wendy, was strongly motivated to see how this project will develop as progress continues.


We’re in Rochester, NY for Computers & Writing 2016. We attended the computational rhetorics workshop facilitated by Ryan Omizo and Bill Hart-Davidson, and presented in session D2, “Boundary Work: Designing a Composition Archive for Research and Mentoring Across Disciplines.” That’s Friday, 5/20, 4:30 to 5:45pm, in Nursing 102.

We described our approach to developing Crow in five short talks:

  • Shelley Staples introduced our team and shared our project goals.
  • For those C&W attendees not familiar with corpus linguistics, Terrence Wang offered an introduction.
  • Ashley Velázquez, reading for Lindsey Macdonald, outlined some of the pedagogical rationale for Crow and described some possibilities.
  • Michelle McMullin described how our approach to infrastructure draws on scholarship in professional communication.
  • Finally, Bradley Dilger concluded our panel by saying more about our approach to sustainable collaboration.

Here’s our session handout and slide deck. Thanks to those who attended!

We have more to say about the conference in another post.


Finals week had just begun here at Purdue when the Crow team gathered in Heavilon Hall to kick off our summer projects. We met for some early morning sweets and some very much needed coffee to get our brains working before diving into our work. The team was assigned various tasks and dispersed. After touching base with other team members to ensure that everyone was on the same page, the bulk of the work was dedicated to de-identifying previously collected data.  

Crow team members de-identifying texts, though some of us probably could have used a bit more coffee.

Crow is built on the Purdue Corpus of Second Language Writing (PSLW), which is a collection of student-produced documents from the ENGL 106i courses here at Purdue. Before these documents are uploaded into the corpus, they must be de-identified. So, we split up into groups, and each group tackled a set of documents. We reviewed each document and redacted any information that could lead to the identification of the writer, including any names, locations such as hometowns or residence halls, specific course names, and specific professor names. Rather than just deleting the identifying word or words, we replaced each one with the category of the redacted information in angle brackets. For instance, a name such as “Jordan” is replaced with “<name>”. This prevents any confusion that missing words may cause.
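To illustrate the replacement convention, here is a minimal sketch of how a first pass might be scripted. This is not how we actually worked: our de-identification was done by hand, document by document. The term lists, the function name, and every category label other than “<name>” are hypothetical, and a script like this could only supplement human review, since it catches just the terms it is given:

```python
import re

# Hypothetical lists of identifying terms, grouped by category.
# In practice a reviewer would find these by reading each document.
REDACTIONS = {
    "name": ["Jordan", "Dr. Smith"],
    "location": ["Springfield"],
    "course": ["ENGL 106i"],
}


def deidentify(text, redactions=REDACTIONS):
    """Replace each listed term with its category in angle brackets."""
    for category, terms in redactions.items():
        # Handle longer terms first so they are not split by shorter ones.
        for term in sorted(terms, key=len, reverse=True):
            # Match the term only when it is not part of a longer word.
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            text = re.sub(pattern, f"<{category}>", text)
    return text


print(deidentify("Jordan took ENGL 106i after moving from Springfield."))
```

Keeping the category in the placeholder, rather than deleting the word, leaves the sentence readable and preserves the word count.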

De-identifying the documents, though tedious and mind-numbing, is an important step in our process. At this point, we want to look for themes that spread across multiple documents, not focus on certain documents individually. For that purpose, the specific, identifying details that writers may have included in their assignments are irrelevant. We also want to ensure that we are not introducing bias based on preexisting knowledge of who wrote any of the documents we are examining.

Even though we have a lot left on our to-do list, we are excited to dive in and get to work on our summer projects, and we are looking forward to the progress Crow will make during the upcoming months! We’ll be presenting at Computers & Writing 2016, and we have a lot of prototyping and design work planned. Time to get some more coffee!


We’ll be presenting the following panel at TALC in Giessen, Germany, in July 2016.

Developing a First Year Composition L2 Writing Corpus and Repository

A number of student academic writing corpora (e.g., ICLE, MICUSP, BAWE) have been developed in the past few decades, showing the interest in and importance of representing this domain of language use. These corpora have been used for countless research studies, as illustrated by the extensive bibliography on the CECL and LCA websites.

Our project, the Purdue Second Language Writing corpus (PSLW), builds on this base but aims to represent the writing produced by first year international students in the U.S. in composition courses. Such courses are provided at virtually every university in the U.S., but to date no large-scale projects have been completed. Our corpus currently includes 4,012 texts (3,472,260 words) representing 5 different genres (literacy narrative, proposal, annotated bibliography, interview report and argumentative essay), and we are currently processing a comparable number of texts to be available by Summer 2016. The corpus contains three drafts of each assignment. The samples are annotated with writers’ TOEFL scores, nationality, and gender, among other characteristics.

Importantly, the corpus is part of a larger interdisciplinary project that represents a collaboration among students and faculty from both applied/corpus linguistics and composition studies, called CROW (Corpus and Repository of Writing). Two main features of this larger project include the development of an online interface where scholars can eventually submit their own texts, and the inclusion of pedagogical artifacts that accompany the production of the texts, including syllabi, assignment sheets, pre-writing readings, and schema building activities.  Providing these additional materials sheds light on how the texts in the corpus are developed and shaped by these instructor-designed texts. We believe that such efforts are an important way to advance corpus linguistic and language teaching research.

Our presentation will focus on two strands: the methodology for developing this new kind of corpus project, and research that has been conducted using our corpus. In terms of methodology, we will briefly cover our corpus compilation process, but focus more on the interdisciplinary practices used to guide the development of the online platform and integration of corpus texts and artifacts. We will provide a discussion of several best practices from usability design: 1) the development of persona scenarios (e.g., novice international graduate student instructor); 2) environmental scans of corpus and repository websites (e.g., MICUSP, COCA and Pedagogy Toolkit).

A number of research projects have been conducted using the PSLW corpus. We will report on the findings of one of these studies, which investigated the use of reporting verbs in students’ literature reviews. Using a framework drawing on the work of Francis, Hunston, and Manning (1996), Charles (2006), and Friginal (2013), the study showed that although L2 writers in the corpus used many verbs in the semantic categories of argue and show, mostly for textual attribution, they also employed more think verbs than advanced L1 student writers, particularly for making general statements or to express their own opinions. After discussing our research findings, we will end the presentation by offering implications of our project for corpus development and research in general.


Swatek, A., Banat, H., Staples, S. (2016, July). Developing First Year Composition L2 Writing Corpus: Research, Pedagogy and Teacher Training. Presentation at the 12th Teaching and Language Corpora Conference. Giessen, Germany.


Charles, M. (2006). Phraseological patterns in reporting clauses used in citation: A corpus-based study of theses in two disciplines. English for Specific Purposes 25(3). 310–331. doi:10.1016/j.esp.2005.05.003. Retrieved from 

Francis, G., Hunston, S., &  Manning, E. (Eds.). (1996). Collins COBUILD Grammar Patterns 1: Verbs. Amsterdam: John Benjamins Publishing Company.

Friginal, E. (2013). Developing research report writing skills using corpora. English for Specific Purposes 32(4). 208–220. doi:10.1016/j.esp.2013.06.001. Retrieved from 



In March 2017, three conferences Crow researchers are very interested in will be held consecutively in the Pacific Northwest. (Four if you count ATTW!) We’re excited about the opportunity to attend, present (we hope), and participate in workshops and other ways. Earlier this week, we submitted two proposals for CCCC 2017. We’ve included summaries below.

Hope to see you in Portland and Seattle!

Cultivating Writing Research via Corpus and Computational Collaboration

Bill Hart-Davidson & Ryan Omizo will join Shelley Staples and Lindsey Macdonald for this panel. Here’s the opening statement:

In March 2017, CCCC will be joined in Portland by AAAL, the conference of the American Association for Applied Linguistics. We take this opportunity to highlight the value of collaboration between researchers who will be attending one, but likely not both, of these conferences, and unfortunately, crossing paths in few ways. The corpus linguistics methods common in applied linguistics can bring quantitative elements to empirical research in rhetoric and composition, including attention to demographic issues and diverse genres. Rhetorical research, conversely, offers corpus researchers valuable insights into extra-textual features and contextual influences. This panel explores possibilities for collaborative writing research by demonstrating the value of this interdisciplinary work. We offer an overview of the benefits of corpus and computational methods, then present case studies of two projects which integrate computational methods and corpus linguistics with rhetoric and composition. We conclude with a brief panel discussion of takeaways for interdisciplinary collaboration, then invite conversation.

Promoting RAD Writing Research through Inter-Institutional Collaboration

Michelle McMullin, Terrence Wang, and Bradley Dilger proposed this session. Here are some excerpts from the proposal:

Empirical research in composition and rhetoric has become more common. Diverse research projects investigate all areas of the field, including writing transfer, undergraduate writing majors, and the literacies of working class and underrepresented minorities. But scholar-teachers at all levels still struggle to implement lessons from published research at their own institutions, and to explain the relevance of research to administrators…. In this presentation, we describe how research designed as inter-institutional from its inception has embedded attention to diverse research outcomes, the development of sustainable infrastructures, and the lifecycle model of scalable user-centered development. Our project brings the methods of corpus linguistics to rhetoric and composition, and vice-versa, creating a web-based archive for research and professional development. By embedding an interdisciplinary approach to collaboration from the start, we have developed a project that considers the strengths and contributions of each partner for an effective collaboration model that best serves the needs of all stakeholders.


At the end of our first academic year, the Crowbirds got together at Bradley’s house for a picnic, barbeque, and conversation. Madelyn and Amelia decorated, everyone brought wonderful food, and we had a great time — as you can see!


We are very proud of the progress we’ve made this year. A lot of our team members are traveling, and we’ll miss them. We look forward to a productive summer.
