Corpus and Repository of Writing

Article: Corpus-informed instruction of reporting verbs

Crow team members Ji-young Shin, Ashley Velázquez, Ola Swatek, and Shelley Staples are pleased to announce the publication of an article with Purdue alum R. Scott Partridge. This is the first of several we expect to see coming out of our work with reporting verbs.

“Examining the Effectiveness of Corpus-Informed Instruction of Reporting Verbs in L2 First-Year College Writing” appears in L2 Journal, issue 10.3. Here’s the abstract:

Previous research has shown that developing second language (L2) academic writers use a limited set of reporting verbs in comparison to more advanced writers (Biber & Reppen, 1998; Hinkel, 2003; Kwon, Staples, & Partridge, 2018; Neff et al., 2003; Staples & Reppen, 2016). These writers also tend to rely on verbs that are typical for conversation (Biber et al., 1999). The present study examines the effects of corpus-informed instruction on developing L2 writers’ learning of reporting verbs in a first-year writing course by comparing drafts of literature reviews before and after a workshop. The forty-five-minute workshop was designed to improve L2 writers’ lexical and functional uses of reporting verbs using corpus-informed materials. The researchers compared the literature review drafts written by 40 students who participated in the workshop to 38 randomly chosen drafts from our corpus. The results show an increase in the experimental groups’ reporting verb lexical variety and a decrease in the use of verb types used in speech in favor of types used in academic writing. The results suggest that corpus-informed instruction may support L2 writers in the development of lexical and functional reporting verb use.

Well done, Ji-young, Ashley, Ola, Shelley, and Scott!

2018 Teaching and Language Corpora (TaLC) conference workshop

Researchers from the Crow team returned to the Teaching and Language Corpora (TaLC) conference, held in 2018 in Cambridge, England, to update the TaLC community on the progress we’ve made since our last appearance. Team members (Ola Swatek, Hadi Banat, and Shelley Staples) had previously attended the 12th TaLC conference in 2016 to present our plans for developing the Crow platform, and this year we were able to debut the working prototype. Approximately 25 conference delegates (as the local organizers called us) had the opportunity to attend our half-day workshop titled “Exploring variation and intertextuality in L2 undergraduate writing in English: Using the Corpus and Repository of Writing online platform for research and teaching.” The team guiding the workshop participants consisted of Adriana Picoral and Dr. Shelley Staples from the University of Arizona, alongside the following graduate students from Purdue University: Ji-young Shin, Aleksandra (Ola) Swatek, and Zhaozhe Wang. Mark Fullmer joined us virtually from Tucson, Arizona.

The purpose of the workshop was to test the Crow online interface with users who had never used it and to collect their feedback. This was the first time that researchers and teachers outside of Crow had a chance to try the Crow online platform. The workshop gave us the opportunity to get practical feedback from some of our target audience on the usability of the Crow platform before our public release.

As we were preparing for the workshop, we discussed how different the audience at this conference might be from audiences at conferences in the United States. The data for our project is currently collected from first-year writing courses at our U.S. institutions. These courses could be unfamiliar to European audiences, which made it particularly important to describe the purpose of these courses to the audience.

The workshop began with Dr. Staples’ introducing the project to the attendees, who were mostly unfamiliar with our data and the educational context from which the data comes.

The introduction was followed by hands-on activities, since we are strong believers in learning by doing. We scaffolded our exercises to move from simple to more advanced functions in order to promote step-by-step, spiral learning. Ola started the first activity by introducing the corpus search function, and Adriana, Zhaozhe, and Ji-young guided the other activities one by one. Each time we finished a part of the workshop, our team asked the participants to write down feedback related to a particular functionality of our interface. Gathering such feedback was one of the key reasons we decided to beta-test the interface with this particular group of users. The workshop attendees, most of whom are fairly used to navigating other corpus interfaces, are among the target users of our platform.

The main activities of the workshop included:

  • Introducing/practicing simple and advanced search and filter functions in the Crow corpus
  • Applying corpus functions for pedagogical purposes
  • Introducing/practicing search and filter functions in the Crow repository
  • Applying repository functions for pedagogical purposes, considering intertextuality between corpus and repository
  • Brainstorming research projects using the Crow platform
  • Introducing/practicing how to use coding tools and bookmark corpus searches

During our workshop, the attendees asked many important questions about the current and future design of the Crow platform. These questions gave us insights into what some of our users might be interested in seeing the platform do in the future. Of particular interest were the search functions and the possibility to download corpus data with rich metadata.

Workshop participants noted that the search engine was “easy to use,” “intuitive,” and “accessible.” This is promising feedback, considering that we want our tool to be used not only by researchers but by teachers as well. Audience members also indicated interest in the intertextuality opportunities offered by the link between repository and corresponding corpus materials for both research and pedagogical purposes. Most publicly available corpus tools don’t cater to teachers as an audience; though teachers could certainly benefit from a clear and accessible corpus of classroom materials, they don’t always have the training necessary to navigate research-based corpus tools. Attendees also reportedly enjoyed the design of the landing page, as well as the detailed nature of the repository and its interlinked documents.

We also gathered valuable feedback on what the audience would like to see improved in the platform. Suggestions included distinguishing the corpus and repository interfaces with different color schemes, improving the concordancing view, changing the font size and color, and improving the search function. Participants added many more comments in the survey we conducted, and we plan to use that feedback to make the interface as user-friendly as possible.

The TaLC experience was not only focused on academic interactions. The city of Cambridge has much to offer in terms of its beautiful scenery, rich history, and unique culture. Graduate students of our team went on a punting tour and got to learn the cultural and social history behind the most prominent colleges and their students at Cambridge University. It was a truly memorable experience.

Here’s a video prepared by the organizers showing what you missed if you didn’t attend the Teaching and Language Corpora conference.


Workshop 2: Thinking Like a Programmer  

In this series of posts, we reflect on the Methodology Workshop for Natural Language Programming, coordinated by Crow developer Mark Fullmer and hosted at Purdue.

In our first session of the workshop, our Crow team gained a basic understanding of how to approach coding. The principles we took away from the session gave us a solid theoretical framework on which to build practical coding skills. We’ll be learning Python, given its simplicity, its flexibility, and its suitability for the text processing that is part of Crow research.

Principle 1: Automate everything. Automating as much of our data processing as possible not only increases our efficiency and accuracy, it also gives us the ability to plug pre-programmed segments into future projects with minimal fuss or extra effort.

Principle 2: Separation of concerns. Like many things in life, programming is easier to do when broken down into small steps, each one performing a different function, like assembly lines in a factory. Step by step, we worked through an example to consider the process of writing code. First, we separated words into individual entities using a delimiter, such as a comma. Next, we ran the words through the computer system, issuing a frequency count for each word. Lastly, we displayed the list for frequency analysis. Splitting our code into separate “factories” provides two advantages: 1) The code can easily be recycled for future programs. It is much more efficient to tweak a small segment of code than to rewrite an entire program. 2) It’s easier to test the accuracy of our program when it’s broken up into short code segments. Simply modify your test when you want to reach a different result for that portion of code.
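The three steps described above can be sketched in Python roughly as follows. This is a minimal illustration rather than the code written in the workshop; the function names and the sample text are hypothetical.

```python
from collections import Counter


def split_words(text, delimiter=","):
    """Step 1: separate the text into individual words using a delimiter."""
    return [word.strip() for word in text.split(delimiter) if word.strip()]


def count_words(words):
    """Step 2: issue a frequency count for each word."""
    return Counter(words)


def format_frequencies(counts):
    """Step 3: build the display lines, most frequent word first."""
    return [f"{word}\t{count}" for word, count in counts.most_common()]


if __name__ == "__main__":
    # Hypothetical sample: a comma-delimited list of reporting verbs.
    sample = "argue, claim, say, argue, suggest, say, argue"
    for line in format_frequencies(count_words(split_words(sample))):
        print(line)
```

Because each “factory” is its own function, any one of them can be tested or replaced (for example, by splitting on whitespace instead of commas) without touching the other two.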

Principle 3: Don’t make assumptions. We learned that creating code on the assumption that the results we need now will be the same results we need tomorrow is a crucial mistake. For instance, hardwiring a text processor to remove all apostrophes will make it useless if down the road we need to analyze possessive nouns. Instead, it is better to create an optional “factory” that can be removed or upgraded to obtain the desired result. Also, we shouldn’t assume that a computer can read the text in its current format. Elements such as capitalization, punctuation, and character encoding are not read by computers the way we read them, and they often must be cleaned or normalized before a text can be analyzed.
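One way to keep a cleaning step optional, sketched here under our own assumptions (the function names are hypothetical and this is not the workshop's code), is to pass the list of steps in rather than hardwiring them:

```python
def lowercase(text):
    """Normalize capitalization."""
    return text.lower()


def remove_apostrophes(text):
    """Strip straight and curly apostrophes (an optional step)."""
    return text.replace("'", "").replace("\u2019", "")


def clean(text, steps):
    """Apply each cleaning step in order; the caller chooses the steps."""
    for step in steps:
        text = step(text)
    return text


if __name__ == "__main__":
    draft = "The Writer's Draft"
    # Today's analysis ignores apostrophes...
    print(clean(draft, [lowercase, remove_apostrophes]))
    # ...but a later study of possessive nouns can simply leave that step out.
    print(clean(draft, [lowercase]))
```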

Principle 4: Avoid hardcoding. When labeling our different “factories” we should leave room for flexibility or else the name won’t match the function when we make changes. We must maintain a balance between generalization and specificity.

Principle 5: Keep it simple, stupid. A hallmark of good Pythonic code is that the simplest methods are used even if it requires writing more code.

Principles 6 and 7: Convention over configuration, and Write your code for the next programmer. Following standardized coding formats will make our code more accessible to other programmers than if we personalize code to our own preference. To further help other programmers decipher our work, it is helpful to add inline comments that explain difficult aspects of our code and to follow the established syntactic conventions.

Principle 8: Don’t repeat ourselves. The goal with coding is to reuse and recycle. Instead of rewriting a slightly different version of the same code for multiple different programs, we should write the code once then modify it for different uses. Writing code is much like creating a résumé: build one, but tailor multiple drafts to different employers.

Principle 9: We don’t want to write new code unless it’s absolutely necessary. More than likely, someone else has already written the code we need, so there is no point in reinventing something we can borrow. This principle is the programmer’s version of “think smarter not harder.”

We concluded our session by discussing these principles and articulating them in other ways to see how well we understood them. After learning more about the coding process, Crow members felt confident to move forward into writing actual Python code. More on that in our next post!

Kickoff: Methodology Workshop at Purdue

The Crow team recently concluded a four-day workshop series, Methodology Workshop for Natural Language Programming. The workshops, led by Crow researcher and software developer Mark Fullmer, were designed to equip our team members with the fundamental coding and programming skills needed to construct our own programs and troubleshoot problems we encounter in existing scripts. By obtaining a functional knowledge of programming, we can meet our goals to make Crow sustainable and increase team member contribution to corpus- and interface-building tasks. At the end of the week, we expected all Crow members to (1) build a working vocabulary of coding terms; (2) progress past the introductory threshold of programming; and (3) better understand and articulate programming challenges we encounter as we integrate our corpus and repository.

Crow programmer Mark Fullmer presenting to researchers

Mark Fullmer opening the technical workshop

To maximize our learning and productivity during the rest of the week, Mark led the Crow team in an assessment of our current programming skills, identifying what threshold of competency each person wished to achieve by the conclusion of the workshops, and establishing a framework for researchers to form their own personal learning objectives. Mark gave us a checklist of coding tasks to measure against our current programming knowledge and to help us compare our progress against a list of definable expectations. Talking over the tasks we were already performing in Crow revealed the varying levels of coding experience among team members, and Mark encouraged us to pick and choose the workshops that we would find most useful. During our brainstorming, we created a running document listing the aspects of programming we found most difficult and the specific problems we had encountered. Crow researchers continually updated this document and others throughout the workshop, and we’ll be sharing them soon.

After evaluating our programming competence and articulating our short and long-term goals for the workshops, Mark gave us a preview of the week’s work. The three mantras for the rest of the week were: (1) text processing is recursive and will almost always require future modification; (2) code is an inherently disposable entity that we use to accomplish a specific task; (3) if it isn’t documented, then it doesn’t exist in code.

Participation by our collaborators at Arizona was facilitated by Google Hangouts on Air, a fabulous tool which also records videos of the workshops we can review, edit, and post online.

Over the next month or so, we’ll offer a series of posts which recap the workshop and help us think about ways to develop it into a resource which the Crow community can use as we work together to build the Crow web interface.

Promoting citation research at AAAL 2018

Researchers from the Crow team presented “Citation practices of L2 writers in first-year writing courses: form, function, and connection with pedagogical materials” at AAAL 2018. The presenters were Wendy Jie Gao, Lindsey Macdonald, Zhaozhe Wang, Adriana Picoral and Dr. Shelley Staples.

Crow Citation Team at AAAL 2018

Dr. Shelley Staples, Adriana Picoral, Lindsey Macdonald, Wendy Jie Gao, and Zhaozhe Wang


Citation practices and styles are integral to academic writing contexts. Previous research on citation use has focused on variability across citation form (e.g., integral/non-integral) and function (e.g., synthesis/summary) (Charles, 2006; Petric, 2007; Swales, 2014). However, most studies have focused on advanced L1 English student and professional writing. In addition, no studies to date have investigated the influence of instructor materials on students’ citation practices. Using a corpus of L2 writing, we examined (1) how the L2 writers’ citations vary in form and function across different assignments and instructors; (2) how students’ citation practices might be influenced by the pedagogical materials provided for each assignment.

Our corpus includes 74 papers (72,395 words) across two assignments, a literature review (LR) and a research paper (RP), from a first-year writing course for L2 writers. We calculated the number of citations and references in each assignment (per 1,000 words), and coded citations for integral, non-integral or hybrid (integral and non-integral) forms. We then coded citation functions based on Petric (2007) and qualitatively examined the relationship of the writing to pedagogical materials, such as the number of sources required and the form and function of citations in sample papers.
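The per-1,000-word normalization mentioned above lets papers of different lengths be compared on the same scale. A minimal sketch follows; the function name and the example numbers are hypothetical and are not taken from our data.

```python
def per_thousand_words(count, total_words):
    """Normalize a raw count to a rate per 1,000 words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count * 1000 / total_words


if __name__ == "__main__":
    # Hypothetical paper: 12 citations in a 1,500-word draft.
    print(per_thousand_words(12, 1500))
```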

Our preliminary results show that the writers most frequently use integral citations with little synthesizing function. While there is a large variation in the number of citations both within assignments and across instructors (LR: 3.54-9.73, RP: 5.25-7.05), the number of references is more consistent in the literature review (LR: 2.66-3.27, RP: 3.72-4.88). Students prefer non-quote citations to quotes. Integral citation is more frequently used in the literature review, while non-integral citation appears more in the research paper. The hybrid citation form appears consistently across almost all sections. These results might be attributed to instructors’ use of model literature review papers that almost exclusively feature integral citations, as well as explicit requirements (3 sources) in the assignment sheets. Attribution only is the largest category among the rhetorical functions of the citations. In addition, students’ awareness of establishing links between sources and making statements of use seems to have been influenced by sample papers. Our findings show the potential need for more instruction on the use of sources for synthesizing information, and the important influence of pedagogical materials.

Citation project conference handout (PDF).

Selected References

Charles, M. (2006). Phraseological patterns in reporting clauses used in citation: A corpus-based study of theses in two disciplines. English for Specific Purposes, 25(3), 310–331. doi:10.1016/j.esp.2005.053

Lee, J. J., Hitchcock, C., & Casal, J. E. (2018). Citation practices of L2 university students in first-year writing: Form, function and stance. English for Specific Purposes, 33, 1–11.

Petrić, B. (2007). Rhetorical functions of citations in high- and low-rated master’s theses. Journal of English for Academic Purposes, 6(3), 238–253.

Swales, J. (2014). Variation in citational practice in a corpus of student biology papers from parenthetical plonking to intertextual storytelling. Written Communication, 31(1), 118–141. doi:10.1177/0741088313515166

Friday Tech Talk on Word And Phrase

On February 23, 2018, members of the University of Arizona Corpus Lab, Dr. Shelley Staples and Adriana Picoral, held a Friday Tech Talk demonstrating the Word And Phrase application. The focus of these weekly talks, which are organized by the iSpace at the University of Arizona, is on eliciting conversations around different types of digital tools. The targeted tool for this workshop (Word And Phrase) pulls data from the BYU Corpora (in English, Spanish, and Portuguese), allowing users to search new and pre-existing texts and color-coding each word based on its frequency. The application sorts words into three frequency ranges based on their usage within the corpora: 1-500 (blue), 501-3000 (green), and >3000 (yellow).
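The three color bands can be illustrated with a short Python sketch. This only mirrors the ranges listed above; it is not how Word And Phrase itself is implemented, and the function name is hypothetical.

```python
def frequency_band(rank):
    """Map a word's frequency rank (1 = most frequent) to a color band."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    if rank <= 500:
        return "blue"
    if rank <= 3000:
        return "green"
    return "yellow"


if __name__ == "__main__":
    # Hypothetical ranks, chosen only to show each band.
    for rank in (1, 500, 501, 3000, 3001):
        print(rank, frequency_band(rank))
```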

(Academic text sample color-coded by word frequency.)

Word frequency can also be separated by genre: Spoken, Fiction, Magazine, Newspaper, and Academic. This feature allows instructors to illustrate to their students which types of speech appear in which genres; for example, the pronoun ‘I’ is found more frequently in the Spoken and Fiction genres, as opposed to Academic writing, where it is least likely to be used. The application identifies the part of speech, ranking, frequency, collocates, and synonyms for each word within the top 3000 frequency range. Word And Phrase allows students to explore when and how to use specific words or phrases based on information from the BYU corpora as well as other resources (such as Wordnet).

(Frequency of the pronoun ‘I’ across genres.)

(Ranking and frequency of the word ‘say’ as each possible PoS (Part of Speech).)

(Concordance lines of the word ‘tell’ as collocations, providing definition, PoS, and synonyms.)

Participants gave positive feedback on synonyms provided by the word search tool, where more and less frequent synonyms to the search word are displayed with some information on meaning variation provided. They also noted that with these tools, students are able to access the program on their own for autonomous learning.

Here’s our handout on using Word And Phrase.

For information on other Tech Talks organized by the iSpace at University of Arizona, please visit


Crow Workshop: Integrating AntConc into Teacher Curriculum

By Kelly Marshall and the AZ Crow Team

On February 17, 2018, the University of Arizona Corpus Lab hosted an introductory workshop on how to use AntConc at the 17th Annual SLAT Interdisciplinary Roundtable. The workshop was led by Adriana Picoral, Nicole Schmidt, Curtis Green, and Shelley Staples, with help from Kelly Marshall, Ali Yaylali, Nik Kirstein, and Yingliang Liu. For this workshop, we changed the layout of our last workshop to better fit the needs and purposes of the attendees at this conference. The first notable change was the use of two different corpora: Arizona Second Language Writing Corpus (ASLW) (part of Crow) and Spanish Learner Language Oral Corpora (SPLLOC). The components we used from the ASLW corpus included Narrative and Rhetorical Analysis student-written papers, while the components we used from the SPLLOC corpus were Modern Times Narrative and Photo Interview files. The goal of the workshop was to help instructors understand how to use AntConc, and how to integrate the application and results into their pedagogy. This was different from our last workshop presentation (given Nov. 21, 2017) where we focused exclusively on the ASLW (Crow) since our audience for that workshop was instructors in the UA Writing Program.

Other differences included the space the workshop was in as well as the activities. The workshop was hosted in one of the computer labs in the Modern Languages building. This room allowed for all workshop participants to interact, learn, and explore the AntConc program instead of having to share with another participant like last time. However, since the time slot was only an hour and fifteen minutes (rather than the hour and forty five minutes allotted last semester), we condensed the workshop by covering terms during the activities rather than presenting them at the beginning. The other aspect that was condensed was the number of activities participants completed, from five activities to three. This was also done to allow participants, like last time, to independently explore the program, interact with one another, and ask us questions they had after completing the activities.

Before the workshop, we ensured all computers had the AntConc application and the appropriate corpora files in Spanish and English were downloaded. This allowed us to save time and start the workshop promptly, without having to spend the first part of the session instructing participants how to download and access the files and program. This pre-workshop preparation process was necessary because we did not know who the participants were in advance (so we were unable to contact them with instructions on how to access the data). In the future, our corpus data will be more easily accessible through a website, which will facilitate this process.

During the workshop, participants were taught how to hide tags so that personal, instructor, and other course-related information included in the student papers was not displayed in the results. It should be noted that a potential problem with hiding tags is that the output will be limited in the concordance function. Although we did not introduce this issue at the beginning of the workshop, we showed participants how to solve this problem when we presented activities using the concordance function (i.e., unhide tags if more text is desired). The activities focused on instructing participants to search for specific words or N-Grams (contiguous sequences of words, e.g., 1-gram, 2-gram, 3-gram), and how to see these in a list, in the Word List function, or as key words in context (KWIC) in the Concordance function.

(KWIC concordance results with tags included.)

(KWIC concordance results with tags hidden.)

When searching in the concordance window, those in the workshop were taught how to select window size, and to search by frequency, range, or word.  Using the KWIC search shows the words 1, 2, or 3 places left or right of the key word. In addition, participants were taught how to search by prefixes and suffixes, or locate citations by searching “(*)”.
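For readers who want to see what a key-word-in-context view and a parenthesis search amount to, here is a rough Python sketch. It is only an analogy to what AntConc displays, not AntConc's own implementation; the function names and sample sentences are hypothetical.

```python
import re


def kwic(tokens, keyword, window=3):
    """Return (left context, key word, right context) for each match."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


def find_parentheticals(text):
    """Find parenthesized spans, similar in spirit to searching for "(*)"."""
    return re.findall(r"\([^()]*\)", text)


if __name__ == "__main__":
    tokens = "Smith says that writers say too much".split()
    for left, key, right in kwic(tokens, "say", window=2):
        print(f"{left:>20}  {key}  {right}")
    print(find_parentheticals("Writers vary (Swales, 2014) in form (Charles, 2006)."))
```

Note that the regular expression matches any parenthesized text, so the results would still need a manual check to separate citations from other parenthetical remarks.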

(N-Grams sorted by range to show the most common n-grams across all uploaded files.)
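The idea of an n-gram as a contiguous sequence of words can also be sketched in a few lines of Python. This is an illustration under our own assumptions (whitespace tokenization, a hypothetical sample sentence), not a description of how AntConc computes its lists.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    tokens = "the results show that the results vary".split()
    # Count 2-grams and list them from most to least frequent.
    for gram, count in Counter(ngrams(tokens, 2)).most_common():
        print(gram, count)
```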

While there were notable differences between the two workshops, both had the underlying goal of providing instructors a new approach to create materials and illustrate the pragmatic use of lexical items and grammar in order to show their students the contexts and patterns of words within a specific genre. Moreover, throughout both workshops, we asked participants questions and had a conversation with participants regarding how AntConc could be used to provide authentic writing examples and address common error patterns.

The workshop concluded with a discussion, first in small groups and then with the entire group, about how these methods translate into lessons. The teachers were given time to reflect on how they might use what they had learned in their own pedagogy.

Here’s our AntConc handout from the workshop.


Citation Study Update

Members of the interdisciplinary Crow team have been working on what we’ve been calling internally our “Citation Project” since the Summer of 2017. This name is our homage to The Citation Project conducted by Rebecca Moore Howard and Sandra Jamieson.

Wendy Jie Gao, Lindsey Macdonald, and Terrence Zhaozhe Wang videoconference with Shelley Staples and Adriana Picoral.

When the project research was first presented at Corpus Linguistics in 2017, it was titled “Variability in Citation Practices of Developing L2 writers in First-Year Writing Courses.” The purpose of the study can be stated as follows: “By examining L2 students’ citation practices in their assignments (Literature Review and Research Paper) for an introductory writing course, we explored their preference for particular citation styles and possible variance across assignments and instructors.”

Currently, our research focuses on what we’re calling citations and non-citations, as well as the various forms and functions of the citations students are using in two genres: literature reviews and argumentative essays. All of the documents used for the project are from the Purdue Crow Second Language Writing corpus, and a total of 132 papers and 147,000 words have been analyzed. We are examining many different styles of citations, including quote and non-quote, as well as integral and non-integral. An integral citation includes the author’s or article’s name in the sentence being cited. For a non-integral citation, the author’s or article’s stated name is in parentheses at the end of the sentence. A non-citation doesn’t explicitly state the name of the author or article.

Our findings revealed that students use more citations in a research paper than in a literature review and that they have a preference for integral citations, especially in a literature review. Most importantly, we discovered that students’ work is highly framed around the sample papers that instructors provide.

Our team plans to present this research on March 27 at the AAAL 2018 conference (9:10 to 9:40am, Arkansas Room). We hope to increase the number of documents in the project in order to expand the knowledge it can provide.

Spotlighting Crow Undergrad Interns

The Crow team is composed of scholars at many levels of academia and from many fields. Crow includes professors of writing, ESL, EAL, SLAT, and many other areas of English and language. Crow also includes three undergraduate interns, whose work with the team broadens their experience by introducing them to a collaborative work environment, research opportunities, and more! Each of the undergraduate interns became a part of Crow for different reasons and hopes to further their academic career through the experience gained here. Below, each intern explains how they first became involved with Crow and what experience they hope to gain from this internship opportunity.


Nik Kirstein: Nik Kirstein is a junior in Information Science. He first got interested in Crow after working with a corpus to analyze the Russian language. Crow helps Nik gain experience in text and data processing and has introduced him to corpus applications such as AntConc, all of which ties into information science very well. Nik hopes to gain more experience in data visualization and back-end database development with corpus data. He wants to work in the cybersecurity industry one day.


Blair Newton: Blair Newton is a senior in Professional Writing. She first heard about Crow from her Intro to Professional Writing professor, Dr. Michael Salvo. The internship opportunity appealed to her because it would expose her to a variety of experience that classes couldn’t offer. Blair does research, blog posts, grant writing, and graphic design for Crow. She hopes one day to combine writing and marketing as a career and eventually even write a novel.



Jessica Kukla: Jessica Kukla is a senior professional writing major on the editing and publishing track at Michigan State. While writing and editing are her forte, Jessica has a growing interest in technical writing and in information and experience architecture, which led her to work with Crow. She hopes to gain more experience with grant writing and working with corpus data. After MSU, Jessica hopes to pursue higher education in something along the lines of information architecture.


By Nik Kirstein, Blair Newton, & Jessica Kukla 

Arizona AntConc Workshop 2017

On November 10th, 2017, the University of Arizona Corpus Lab held its AntConc Workshop.  AntConc is an application that allows users to view useful information about a text such as the word frequency, placement of search term in the text, and more. The main goal of the workshop was to help instructors 1) to develop an understanding of how to use Crow and AntConc to address language awareness within their writing classroom, and 2) to understand the value of using students’ writing. Corpus data offers a new way to look at learning English as a second language. Using a corpus, instructors can see what common mistakes their students make or what patterns of language are more common in certain genres, and then create activities based around them. For example, when the instructors searched for parentheses, the locations of the citations within the papers could be seen in the concordance plot. This helped instructors see how their students were using citations and whether or not they were being used correctly.  Another idea during the workshop was comparing the use of the word “like” in written papers versus spoken English. The differences in writing and speech help us understand how these students are learning and understanding English. The workshop was a great success and the materials as well as a video cast will be available soon.

Photo from AntConc Workshop