Corpus and Repository of Writing

Researchers from the Crow team returned to the 2018 Teaching and Language Corpora (TaLC) conference in Cambridge, England, to update the TaLC community on the progress we’ve made since our last appearance. Team members (Ola Swatek, Hadi Banat, and Shelley Staples) had previously attended the 12th TaLC conference in 2016 to present our plans for developing the Crow platform, and this year we were able to debut the working prototype. Approximately 25 conference delegates (as the local organizers called us) had the opportunity to attend our half-day workshop titled “Exploring variation and intertextuality in L2 undergraduate writing in English: Using the Corpus and Repository of Writing online platform for research and teaching.” The team guiding the workshop participants consisted of Adriana Picoral and Dr. Shelley Staples from the University of Arizona, alongside the following graduate students from Purdue University: Ji-young Shin, Aleksandra (Ola) Swatek, and Zhaozhe Wang. Mark Fullmer joined us virtually from Tucson, Arizona.

The purpose of the workshop was to test the Crow online interface with first-time users and to collect their feedback. This was the first time that researchers and teachers outside of Crow had a chance to try the Crow online platform, and it gave us practical feedback from some of our target audience on the usability of the platform before our public release.

As we were preparing for the workshop, we discussed how different the audience at this conference might be from audiences at conferences in the United States. The data for our project is currently collected from first-year writing courses at our U.S. institutions. These courses may be unfamiliar to European audiences, which made it particularly important to describe their purpose to the attendees.

The workshop began with Dr. Staples introducing the project to the attendees, most of whom were unfamiliar with our data and the educational context from which it comes.

The introduction was followed by hands-on activities, since we are strong believers in a learning-by-doing approach. We scaffolded our exercises to move from simple to more advanced functions, promoting step-by-step, spiral learning. Ola started the first activity by introducing the corpus search function, and Adriana, Zhaozhe, and Ji-young guided the other activities one by one. After each part of the workshop, our team asked the participants to write down feedback on that particular functionality of our interface. Such feedback was one of the key reasons we decided to beta-test the interface with this particular group of users: the workshop attendees, most of whom were accustomed to navigating other corpus interfaces, represent one of the target user groups for our platform.

The main activities of the workshop included:

    • Introducing/practicing simple and advanced search and filter functions in the Crow corpus
    • Applying corpus functions for pedagogical purposes
    • Introducing/practicing search and filter functions in the Crow repository
    • Applying repository functions for pedagogical purposes, considering intertextuality between corpus and repository
    • Brainstorming research projects using the Crow platform
    • Introducing/practicing coding tools and bookmarking corpus searches

During our workshop, the attendees asked many important questions about the current and future design of the Crow platform. These questions gave us insights into what some of our users might be interested in seeing the platform do in the future. Of particular interest were the search functions and the possibility to download corpus data with rich metadata.

Workshop participants noted that the search engine was “easy to use,” “intuitive,” and “accessible”—promising feedback, considering that we want our tool to be used not only by researchers but by teachers as well. Audience members also indicated interest in the intertextuality opportunities offered by the link between repository materials and the corresponding corpus texts, for both research and pedagogical purposes. Most publicly available corpus tools don’t cater to teachers as an audience; though teachers could certainly benefit from a clear and accessible corpus of classroom materials, they don’t always have the training necessary to navigate research-based corpus tools. Attendees also enjoyed the design of the landing page, as well as the detailed nature of the repository and its interlinked documents.

We also gathered valuable feedback on what the audience would like to see improved in the platform. Suggestions included distinguishing the corpus and repository interfaces with a color scheme, improving the concordancing view, changing the font size and color, and refining the search function. Many more comments were added in the survey we conducted, and we plan to use that knowledge to make the interface as user-friendly as possible.

The TaLC experience was not only focused on academic interactions. The city of Cambridge has much to offer in terms of its beautiful scenery, rich history, and unique culture. Graduate students of our team went on a punting tour and got to learn the cultural and social history behind the most prominent colleges and their students at Cambridge University. It was a truly memorable experience.

Here’s a video prepared by the organizers showing what you missed if you didn’t attend the Teaching and Language Corpora conference.

Interested in our other workshops? Check out our workshops page.

Midway through our programming workshop series, the Methodology Workshop for Natural Language Programming, the Crow team took a step back to reflect on what direction we wanted to take in our work and the role programming and coding would play in our long-term mission. Much of our discourse revolved around our identity as a research group, specifically concerns about creating a self-sustaining organization that can adapt to change.

Our discussion started with questions of achievement:

  • What tasks must we perform to accomplish our current milestones, such as TaLC and the Symposium? What additional milestones do we want to pursue?
  • How do we differentiate between internal resources, tools, and deliverables and external materials that could be shared with our partners?

From there, we generalized into more existential questions, such as our identity as a team and how that identity has changed because of these workshops.

  • What does it mean to be a Crow researcher?
  • What criteria will future Crow members need to meet?
  • Will the coding and programming skills covered during this conference be a standard expectation for all team members or only a select few?
  • How do we train incoming members?

To address these issues, we discussed creating personas for the different positions within the Crow team and using them to generate a specific set of criteria for each position. Our discussion then segued into the roles of existing Crow members and how their personal goals intersect with team goals. Should current members be involved with Crow until the end of their PhD studies? How big a role should fourth-year students play in Crow? Bradley shared his view that all fourth-year students should be applying for fellowships in their specific fields of expertise, and that members who acquire fellowships should be helpfully “fired” and allowed to pursue these new opportunities. However, a great deal will depend on how much each student’s personal objectives dovetail with Crow research.

To that end, further dialogue was devoted to the need for more Crow meetings and greater articulation of member goals. A prime area of concern was balancing the coordination of internal management with the myriad other obligations on the Crow agenda. How do we sustain the succession of new students when current members leave? Financially, how do we continue our work if our grant requests are not successful? (The idea of counterfeiting was briefly considered but unanimously vetoed.) Furthermore, what are the next steps once we achieve a working corpus? Do we want our research to simply be open sourced, or should we create a service model where access to Crow is free but users are charged for support services from Crow members?

As the session concluded, we decided that questions regarding preparation for TaLC would be revisited at the summit meeting. Topics of immediate concern, such as how to transition workflow once students graduate and revisiting our one-, two-, and five-year plans, were scheduled for future discussion.

In this series of posts, we reflect on the Methodology Workshop for Natural Language Programming, coordinated by Crow developer Mark Fullmer and hosted at Purdue.

In our first session of the workshop, the Crow team gained a basic understanding of how to approach coding. The principles we took away from the session gave us a solid theoretical framework on which to build practical coding skills. We’ll be learning Python, given its simplicity, flexibility, and suitability for the text processing that is part of Crow research.

Principle 1: Automate everything. Automating as much of our data processing as possible not only increases our efficiency and accuracy, it also gives us the ability to plug pre-programmed segments into future projects with minimal fuss or extra effort.

Principle 2: Separation of concerns. Like many things in life, programming is easier when broken down into small steps, each one performing a different function, like assembly lines in a factory. Step by step, we worked through an example to consider the process of writing code. First, we separated words into individual entities using a delimiter, such as a comma. Next, we counted how frequently each word occurred. Lastly, we displayed the list for frequency analysis. Splitting our code into separate “factories” provides two advantages: (1) the code can easily be recycled for future programs, since it is much more efficient to tweak a small segment of code than to rewrite an entire program; and (2) it’s easier to test the accuracy of our program when it’s broken up into short code segments. Simply modify your test when you want to reach a different result for that portion of code.
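
The three-step example above can be sketched in a few lines of Python; the function names and sample text here are our own illustration, not code from the workshop.

```python
from collections import Counter

def tokenize(text, delimiter=" "):
    """Factory 1: split raw text into individual words using a delimiter."""
    return [word for word in text.split(delimiter) if word]

def count_frequencies(words):
    """Factory 2: tally how many times each word occurs."""
    return Counter(words)

def display(frequencies, top=5):
    """Factory 3: print the most frequent words for analysis."""
    for word, count in frequencies.most_common(top):
        print(f"{word}\t{count}")

words = tokenize("the crow saw the corpus and the crow wrote")
display(count_frequencies(words))
```

Because each step lives in its own function, a test can target one “factory” at a time, and any single factory can be swapped out without touching the others.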

Principle 3: Don’t make assumptions. We learned that creating code on the assumption that the results we need now will be the same results we need tomorrow is a crucial mistake. For instance, hardwiring a text processor to remove all apostrophes will make it useless if down the road we need to analyze possessive nouns. Instead, it is better to create an optional “factory” that can be removed or upgraded to obtain the desired result. Also, we shouldn’t assume that a computer can read the text in its current format. Elements such as capitalization, punctuation, and character encoding are not read by computers the way we read them and must be normalized before a text can be analyzed.
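
In Python, one way to keep that flexibility is to make each cleaning step an optional flag rather than a hardwired behavior. This sketch is our own illustration of the idea; the function and parameter names are hypothetical.

```python
def normalize(text, lowercase=True, strip_punctuation=True):
    """Optional cleaning 'factory': each step can be switched off later
    (e.g., keep apostrophes when analyzing possessive nouns)."""
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        # keep letters, digits, and whitespace; drop everything else
        text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return text

print(normalize("The Crow's corpus!"))                           # fully cleaned
print(normalize("The Crow's corpus!", strip_punctuation=False))  # apostrophe kept
```

Nothing here assumes today’s cleaning decisions are permanent: a future project that needs possessives simply passes `strip_punctuation=False` instead of rewriting the factory.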

Principle 4: Avoid hardcoding. When labeling our different “factories” we should leave room for flexibility or else the name won’t match the function when we make changes. We must maintain a balance between generalization and specificity.

Principle 5: Keep it simple, stupid. A hallmark of good Pythonic code is that the simplest methods are used even if it requires writing more code.

Principles 6 and 7: Convention over configuration, and write your code for the next programmer. Following standardized coding formats will make our code more accessible to other programmers than if we personalize code to our own preference. To further help other programmers decipher our work, it is helpful to add inline commentary that explains difficult aspects of our code and to follow the syntactic conventions already in existence.

Principle 8: Don’t repeat ourselves. The goal with coding is to reuse and recycle. Instead of rewriting a slightly different version of the same code for multiple different programs, we should write the code once then modify it for different uses. Writing code is much like creating a résumé: build one, but tailor multiple drafts to different employers.

Principle 9: Don’t write new code unless it’s absolutely necessary. More than likely, someone else has already written the code we need, so there is no point in reinventing something we can borrow. This principle is the programmer’s version of “work smarter, not harder.”

We concluded our session by discussing these principles and articulating them in other ways to see how well we understood them. After learning more about the coding process, Crow members felt confident moving forward into writing actual Python code. More on that in our next post!

The Crow team recently concluded a four-day workshop series, Methodology Workshop for Natural Language Programming. The workshops, led by Crow researcher and software developer Mark Fullmer, were designed to equip our team members with the fundamental coding and programming skills needed to construct our own programs and troubleshoot problems we encounter in existing scripts. By obtaining a functional knowledge of programming, we can meet our goals to make Crow sustainable and increase team member contribution to corpus- and interface-building tasks. At the end of the week, we expected all Crow members to (1) build a working vocabulary of coding terms; (2) progress past the introductory threshold of programming; and (3) better understand and articulate programming challenges we encounter as we integrate our corpus and repository.

Crow programmer Mark Fullmer presenting to researchers

Mark Fullmer opening the technical workshop

To maximize our learning and productivity during the rest of the week, Mark led the Crow team in an assessment of our current programming skills, identifying what threshold of competency each person wished to achieve by the conclusion of the workshops, and establishing a framework for researchers to form their own personal learning objectives. Mark gave us a checklist of coding tasks to measure against our current programming knowledge and help us compare our progress against a list of definable expectations. Talking over the tasks we were already performing in Crow revealed the varying levels of coding experience among team members, and Mark encouraged us to pick and choose the workshops that we would find most useful. During our brainstorming, we created a running document listing the aspects of programming we found most difficult and the specific problems we had encountered. Crow researchers continually updated this document and others throughout the workshop, and we’ll be sharing them soon.

After evaluating our programming competence and articulating our short- and long-term goals for the workshops, Mark gave us a preview of the week’s work. The three mantras for the rest of the week were: (1) text processing is recursive and will almost always require future modification; (2) code is an inherently disposable entity that we use to accomplish a specific task; (3) if it isn’t documented, then it doesn’t exist in code.

Participation by our collaborators at Arizona was facilitated by Google Hangouts on Air, a fabulous tool that also records videos of the workshops for us to review, edit, and post online.

Over the next month or so, we’ll offer a series of posts which recap the workshop and help us think about ways to develop it into a resource which the Crow community can use as we work together to build the Crow web interface. Stay tuned for post two!

Researchers from the Crow team presented “Citation practices of L2 writers in first-year writing courses: form, function, and connection with pedagogical materials” at AAAL 2018. The presenters were Wendy Jie Gao, Lindsey Macdonald, Zhaozhe Wang, Adriana Picoral and Dr. Shelley Staples.

Crow Citation Team at AAAL 2018

Dr. Shelley Staples, Adriana Picoral, Lindsey Macdonald, Wendy Jie Gao, and Zhaozhe Wang


Citation practices and styles are integral to academic writing contexts. Previous research on citation use has focused on variability across citation form (e.g., integral/non-integral) and function (e.g., synthesis/summary) (Charles, 2006; Petrić, 2007; Swales, 2014). However, most studies have focused on advanced L1 English student and professional writing. In addition, no studies to date have investigated the influence of instructor materials on students’ citation practices. Using a corpus of L2 writing, we examined (1) how the L2 writers’ citations vary in form and function across different assignments and instructors, and (2) how students’ citation practices might be influenced by the pedagogical materials provided for each assignment.

Our corpus includes 74 papers (72,395 words) across two assignments, a literature review (LR) and a research paper (RP), from a first-year writing course for L2 writers. We calculated the number of citations and references in each assignment (per 1,000 words), and coded citations for integral, non-integral or hybrid (integral and non-integral) forms. We then coded citation functions based on Petric (2007) and qualitatively examined the relationship of the writing to pedagogical materials, such as the number of sources required and the form and function of citations in sample papers.
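
Normalizing raw counts to a per-1,000-word rate makes papers of different lengths comparable. A minimal sketch of the calculation (the function name and the example figures are our own illustration, not project data):

```python
def per_thousand_words(raw_count, total_words):
    """Convert a raw count into a rate per 1,000 words of text."""
    return raw_count / total_words * 1000

# e.g., a hypothetical paper with 12 citations in 2,400 words
print(per_thousand_words(12, 2400))  # 5.0 citations per 1,000 words
```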

Our preliminary results show that the writers most frequently use integral citations with little synthesizing function. While there is large variation in the number of citations both within assignments and across instructors (LR: 3.54–9.73, RP: 5.25–7.05), the number of references is more consistent in the literature review (LR: 2.66–3.27, RP: 3.72–4.88). Students prefer non-quote citations over direct quotations. Integral citation is more frequently used in the literature review, while non-integral citation appears more in the research paper. The hybrid citation form appears consistently across almost all sections. These results might be attributed to instructors’ use of model literature review papers that almost exclusively feature integral citations, as well as explicit requirements (three sources) in the assignment sheets. Attribution only is the largest category among the rhetorical functions of all the citations. In addition, students’ awareness of establishing links between sources and making statements of use seems to have been influenced by sample papers. Our findings show the potential need for more instruction on the use of sources for synthesizing information, and the important influence of pedagogical materials.

Citation project conference handout (PDF).

Selected References

Charles, M. (2006). Phraseological patterns in reporting clauses used in citation: A corpus-based study of theses in two disciplines. English for Specific Purposes, 25(3), 310–331. doi:10.1016/j.esp.2005.053

Lee, J. J., Hitchcock, C., & Casal, J. E. (2018). Citation practices of L2 university students in first-year writing: Form, function and stance. English for Specific Purposes, 33, 1–11.

Petrić, B. (2007). Rhetorical functions of citations in high- and low-rated master’s theses. Journal of English for Academic Purposes, 6(3), 238-253.

Swales, J. (2014). Variation in citational practice in a corpus of student biology papers: From parenthetical plonking to intertextual storytelling. Written Communication, 31(1), 118–141. doi:10.1177/0741088313515166

On February 23, 2018, members of the University of Arizona Corpus Lab, Dr. Shelley Staples and Adriana Picoral, held a Friday Tech Talk demonstrating the Word And Phrase application. The focus of these weekly talks, which are organized by the iSpace at the University of Arizona, is on eliciting conversations around different types of digital tools. The targeted tool for this workshop (Word and Phrase) pulls data from the BYU Corpora (in English, Spanish, and Portuguese), allowing users to search new and pre-existing texts, color coding each word based on its frequency. The application assigns each word to one of three frequency ranges based on word usage within the corpora: 1–500 (blue), 501–3000 (green), and >3000 (yellow).
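
The color banding amounts to a simple lookup on a word’s frequency rank. A sketch in Python (the function name is our own; the rank thresholds are those listed above):

```python
def frequency_band(rank):
    """Map a word's frequency rank to Word and Phrase's color band."""
    if rank <= 500:
        return "blue"    # most frequent words
    elif rank <= 3000:
        return "green"   # mid-frequency words
    else:
        return "yellow"  # less frequent words

print(frequency_band(250))   # a top-500 word
print(frequency_band(4200))  # a word outside the top 3000
```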

(Academic text sample color-coded by word frequency.)

Word frequency can also be separated by genre: Spoken, Fiction, Magazine, Newspaper, and Academic. This feature allows instructors to illustrate to their students which types of speech appear in which genres; for example, the pronoun ‘I’ is found more frequently in the Spoken and Fiction genres, as opposed to Academic writing, where it is least likely to be used. The application identifies the part of speech, ranking, frequency, collocates, and synonyms for each word within the top 3000-word frequency range; Word and Phrase allows students to explore when and how to use specific words or phrases based on information from the BYU corpora as well as other resources (such as WordNet).

(Frequency of the pronoun ‘I’ across genres.)

(Ranking and frequency of the word ‘say’ as each possible PoS (Part of Speech).)

(Concordance lines of the word ‘tell’ as collocations, providing definition, PoS, and synonyms.)

Participants gave positive feedback on synonyms provided by the word search tool, where more and less frequent synonyms to the search word are displayed with some information on meaning variation provided. They also noted that with these tools, students are able to access the program on their own for autonomous learning.

Here’s our handout on using Word And Phrase.

For information on other Tech Talks organized by the iSpace at the University of Arizona, please visit the iSpace website.

Interested in our other workshops? Check out our workshops page.


By Kelly Marshall and the AZ Crow Team

On February 17, 2018, the University of Arizona Corpus Lab hosted an introductory workshop on how to use AntConc at the 17th Annual SLAT Interdisciplinary Roundtable. The workshop was led by Adriana Picoral, Nicole Schmidt, Curtis Green, and Shelley Staples, with help from Kelly Marshall, Ali Yaylali, Nik Kirstein, and Yingliang Liu. For this workshop, we changed the layout of our last workshop to better fit the needs and purposes of the attendees at this conference. The first notable change was the use of two different corpora: the Arizona Second Language Writing Corpus (ASLW) (part of Crow) and the Spanish Learner Language Oral Corpora (SPLLOC). The components we used from the ASLW corpus included Narrative and Rhetorical Analysis student-written papers, while the components we used from the SPLLOC corpus were Modern Times Narrative and Photo Interview files. The goal of the workshop was to help instructors understand how to use AntConc and how to integrate the application and its results into their pedagogy. This was different from our last workshop presentation (given Nov. 21, 2017), where we focused exclusively on the ASLW (Crow), since our audience for that workshop was instructors in the UA Writing Program.

Other differences included the workshop space as well as the activities. The workshop was hosted in one of the computer labs in the Modern Languages building. This room allowed all workshop participants to interact, learn, and explore the AntConc program instead of having to share with another participant like last time. However, since the time slot was only an hour and fifteen minutes (rather than the hour and forty-five minutes allotted last semester), we condensed the workshop by covering terms during the activities rather than presenting them at the beginning. We also condensed the number of activities participants completed, from five to three. This was done to allow participants, like last time, to independently explore the program, interact with one another, and ask us questions after completing the activities.

Before the workshop, we ensured all computers had the AntConc application installed and the appropriate corpus files in Spanish and English downloaded. This allowed us to save time and start the workshop promptly, without having to spend the first part of the session instructing participants how to download and access the files and program. This pre-workshop preparation was necessary because we did not know who the participants were in advance (so we were unable to contact them with instructions on how to access the data). In the future, our corpus data will be more easily accessible through a website, which will facilitate this process.

During the workshop, participants were taught how to hide tags so that personal, instructor, and other course-related information included in the student papers was not displayed in the results. It should be noted that a potential problem with hiding tags is that the output will be limited in the concordance function. Although we did not introduce this issue at the beginning of the workshop, we showed participants how to solve this problem when we presented activities using the concordance function (i.e., unhide tags if more text is desired). The activities focused on instructing participants to search for specific words or n-grams (contiguous sequences of words, e.g., 1-gram, 2-gram, 3-gram), and how to see these in a list, in the Word List function, or as key words in context (KWIC) in the Concordance function.

(KWIC concordance results with tags included.)

(KWIC concordance results with tags hidden.)

When searching in the concordance window, those in the workshop were taught how to select a window size and how to sort by frequency, range, or word. The KWIC search shows the words 1, 2, or 3 places to the left or right of the key word. In addition, participants were taught how to search by prefixes and suffixes, or to locate citations by searching “(*)”.

(N-Grams sorted by range to show the most common n-grams across all uploaded files.)

While there were notable differences between the two workshops, both had the underlying goal of providing instructors with a new approach to creating materials and illustrating the pragmatic use of lexical items and grammar, in order to show their students the contexts and patterns of words within a specific genre. Moreover, throughout both workshops, we asked questions and had conversations with participants regarding how AntConc could be used to provide authentic writing examples and address common error patterns.

The workshop concluded with a discussion, first in small groups and then with the entire group, about how these methods translate into lessons. The teachers were given time to reflect on how they might use what they had learned in their own pedagogy.

Here’s our AntConc handout from the workshop.

Interested in our other workshops? Check out our workshops page.


Members of the interdisciplinary Crow team have been working on what we’ve been calling internally our “Citation Project” since the summer of 2017. This name is our homage to The Citation Project conducted by Rebecca Moore Howard and Sandra Jamieson.

Wendy Jie Gao, Lindsey Macdonald, and Terrence Zhaozhe Wang videoconference with Shelley Staples and Adriana Picoral.

When the project research was first presented at Corpus Linguistics in 2017, it was titled “Variability in Citation Practices of Developing L2 Writers in First-Year Writing Courses.” The purpose of the study can be stated as follows: “By examining L2 students’ citation practices in their assignments (Literature Review and Research Paper) for an introductory writing course, we explored their preference for particular citation styles and possible variance across assignments and instructors.”

Currently, our research focuses on what we’re calling citations and non-citations, as well as the various forms and functions of the citations students use in two genres: literature reviews and argumentative essays. All of the documents used for the project are from the Purdue Crow Second Language Writing corpus; a total of 132 papers and 147,000 words have been analyzed. We are examining many different styles of citations, including quote and non-quote, as well as integral and non-integral. An integral citation includes the author’s or article’s name in the citing sentence. In a non-integral citation, the author’s or article’s name appears in parentheses at the end of the sentence. A non-citation doesn’t explicitly state the name of the author or article.

Our findings revealed that students use more citations in a research paper than in a literature review, and that they have a preference for integral citations, especially in a literature review. Most importantly, we discovered that students’ work is highly framed around the sample papers that instructors provide.

Our team plans to present this research on March 27 at the AAAL 2018 conference (9:10 to 9:40 am, Arkansas Room). We hope to grow the number of documents in the project in order to expand the knowledge it can provide.

The Crow team is composed of a variety of scholars at many different levels of academia and from many different fields. Crow includes professors of writing, ESL, EAL, SLAT, and many other areas of English and language studies. On top of this, Crow also includes three undergraduate interns, whose experience is broadly expanded by exposure to many workplace aspects such as a collaborative work environment, research opportunities, and more! Each of the undergraduate interns became a part of Crow for different reasons and hopes to further their academic career through the experience gained here. Below, each intern explains how they first became involved with Crow and what experiences they hope to gain from this internship opportunity.


Nik Kirstein: Nik Kirstein is a junior in Information Science. He first got interested in Crow after working with a corpus to analyze the Russian language. Crow helps Nik gain experience in text and data processing and has introduced him to corpus informatics applications such as AntConc, all of which tie into information science very well. Nik hopes to gain more experience in data visualization and back-end database development with corpus data. He wants to work in the cybersecurity industry one day.


Blair Newton: Blair Newton is a senior in Professional Writing. She first heard about Crow from her Intro to Professional Writing professor, Dr. Michael Salvo. The internship opportunity appealed to her because of how much varied experience it would expose her to that classes couldn’t offer. Blair does research, blog posts, grant writing, and graphic design for Crow. She hopes one day to combine writing and marketing as a career and eventually even write a novel.



Jessica Kukla: Jessica Kukla is a senior professional writing major on the editing and publishing track at Michigan State. While writing and editing are her forte, Jessica has a growing interest in technical writing and in information and experience architecture, which led her to working with Crow. She hopes to gain more experience with grant writing and working with corpus data. After MSU, Jessica hopes to pursue higher education in something along the lines of information architecture.


By Nik Kirstein, Blair Newton, & Jessica Kukla 

On November 10th, 2017, the University of Arizona Corpus Lab held its AntConc Workshop. AntConc is an application that allows users to view useful information about a text, such as word frequency, the placement of a search term in the text, and more. The main goal of the workshop was to help instructors 1) develop an understanding of how to use Crow and AntConc to address language awareness within their writing classrooms, and 2) understand the value of using students’ writing. Corpus data offers a new way to look at learning English as a second language.

Using a corpus, instructors can see what common mistakes their students make or what patterns of language are more common in certain genres, and then create activities based around them. For example, when the instructors searched for parentheses, the locations of the citations within the papers could be seen in the concordance plot. This helped instructors see how their students were using citations and whether or not they were being used correctly.

Another idea raised during the workshop was comparing the use of the word “like” in written papers versus spoken English. The differences between writing and speech help us understand how these students are learning and understanding English. The workshop was a great success.

Photo from AntConc Workshop