Corpus and Repository of Writing

The web interface for Crow, the corpus and repository of writing, depends on a complex amalgam of interdependent bits of code built by thousands of people.

Five-minute read.

A few years ago Thomas Thwaites decided to make a toaster. From scratch. Armed with the breadth of extant human knowledge (thanks, internet!), after a bit of petroleum refining here, a bit of iron smelting there, Thwaites would assemble some things into a bigger thing that would be capable of toasting bread. Surely a simple task. It wasn’t. Thwaites’ exploration into the institutional knowledge and global dependencies that go into a seemingly trivial kitchen appliance makes visible the complexities of something we take for granted. Thwaites shows the real cost of a $20 toaster.

Most web applications I build can’t make your breakfast, but like your average toaster, they have a deep system of prerequisites and dependencies. And even though I do software development full-time—even though my job is to understand what’s going on behind the scenes—I take for granted how much my software relies on the work of so many others.

Like practically all software today, Crow’s online corpus and repository of writing leverages many other software packages — bundles of code that provide discrete services such as sending and receiving data over the web, reading from and writing to a database, rendering tables and forms and charts. Caching. Authenticating. Validating. In the case of Crow’s software, there are also corpus-specific libraries for normalizing, tokenizing, lemmatizing, indexing, querying, filtering, excerpting, and highlighting.

Those parts of the Crow code were created (and are actively maintained) by other developers. Not me. The software I build just sits on top of it, connecting the dots to make purpose out of possibility. And part of my development time is simply keeping Crow code updated with changes in those dependencies, changes which can take the form of bug fixes, security patches, and new features.

To help me with those updates, I built a tool to visualize my code’s dependencies: the Composer Dependency Tree Generator.

Package Management

A bit of background: most contemporary software uses package management tools to define and retrieve the external software libraries an application needs. In the case of the PHP programming language, that package management tool is Composer. Package managers also help developers identify when those packages have updates available (though testing the updates before applying them is still the developer's job). Package requirements are stored in a single file which contains all of the building blocks of the application — its DNA. The Composer Dependency Tree Generator takes that DNA file and renders it as a collapsible tree.
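To make that "DNA file" concrete, here is a minimal, hypothetical composer.json manifest. The project name and version constraints are illustrative, not Crow's actual configuration; drupal/simple_oauth is the real Composer package for the Simple OAuth module discussed below:

```json
{
    "name": "example/crow-like-app",
    "description": "Hypothetical manifest for a corpus web application",
    "require": {
        "php": ">=8.1",
        "drupal/core-recommended": "^10",
        "drupal/simple_oauth": "^5.2"
    },
    "require-dev": {
        "phpunit/phpunit": "^9"
    }
}
```

Notice that only direct dependencies appear here. Running `composer install` resolves this short list into the full transitive tree and pins every resolved package to an exact version in composer.lock, the much longer file a dependency visualizer reads. From the command line, `composer show --tree` prints the same tree, and `composer outdated` flags packages with newer releases available.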

As visualized below, the Crow software backend requires 36 packages in order for me to write the “real” code of the application (click image to interact).

Maybe 36 dependencies for a single project seems reasonable, even manageable. But as you probably already guessed, those packages depend on other packages. Zoom in on one small but critical part of the application: user authentication. Crow depends on Simple OAuth to send user credentials through the internet securely and verify that users are who they say they are. Simple OAuth depends on other packages (for instance, league/oauth2-server implements the OAuth 2.0 standard for handling access tokens and refresh tokens), which in turn depend on other libraries like defuse/php-encryption to encrypt data using the OpenSSL library, which is, itself, a dependency. The fractal nature of dependencies quickly comes into focus (select to enlarge):

Because of these interrelationships, Crow’s DNA file — not the actual code, mind you, just the list of which building blocks the code needs — is 8,700 lines long. Behold, the fully articulated Crow dependency tree (select to interact):

So: a lot of building blocks. A lot of moving parts. A lot of humans writing code. When you think about it this way, that simple blog you pay $20 a month for was built — and is actively maintained — by thousands of people.

So what?

For me, three main insights come to mind. First, software developers simply cannot build applications without relying on an enormous body of other people's work. One open source project I built for the Drupal content management system, Layout Builder Restrictions, is in active use on 12,000 websites. Let's say it's taken me 100 hours to write and maintain that code. If the developers of each of those 12,000 websites had to build the same functionality as Layout Builder Restrictions individually, just that tiny bit of functionality would have added weeks of work to the timeline of each of those sites: 1.2 million hours of total developer time.

Second, put that first point the other way around: because so many developers' work is provided as open source, we can quickly build many applications that do many different things: 12,000 websites can be built in a fraction of the time it would take developers working in isolation. (And if you'll forgive the pretentious meta moment, I would point out that my Composer Dependency Tree Generator is itself simply an implementation of the open source visualization library D3.js. It took me only hundreds of lines of code to write because someone else had already written thousands.)

And third (the inescapable and scary implication of the previous points): given the amount of code it takes to build a modern web application, and given that code is necessarily built by many different people, a single development team on a single project simply cannot review and vet every line of code. Scanning the Crow dependency tree above, I’ve read far less than 1 percent of its total code. There is no other way to say it: the way we build software — the only way we can build software — carries with it inherent vulnerability.

But there’s good news. I may not have looked at the code personally but many other open source contributors have. And when problems are discovered — bugs or security flaws — developers collaborate to fix the problems and make those fixes available to the rest of us. And that’s why I think it’s so important not just to use open source software but to contribute. Return to that project I built, Layout Builder Restrictions. If I made that software proprietary and charged those 12,000 websites $20 a month, I’d be rich enough to retire. But then: if I had to pay $20 a month for the myriad packages my applications depend on, I’d be filing for bankruptcy.

So: the open source methodology works because of deep interdependencies, and open-sourcing creates a virtuous cycle through participation and proliferation. When I can share my labor back to the community, all things considered, it’s a bargain.

Screen capture showing the Humanities Without Walls grand research challenge.

Summer is in full swing for the Crow team. Crow was recently awarded a seed grant from Purdue’s College of Liberal Arts. This grant provides funding for summer work that will help the team prepare to write an application to the third Humanities Without Walls Grand Research Challenge this fall.

As part of the work this summer, we will begin the process of updating the Crow website. We’re looking to enhance our site’s usability, design, content, and features so that everyone who accesses it can get the most out of the resources it offers.

Since HWW this year stresses reciprocity and redistribution, we will also review our past Crow workshops (and attendee responses to them) to gain more insight into how our work gives back to the academic communities we interact with.

Finally, we will be expanding Crow outreach, working to partner with a number of diverse institutions.

It’s going to be an exciting few months, so stay tuned.

Building on the success of her previous Crow workshops engaging the R environment, Dr. Adriana Picoral will host “Quantitative Language Data Analysis and Visualization in R” on April 17, 2021, from 9:00am to 11:00am (Arizona/US/MST). (See this in your time zone.)

In this workshop, we will work with the variable (that) data from Tagliamonte’s book “Variationist Sociolinguistics” (2012). We will visualize the frequency of that-complementizer omission across speaker groups, and run both linear and logistic regressions. Concepts such as correlation, interaction, and contrasts will be addressed.

Interested?

  1. Please register for the workshop.
  2. Download and install the latest R version from https://cran.r-project.org
  3. Download and install the latest RStudio from https://rstudio.com/products/rstudio/download/#download 

If you are unable to attend, watch for a video on the Crow YouTube channel the week following the workshop.

Questions? Please contact Dr. Picoral.

We’re happy to share some recent good news from across the Crow team.

Nina Conrad

Congratulations to Crow researcher and University of Arizona doctoral candidate Nina Conrad, who was awarded a Bilinski Fellowship for her dissertation project, “Literacy brokering among students in higher education.” The fellowship will fund three semesters of writing and includes professional development opportunities as well. 

Hannah Gill

Crow researcher Hannah Gill was admitted to the Mandel School of Applied Social Sciences at Case Western Reserve University, with a scholarship and funding to support her fieldwork. In May, Hannah will graduate from the University of Arizona with a double major in English and Philosophy, Politics, Law, and Economics (PPLW).

Thank you to everyone who attended our third Crow Workshop Series event, focusing on grant writing. If you were not able to attend, please see the video on our YouTube channel. Our slides and handout are also available. 

We were so pleased by the turnout. Our workshop team (Dr. Adriana Picoral, Dr. Aleksandra Swatek, Dr. Ashley Velázquez, and Dr. Hadi Banat) is reviewing the feedback we got and planning our next event. Stay tuned! 

Dr. Adriana Picoral

Dr. Picoral was awarded a mini-grant for a series of professional development workshops designed to increase the gender inclusivity of the data science programs at the University of Arizona. The workshops will be hosted by Dr. Picoral in cooperation with two invited speakers. 

Ali Yaylali, Aleksey Novikov, and Dr. Banat wrote about Crow’s approach to data-driven learning (DDL) in “Using corpus-based materials to teach English in K-12 settings,” published in TESOL’s SLW News for March 2021. This is our second piece for SLW News, following “Applying learner corpus data in second language writing courses,” written by Dr. Velázquez, Nina Conrad, Dr. Shelley Staples, and Kevin Sanchez in October 2020.

Finally, Dr. Picoral, Dr. Staples, and Dr. Randi Reppen published “Automated annotation of learner English: An evaluation of software tools” in the March 2021 International Journal of Learner Corpus Research. Here’s the abstract:

This paper explores the use of natural language processing (NLP) tools and their utility for learner language analyses through a comparison of automatic linguistic annotation against a gold standard produced by humans. While there are a number of automated annotation tools for English currently available, little research is available on the accuracy of these tools when annotating learner data. We compare the performance of three linguistic annotation tools (a tagger and two parsers) on academic writing in English produced by learners (both L1 and L2 English speakers). We focus on lexico-grammatical patterns, including both phrasal and clausal features, since these are frequently investigated in applied linguistics studies. Our results report both precision and recall of annotation output for argumentative texts in English across four L1s: Arabic, Chinese, English, and Korean. We close with a discussion of the benefits and drawbacks of using automatic tools to annotate learner language.

Picoral, A., Staples, S., & Reppen, R. (2021). Automated annotation of learner English: An evaluation of software tools. International Journal of Learner Corpus Research, 7(1), 17–52. https://doi.org/10.1075/ijlcr.20003.pic

We thank all of the Crow researchers and Crow friends who supported this good work, and the editorial teams, reviewers, and funders who made it possible. 

The Crow leadership team would like to express its condemnation of anti-Asian and gender-based violence and to communicate its support for Asian Americans, Asians, and Pacific Islanders. We are saddened and angered by the murders of Delaina Ashley Yaun, Paul Andre Michels, Xiaojie Tan, Daoyou Feng, Hyun Jung Grant, Soon Chung Park, Suncha Kim, and Yong Ae Yue. We condemn the increased violence this past year against Asian and AAPI students, faculty, and individuals at our own institutions, both in the U.S. and abroad.

Our team closely collaborates with our Asian and AAPI team members, and greatly values the contributions of Asian and AAPI students and teachers as participants in our research. We want to express our solidarity with those individuals and others in our own professional networks, and invite others to do the same for their students and colleagues.

Please visit Stop AAPI Hate for more information and resources.

Ashley Velázquez
Shelley Staples
Michelle McMullin
Adriana Picoral
Bradley Dilger
Randi Reppen

The Crow workshop series continues! 

Before you start “writing” flyer. PDF version available.

Fellowship and Grant Writing for PhD Students & Early Career Scholars, Part I: Before You Start “Writing”
In this workshop, we will discuss why you should apply for grants and fellowships (and the difference between the two). We will also address how to find grants and fellowships, as well as how to prepare to apply. Designed for early-career scholars from around the world who conduct writing research, broadly construed.

Saturday, March 13, 2021, 9:00 to 10:30 am Pacific/USA 
(UTC: Sat Mar 13, 17:00 to 18:30)

Presenters are Aleksandra Swatek, PhD; Hadi Riad Banat, PhD; Adriana Picoral, PhD; and Ashley J Velázquez, PhD.

Please register for the workshop and share any questions you have beforehand. We hope to see you there! 

The Crow team is excited to be a part of the University of Arizona’s Women’s Hackathon for 2021. We’ll be offering a workshop, “Collaborating online: Lessons from a Successful Team,” on Saturday, March 6, at 1:00pm Mountain time. Michelle McMullin, Shelton Weech, and Bradley Dilger will be facilitating.

Collaborating online: Lessons from a Successful Team
Based on the experiences of an interdisciplinary software design and research team working at multiple sites, we share three principles for collaborative teams that prioritize inclusivity and mutual respect. Examples and practical techniques will help your team work more effectively, both asynchronously and in person.

We offer three best practices you can adapt to your team:

  1. Build visible infrastructure: For online teams, digital infrastructure is the documents and communication that facilitate work. We share the consecutive agenda, our approach to keeping notes and agendas for meetings, and principles for using a team communication platform like Slack, Basecamp, or Microsoft Teams.
  2. Practice active listening: Krista Ratcliffe describes active listening as actively seeking to hear what is different about the ideas of other people. We offer several concrete approaches for listening actively to others on your team.
  3. Coordinate work purposefully: Distributed teamwork requires connecting people, tools, and documents that are separated geographically, sometimes in different time zones. Scholars call this work coordination. We describe ways to coordinate work across documents and digital infrastructure.

We’ve created a template for the consecutive agenda Crow teams use to combine meeting agendas, notes, and links to our team communication platform. Examples of other techniques appear in the video presentation.

Our materials:
A video of our presentation, for those unable to attend synchronously.

The slide deck for our presentation is also available.

Our second Crow workshop will be held on December 19, 2020 from 9:00 to 11:00am (Arizona time/MST).

“Corpus Searches in R: Regular Expressions and Concordance Lines” will be hosted by Adriana Picoral, PhD, assistant professor of data science at the University of Arizona. 

Workshop flyer. PDF also available.

Corpus Searches in R: Regular Expressions and Concordance Lines
Saturday, December 19, 9am to 11am (Arizona time/MST).

In this workshop, we will work with a tagged corpus. We will go over the steps of reading a corpus (organized as multiple text files) into R, searching the corpus using regular expressions, and producing concordance lines. We welcome corpus linguists who are not yet familiar with R but are interested in expanding their coding skills.

Register through Zoom. For more information, please contact Dr. Picoral.

Did you miss our first workshop? Watch a video on our YouTube channel.

Workshops in Spring 2021 will be announced soon. Got a workshop suggestion? Let us know!

In May of 2020, Crow members Ashley Velázquez, Hadi Banat, and Shelley Staples hosted a workshop with Metropolitan State University of Denver (MSU Denver) faculty and students. Originally, our workshop was intended to be held during TESOL’s International Conference in Denver, Colorado. Unfortunately, due to COVID-19, we were not able to attend TESOL this year, but we were able to continue our outreach efforts by advertising our workshop to interested parties. To our delight, several folks at MSU Denver were excited to participate in a virtual workshop with us to learn more about Crow’s online corpus and how it can be used for innovative teaching and research, teacher training, and writing center work.

Slide from our presentation, reading "Using the Corpus and Repository of Writing for Teaching and Research." Two images: concordance lines showing a query for "research," and a cartoon of people of diverse ages, genders, and races saying "Hello" in multiple languages.
“Using Crow for Teaching and Research,” our slide deck

In alignment with our goals for our ACLS Digital Extension Grant, outreach efforts this year have primarily focused on expanding our corpus to represent a new population of multilingual writers, heritage Spanish writers at the University of Arizona, while also reaching out to other institutions that serve this population of students. MSU Denver is a newly designated HSI, or Hispanic-Serving Institution, so it was fitting that we were able to introduce Crow to this particular audience.

Our workshop with MSU Denver focused on both teaching and research. Unlike past workshops, we focused on building an explicit relationship between teaching and research that was accessible to those with little to no experience in corpus linguistics. Additionally, unlike other workshops, our audience members, all except one, were teachers in training and writing center tutors in training enrolled in the RIDES program. Finally, we were invited to conduct this workshop as part of a mentoring course for the RIDES program. Until now, the majority of our workshops have been held at, or alongside, conferences (excluding our workshops at Wright State University and Universidad de Sonora).

We introduced our online corpus by starting with a few simple searches and walking participants through the various filtering options. We asked participants to examine the information available in these searches while also demonstrating the connection between our corpus and our repository. After demonstrating a few searches, we asked our audience to think about how such searches (e.g., transitions and synonyms) might be used to develop classroom- and tutoring-specific activities. For example, for synonyms, we may want to help students develop their vocabulary by noticing nuanced differences between near synonyms like important and significant. Teachers can help students discover and notice these differences by providing authentic examples of these synonyms in use and guiding them with questions and corpus-based activities.

Finally, we introduced the audience to the repository interface features and the metadata pertaining to the pedagogical materials we are collecting. For example, workshop participants explored the repository’s search tools and filters to look for pedagogical materials pertaining to certain assignment genres of interest. By going through metadata filters such as institution, year, semester, course type, modality, and length, they got a better sense of the variety in pedagogical materials across Crow sites. We then demonstrated some searches within the repository, focusing specifically on how assignment handouts, syllabi, rubrics, and classroom materials may be used during a tutoring session in the writing center and for the purposes of tutor training.

What did we learn from hosting the workshop?

The writing center tutors in training at MSU Denver will be part of the RIDES program, a writing center intervention that supports culturally and linguistically diverse students with practical language skill instruction, which is sometimes not prioritized in a writing center consultation. The tutors in the audience were not familiar with corpus-driven methods as pedagogical interventions. The time we spent introducing data-driven learning (DDL) pedagogical activities helped them consider nontraditional activities they can use in writing center consultations. One such activity is our “Transition words” activity, which introduces students to a variety of transition words and walks them through the process of noticing the types of transition words used, where they’re located in sentences, and the structures used with each one.

Our main takeaways as Crow researchers and teachers keen on sustaining outreach to diverse audiences at Hispanic Serving Institutions are the following:

  • Novice Corpus Users: Continue expanding our reach to audiences who do not have prior experience with corpus linguistics and make corpus-based pedagogical approaches accessible and approachable to nontraditional audiences like writing center tutors, teachers in training, and under-represented minorities.
  • Scaffolded Workshops: Develop a series of workshops, specific to the needs of novices in corpus linguistics, that scaffolds corpus-based teaching and research. This may be a beneficial step towards unpacking threshold concepts and making corpus linguistic methods less intimidating.
  • Undergraduate Audiences: Strategically reach out to undergraduate students and make our workshops accessible to this population. This is especially relevant since Crow has experience with working with undergraduates on our research team.
  • Teachers in Training: Explore opportunities to work with teachers in training who may not have sufficient TESOL or TESL background and training. Sometimes a lack of training is due to a lack of resources, and this realization further helps us address the ACLS-funded outreach goals for Crow.
  • Writing Center Directors: Build relationships with writing center directors who are keen on introducing new pedagogical interventions for writing center consultations and tutor training programs. Writing center tutorials usually focus on tutee writing, so shifting the paradigm towards mentor texts could be a beneficial intervention with tutees who need more language instruction support. This paradigm shift favors descriptive over prescriptive approaches and defies the deficit model in tutoring multilingual writers.

We thank Rachel Hawley for inviting us and helping us attract an audience. We look forward to applying what we’ve learned to our next workshops. 

We are pleased to share that Crow researchers will be hosting a series of workshops targeted at teacher-scholars who, like us, value inclusive approaches to studying and teaching writing. These free hands-on workshops will be held on Zoom, making them accessible to people across the globe.

Our first workshop, “Corpus Data Scraping and Sentiment Analysis,” will be hosted by Adriana Picoral, PhD, assistant professor of data science at the University of Arizona. 

Flyer for Crow workshop Nov 7, 2020
Workshop flyer. PDF also available.

Corpus Data Scraping and Sentiment Analysis
Saturday, November 7, 10am to 12pm (Arizona Time/MST)

In this workshop, we will scrape Amazon reviews using the rvest R package to build a corpus of product reviews. We will then do some sentiment analysis from a critical perspective. We welcome corpus linguists who are not yet familiar with R but are interested in expanding their coding skills.

Register through Zoom. For more information, please contact Dr. Picoral.

Future workshops will include other subject matter including grant writing, developing distributed teams, applying for dissertation fellowships, building learner corpora, and more. Got a workshop suggestion? Let us know!