Corpus and Repository of Writing

2018 AACL conference presentations

Our Crow team reached a new milestone at the 14th American Association for Corpus Linguistics (AACL) conference this past September: our first presentations of inter-institutional projects!
The two presentations, “Annotating learner data for lexico-grammatical patterns: A comparison of software tools” and “Lexico-grammatical Patterns in First Year Writing across L1 Backgrounds”, were given by Crowbirds from the University of Arizona, Purdue University, and Northern Arizona University.

Adriana Picoral leading a PowerPoint presentation in front of a classroom of researchers.

Adriana Picoral leading the first presentation, “Annotating learner data for lexico-grammatical patterns: A comparison of software tools.”

The first project, “Annotating learner data for lexico-grammatical patterns: A comparison of software tools,” was led by Adriana Picoral. The team, consisting of Adriana Picoral, Dr. Randi Reppen, Dr. Shelley Staples, Ge Lan, and Aleksey Novikov, compared three tools: 1) the Biber tagger, a POS and syntactic tagger that integrates rule-based and probabilistic components; 2) the MALT parser, an open-source statistical dependency parser; and 3) the Stanford parser, another open-source statistical parser widely used in natural language processing applications. The corpus for this study was sampled from our larger inter-institutional corpus of first year writing (FYW) texts and consisted of 16 documents (27,930 tokens) from 3 institutions (Purdue University, University of Arizona, and Northern Arizona University) and 4 first language backgrounds (Arabic, Chinese, English, and Korean).
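
To give a concrete sense of what dependency-based extraction of one of these target features might look like, here is a minimal sketch using spaCy (not one of the three tools compared in the study); the feature definition is simplified for illustration.

```python
import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def noun_noun_sequences(text):
    """Return (modifier, head) pairs for noun-noun sequences, e.g. "research team".

    Simplified illustration: a noun attached to another noun via the
    "compound" dependency relation counts as a noun-noun sequence.
    """
    doc = nlp(text)
    return [
        (token.text, token.head.text)
        for token in doc
        if token.dep_ == "compound"
        and token.pos_ == "NOUN"
        and token.head.pos_ == "NOUN"
    ]

print(noun_noun_sequences("The research team analyzed the writing samples."))
# Expected output along the lines of [('research', 'team'), ('writing', 'samples')];
# exact results may vary by model version.
```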

All documents were annotated using all three tools. Gold standard labels were also created by up to four human coders for each document. Predicted labels from the three tools were then compared with the human-created gold standard labels. Precision (the proportion of a tool’s annotations for a feature that matched the gold standard) and recall (the proportion of gold standard annotations for a feature that the tool identified) were calculated for each of our target features (noun-noun sequences, attributive adjectives, relative clauses, and complement clauses) across the different tools. The team presented methods, including descriptions of the web-based interfaces built for human tag-checking, and the evaluation measures from all three tools. While the Stanford parser performed better when labeling our target clausal features, the Biber tagger performed better for the targeted phrasal features.
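
For readers less familiar with these evaluation measures, here is a minimal sketch of how per-feature precision and recall can be computed against gold standard labels. The token alignment and label names are simplified assumptions for illustration, not the team’s actual scripts.

```python
def precision_recall(predicted, gold, feature):
    """Compute precision and recall for one feature.

    predicted and gold are lists of per-token label sets, aligned by position.
    Precision: of the tokens the tool labeled with the feature, how many the
    gold standard also labeled. Recall: of the tokens the gold standard labeled
    with the feature, how many the tool found.
    """
    true_pos = sum(1 for p, g in zip(predicted, gold) if feature in p and feature in g)
    pred_pos = sum(1 for p in predicted if feature in p)
    gold_pos = sum(1 for g in gold if feature in g)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return precision, recall

# Toy example: per-token label sets for a five-token text (labels are hypothetical).
predicted = [{"ATTR_ADJ"}, set(), {"NN_SEQ"}, {"NN_SEQ"}, set()]
gold = [{"ATTR_ADJ"}, set(), {"NN_SEQ"}, set(), {"REL_CL"}]
print(precision_recall(predicted, gold, "NN_SEQ"))  # (0.5, 1.0)
```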

Post-processing scripts will be used to improve both tools’ accuracy, and the team may combine their output to achieve higher performance rates on automated annotation of our learner data in the future.

The second project presentation, “Lexico-grammatical Patterns in First Year Writing across L1 Backgrounds,” was led by Dr. Shelley Staples with help from fellow Crowbirds Dr. Randi Reppen, Aleksey Novikov, and Ge Lan, and from collaborators Dr. Qiandi Liu and Dr. Chris Holcomb of the University of South Carolina. The group compiled a balanced corpus (612,100 words) of argumentative essays across four L1s (English, Chinese, Arabic, and Korean), which was then tagged with the Biber Tagger and improved for accuracy with post-tagging scripts. The researchers investigated the use of six features, both quantitatively and qualitatively: attributive adjectives, premodifying nouns, that- and wh- relative clauses, that- verb complement clauses, and that- noun complement clauses. ANOVA was applied to test the differences among the four L1 groups and across two institutions (Northern Arizona University and University of South Carolina).
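
As an illustration of the statistical step, a one-way ANOVA across L1 groups can be run with SciPy. The per-text rates below are invented placeholders for one feature, not the study’s data.

```python
from scipy import stats

# Hypothetical normalized rates (e.g., attributive adjectives per 1,000 words)
# for essays from each L1 group; these numbers are placeholders, not study data.
english = [42.1, 38.5, 40.2, 44.7]
chinese = [51.3, 48.9, 53.0, 49.5]
arabic = [47.2, 45.8, 50.1, 46.4]
korean = [49.9, 52.4, 47.7, 50.8]

# One-way ANOVA: is there a significant difference in means across the four groups?
f_stat, p_value = stats.f_oneway(english, chinese, arabic, korean)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # differences are significant if p < 0.05
```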

The results showed significant differences in feature use across the four L1 groups (p < 0.05), particularly for attributive adjectives, premodifying nouns, that- noun complement clauses, and that- and wh- relative clauses. Compared to L1 English writers, L2 writers tended to rely more on the repetition of phrasal features. They also used more wh- relative clauses than that- relative clauses, which could be explained by more prescriptive instruction on wh- relative clauses for L2 writers, as opposed to the influence of oral language and a lack of register awareness for L1 English writers.

Finally, attributive adjectives and that- relative clauses showed significant differences between the two institutions for Chinese L2 writers (p < 0.05), whereas no significant difference was found for any feature between the two institutions for L1 English writers. A possible reason for this difference is that the USC students, who used more of the two features, may have had higher proficiency, while the NAU students were in a bridge program working on improving their proficiency. An alternative explanation is that relative clauses were included in the USC syllabi, while it is unclear whether this instruction was received at NAU.

Both conference presentations were very well received at AACL. We plan to submit the first paper for publication to NAACL or ACL in the near future.

Symposium Recap: Session Highlights

Continuing our symposium recaps, we want to share a bit about every session. Writing Research Without Walls 2018 hosted presenters from various US institutions and programs, including English, Rhetoric & Composition, Second Language Studies, and Engineering. Their contributions enriched our conversations about approaches to teaching and researching writing. The thematic relevance of these talks provided opportunities for scaffolding research initiatives and for networking among presenters with common research interests. These brief recaps follow the order of the program.

  • Neil Baird and Bradley Dilger showed how the discourse-based interview is an insightful research technique for investigating writers’ tacit knowledge, a rich source of data for writing researchers. We found the techniques they shared about updating this research method for digital media helpful and up-to-date.
  • Jie Gao, presenting on behalf of a team at Purdue and Arizona, described the form and function of L2 writers’ citation practices in first year writing courses. Their emphasis on defining the rhetorical function of these citations allowed us to witness connections with pedagogical materials and source texts.
  • Eunjeong Park’s mixed-methods study of lexical bundles combined analysis using a learner corpus with interviews and intervention in an L2 writing course. Attendees valued the depth her research design provided.
  • Ashley J. Velázquez discussed her investigation of L1 and L2 students’ problem-based writing in a first year engineering program. The mismatch between pedagogical materials and faculty expectations about writing quality was an interesting takeaway.

  • John Gallagher, Nicole Turnipseed, John Yoritomo and Julie Zilles focused on integrating writing instruction into engineering and science at Illinois Urbana Champaign. We benefited from their design of instructional materials and assessment of writing in engineering and physics courses.
  • Tatiana Teslenko exposed the advantages and challenges embedded in collaboration among faculty in writing studies and engineering courses at the University of British Columbia in Vancouver. Attendees found her emphasis on mentoring international graduate students who are serving as writing fellows in WID courses very insightful.
  • Tamara Roose described her approach to shaping both her curriculum and pedagogy in response to the input of the Chinese international students in her ESL writing course. We loved the way she asked audience members to voice her students’ writing.

  • Mariam Al Mayar presented an interesting profile of the needs of the Afghan student population in US institutions. Her contribution was instrumental in clarifying the literacy experiences of these students while learning English as a foreign language in Afghanistan, highlighting their challenges in transitioning to the US.
  • Estela Ene and Thomas Upton presented two studies. The first described their corpus-based move analysis of teacher-student chats in ESL online and hybrid classes. Their emphasis on the negotiation taking place between the teacher and students was eye-opening. The second continued the conversation about teacher feedback by comparing synchronous and asynchronous teacher e-feedback in ESL classes. Their findings about teacher feedback practices and students’ receptivity and interaction generated a lot of questions from the audience.
  • Negin H. Goodrich discussed the efficacy of applying a combination of two types of corrective feedback to promote accuracy of student texts in an international writing classroom. Her results show the benefits of integrating two types of feedback, which helped the audience reflect about the practices we use for assessing L2 writing.
  • Sweta Baniya, sharing a Purdue study of linked courses, introduced Adaptive Comparative Judgement as a method for holistic assessment of writing and comparison of various student work in a writing classroom. They presented alternatives for rubric-based methods and reflected on the groundwork needed to build criteria for assessments in linked courses.
  • Adriana Picoral’s presentation on native language identification was unique. She used computational methods to show how she constructed L1 background profiles from L2 writing. Her work inspired us as we think of the profiles of student writers we can create from the metadata accompanying corpus texts in Crow.
  • David O’Neil used corpus methods to investigate how fourteenth and fifteenth century alliterative poems were part of a continuous tradition dating back to the seventh century. His rigorous data coding methods of syntactic, prosodic, and rhetorical features were engaging and inspiring.

  • Alisha Karabinus and Lee Hibbard shared with us the results of a survey they conducted at Purdue in 2017 and 2018 to investigate student perceptions of pre-university writing instruction and experiences. The diverse profiles they constructed from their data will help first-year writing administrators design writing curricula and classes that cater for various needs.

  • Adam Steffanick shared the Vanderbilt University Writing Repository, a repository of texts by L2 student writers built through collaboration between libraries and second language researchers at Vanderbilt. Attendees compared the project to the methods we adopted in building the Crow corpus.

We found the conversations between writing researchers and engineering faculty particularly constructive, given our interest in interdisciplinary collaboration and its usefulness for designing curricula, pedagogies, and teaching artifacts that help bridge writing practices in academia and industry. We’re grateful to all the presenters for sharing their work.

Reflecting on the symposium: Our plenary speakers

It has been over a week since we wrapped up our Crow symposium, and we can’t stop thinking about the great conversations that took place. Our keynote speakers, Dr. Susan Conrad and Dr. Shondel Nero, both explored potential changes to writing instruction: the former by integrating it more into Engineering coursework, and the latter by engaging the vernacular rather than standard concepts of English. Their research presents us with the opportunity to transform and adapt pedagogical strategies to better suit the needs of students and challenge the thinking of instructors.

Plenary 1: Dr. Susan Conrad

In “Improving Writing Instruction in Engineering through Interdisciplinary Collaboration,” Dr. Conrad focused on the gap in writing style between Engineering students and practitioners and ways to address that gap without completely changing the curriculum. Her research has found that students write long, complex sentences thinking that makes them sound smart. By contrast, practitioners write simple and concise reports, which are easy for clients to skim and understand. Additionally, students tend to use superfluous terms, whereas practitioners are very careful about certainty and quantification due to the real-world implications of absolute, immeasurable language.

Dr. Conrad traced the differences between student writing methods and practitioner writing methods back to three instructor conventions. The first is for English faculty to have a formulaic set of rules for producing good writing, while practitioners rely on the development of judgement. The second convention is for writing instructors to give students more room for self-expression and independence, whereas practitioners function within tighter constraints. Lastly, instructors view writing as a series of style choices, but practitioners see writing as a form-follows-function process.

Dr. Conrad is working to address this gap in writing style (though she wished for a different term!) by collaborating with practitioners and Engineering faculty to determine best pedagogical practices. To achieve this goal, Dr. Conrad and her colleagues have created a website for their study, including materials to help students develop writing practices that more directly correspond to their field. By more closely matching practitioners’ work, Civil Engineering students will be better prepared for the workforce. We thoroughly enjoyed Dr. Conrad’s presentation and look forward to seeing future impacts of her project.

Plenary 2: Dr. Shondel Nero

Dr. Nero’s presentation, “Engaging Vernacular Englishes through Literature in the Writing Classroom: Paradoxes, Pedagogy, Possibilities,” addressed the dichotomy between “standard” language and the equally important but often ignored vernacular and urged for the rethinking of language awareness.

Working in applied linguistics and English as a Second Language (ESL), Dr. Nero witnessed the way the ideology of standard language produced educational inequality. For instance, a student from her native country of Guyana, where English is the official language, was placed in an ESL class despite his proficiency in English.

From these experiences, Dr. Nero raised questions about the validity of standard language within academia. Who gets to decide who is a native speaker? Why are all non-standard dialects considered “deformed” despite the great prevalence of multilingual speakers within American society and the fact that most multilingual speakers in classrooms are born in the US? Language is an ever-changing medium in which everyone has an accent, so why the shame in using vernacular speech in the classroom?

Dr. Nero’s answer to these questions started with exposing the myths surrounding standard language and vernacular language. The myth of standard language is supported by the belief that there can be only one superior language, which remains fixed and unchanging and is devoid of accents. These beliefs work alongside the Anglo assumption that knowing only the English language is enough. In contrast, the myth of the vernacular is that it is a “deformed” version of the standard, lacking in grammatical structure and spoken only by the uneducated or so-called lower classes.

To dispel these erroneous assumptions and shift academic attitudes away from language awareness and toward Critical Multilingual Awareness (CMLA), Dr. Nero has worked to foster greater acceptance of linguistic diversity among educators through workshops and the use of linguistically informed teaching material. She described ways to help teachers learn about different cultures so they could be more culturally responsive, and modeled an activity which included listening to vernacular literature then discussing the literary elements—figurative language, multiple layers of performance—which we can ignore if we focus on the differences between the vernacular and the “standard” language typically identified with literature.

Surprisingly, students were at first reluctant to engage the vernacular in classroom assignments. However, by challenging the assumption that learning in the classroom should take place only in “standard” English, the language attitudes of students and instructors became more receptive, and student writing showed greater linguistic awareness. Dr. Nero’s presentation demonstrated that by embracing and learning from the vernacular, academics can develop a culturally responsive pedagogy and more inclusive attitudes toward multilingual and multiethnic students.

In addition to our keynote speakers, there were many other presenters that we would like to highlight, and we got some great feedback about the Crow web interface, too. Our next few posts will cover the various topics and offer more takeaways from our 2018 symposium.

Thank you, symposium participants!

That’s a wrap! Everyone on the Crow team would like to thank the presenters and attendees who made our Writing Research Without Walls symposium a success. We’ll share more here shortly, including information about our next steps with the Crow system. For a quick recap, have a look at the tweets from the conference. We’ll be archiving them shortly.

Dr. Susan Conrad presenting “Improving Writing Instruction in Engineering through Interdisciplinary Collaboration” at our 2018 symposium

As always, our thanks to the Humanities Without Walls consortium and the Andrew W. Mellon Foundation for their support.

Symposium updates

As we make preparations for the Writing Research Without Walls symposium next week, we wanted to share a few concrete details with everyone.

  • Travel: The closest parking garage is the Grant Street Parking Garage on Grant Street (just north of State Street). For more travel information, see Purdue’s Visitor Information.
  • Location: All sessions will be held in the Stewart Center, Room 214, immediately east of the Purdue Memorial Union and a five minute walk from the Grant Street Garage.
  • Check in: You can check in and pick up your name tag outside Stewart 214 during breakfast. If you arrive later in the day, find a Crow team member.
  • Registration: If you have not yet registered, ask a Crow team member for help when you arrive.
  • Symposium Model: We do not have tracks of concurrent sessions. All attendees will have the opportunity to hear every presentation.
  • Program: We have posted the symposium program. If you see any errors, please accept our apologies and let us know so we can correct them.
  • Breakfast & Lunch: We provide a catered continental breakfast (8:00–9:00 am) and box lunches (11:45 am–12:15 pm) for all registrants. Dietary restrictions mentioned in your registration form have been taken into consideration.
  • Breaks: There will be short breaks between panels in addition to a half-hour catered coffee break at 2:30 pm.
  • Social Events: Dinner each night is on your own, but we will announce evening socials for presenters and attendees interested in mingling and further conversation.
  • Network access: All registrants will have access to Purdue’s wireless network.
  • Featured Sessions: Plenary talks are from 12:15 – 1:15 pm, and the Crow beta release is a one-hour interactive workshop (4:45 – 5:45 pm). Full abstracts for the plenary addresses are in the symposium program.
    • Susan Conrad Plenary (Friday Oct. 5): “Improving Writing Instruction in Engineering through Interdisciplinary Collaboration”
    • Crow beta release workshop (Friday Oct. 5): A workshop with the beta release of our web-based archive for research and professional development in writing studies
    • Shondel Nero Plenary (Saturday Oct. 6): “Engaging Vernacular Englishes through Literature in the Writing Classroom: Paradoxes, Pedagogy, Possibilities”
  • Tweeting: If you’ll be tweeting during the conference, please use the hash tag #wrww18.

Please contact us if you have questions. We look forward to seeing you soon.

Meet Our New Undergraduate Researchers, Emily & Sarah!

Picture of undergraduate researcher Emily Jones.

Emily Jones is a junior in Professional Writing at Purdue University.

Picture of undergraduate researcher Sarah Merryman.

Sarah Merryman is a senior in Professional Writing at Purdue University.

 

The Crow team is growing! Our team at Purdue has started off the school year with two new undergraduate researchers, Emily Jones and Sarah Merryman. Over the next year, they are looking forward to writing blog posts, applying for grants, sharing updates via Twitter, and learning new skills in coding and repository building.

Emily joined the Crow team over the summer after taking a class with team leader Bradley Dilger. This is her third year at Purdue, where she is studying Professional Writing with a minor in History and Creative Writing. She will primarily be focusing on content strategy and contributing to grant applications. She is currently applying her interest in design to help Crow establish an official logo.

Outside of Crow, Emily also works as Editorial Assistant for J-PEER, an engineering journal, interns with the literary magazine Sycamore Review, and participates in a research project on gendered violence in Victorian London. When she isn’t reading or writing, Emily enjoys exercising and cooking vegan food for (sometimes skeptical) friends. After graduating from Purdue, she hopes to work in editing and publishing for books or a magazine.

Sarah is a senior in Professional Writing with a minor in Communications. Sarah first heard about Crow after taking a class with Crow team leader Bradley Dilger. In the spring of 2018, she started as a project intern and wrote blog posts about the Crow Methodology Workshop series. Now, Sarah is excited to join Crow as a full-fledged intern and is eager to become even more involved next semester. She is especially looking forward to constructing grant proposals and working on repository building. Through Crow she has also discovered her love of computer coding after years of technophobia. To her own surprise, she is voluntarily taking lessons on Python coding from graduate lab practicum assistant Ge Lan—and loving it!

Aside from her work with Crow, Sarah also serves as a social media intern for the Purdue English Department and a Marketing and Assistant JTRP Editor at the Purdue University Press. She has yet to decide on a career path but is interested in community engagement and archival research. In her limited spare time, she enjoys reading historical fiction, watching old movies from the 1940s (Cary Grant anyone?), treasure hunting for awesome clothes at garage sales, and reminiscing about the days when everyone used the Oxford comma.

Sarah and Emily are hoping to continue working with Crow during their remaining time at Purdue. They are excited to be a part of Crow’s continued growth and success!

Article: Corpus-informed instruction of reporting verbs

Crow team members Ji-young Shin, Ashley Velázquez, Ola Swatek, and Shelley Staples are pleased to announce the publication of an article with Purdue alum R. Scott Partridge. This is the first of several we expect to see coming out of our work with reporting verbs.

“Examining the Effectiveness of Corpus-Informed Instruction of Reporting Verbs in L2 First-Year College Writing” appears in L2 Journal, issue 10.3. Here’s the abstract:

Previous research has shown that developing second language (L2) academic writers use a limited set of reporting verbs in comparison to more advanced writers (Biber & Reppen, 1998; Hinkel, 2003; Kwon, Staples, & Partridge, 2018; Neff et al., 2003; Staples & Reppen, 2016). These writers also tend to rely on verbs that are typical for conversation (Biber et al., 1999). The present study examines the effects of corpus-informed instruction on developing L2 writers’ learning of reporting verbs in a first-year writing course by comparing drafts of literature reviews before and after a workshop. The forty-five-minute workshop was designed to improve L2 writers’ lexical and functional uses of reporting verbs using corpus-informed materials. The researchers compared the literature review drafts written by 40 students who participated in the workshop to 38 randomly chosen drafts from our corpus. The results show an increase in the experimental groups’ reporting verb lexical variety and a decrease in the use of verb types used in speech in favor of types used in academic writing. The results suggest that corpus-informed instruction may support L2 writers in the development of lexical and functional reporting verb use.
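
As a rough illustration of how reporting verb lexical variety might be measured in a draft, here is a small sketch; the verb list and matching procedure are simplified assumptions, not the study’s actual materials or methods.

```python
import re
from collections import Counter

# Illustrative (not exhaustive) set of reporting verb lemmas; the study's
# actual inventory and lemmatization procedure may differ.
REPORTING_VERBS = {"say", "state", "argue", "claim", "suggest", "show",
                   "find", "report", "note", "indicate"}

def reporting_verb_variety(text):
    """Count reporting verb tokens and distinct types in a draft (crude matching)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = Counter(t for t in tokens if t.rstrip("s") in REPORTING_VERBS)
    return {"tokens": sum(hits.values()), "types": len(hits)}

draft = "Smith (2015) states that writers argue more directly; Lee (2017) suggests otherwise."
print(reporting_verb_variety(draft))  # {'tokens': 3, 'types': 3}
```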

Well done, Ji-young, Ashley, Ola, Shelley, and Scott!

2018 Teaching and Language Corpora (TaLC) conference workshop

Researchers from the Crow team returned to the 2018 Teaching and Language Corpora (TaLC) conference in Cambridge, England, to update the TaLC community on the progress we’ve made since our last appearance. Team members Ola Swatek, Hadi Banat, and Shelley Staples had previously attended the 12th TaLC conference in 2016 to present our plans for developing the Crow platform, and this year we were able to debut the working prototype. Approximately 25 conference delegates (as the local organizers called us) attended our half-day workshop, “Exploring variation and intertextuality in L2 undergraduate writing in English: Using the Corpus and Repository of Writing online platform for research and teaching.” The team guiding the workshop participants consisted of Adriana Picoral and Dr. Shelley Staples from the University of Arizona, along with graduate students Ji-young Shin, Aleksandra (Ola) Swatek, and Zhaozhe Wang from Purdue University. Mark Fullmer joined us virtually from Tucson, Arizona.

The purpose of the workshop was to test the Crow online interface with users who had never seen it before and to collect their feedback. This was the first time researchers and teachers outside of Crow had a chance to try the Crow online platform, and the workshop gave us practical feedback from some of our target audience on the usability of the platform before our public release.

As we were preparing for the workshop, we discussed how different the audience at this conference might be from audiences at conferences in the United States. The data for our project is currently collected from first-year writing courses at our U.S. institutions. These courses could be unfamiliar to European audiences, which made it particularly important to describe the purpose of these courses to the audience.

The workshop began with Dr. Staples’ introducing the project to the attendees, who were mostly unfamiliar with our data and the educational context from which the data comes.

The introduction was followed by more hands-on activities, since we are strong believers in the learning-by-doing approach. We scaffolded our exercises to move from simple to more advanced functions, promoting step-by-step, spiral learning. Ola started the first activity by introducing the corpus search function, and Adriana, Zhaozhe, and Ji-young guided the other activities one by one. Each time we finished a part of the workshop, our team asked the participants to write down feedback related to that particular functionality of our interface. Such feedback was one of the key reasons we decided to beta-test the interface with this particular group of users. The workshop attendees, most of whom are accustomed to navigating other corpus interfaces, are among the target users of our platform.

The main activities of the workshop included:

  • Introducing/practicing simple and advanced search and filter functions in the Crow corpus
  • Applying corpus functions for pedagogical purposes
  • Introducing/practicing search and filter functions in the Crow repository
  • Applying repository functions for pedagogical purposes, considering intertextuality between corpus and repository
  • Brainstorming research projects using the Crow platform
  • Introducing/practicing how-to on coding tools and bookmarking corpus searches

During our workshop, the attendees asked many important questions about the current and future design of the Crow platform. These questions gave us insights into what some of our users might be interested in seeing the platform do in the future. Of particular interest were the search functions and the possibility to download corpus data with rich metadata.

Workshop participants noted that the search engine was “easy to use,” “intuitive,” and “accessible”—promising feedback, considering that we want our tool to be used not only by researchers but by teachers as well. Audience members also indicated interest in the intertextuality opportunities offered by the link between repository and corresponding corpus materials for both research and pedagogical purposes. Most publicly available corpus tools don’t cater to teachers as an audience; though this particular audience could certainly benefit from a clear and accessible corpus of classroom materials, they don’t always have the training necessary to navigate research-based corpus tools. Attendees also reportedly enjoyed the design of the landing page, as well as the detailed nature of the repository and its interlinked documents.

We also gathered valuable feedback on what the audience would like to see improved in the platform. Among the suggestions were distinguishing the corpus and repository interfaces with a color scheme, improving the concordancing view, changing the font size and color, and improving the search function. Many more comments were added in the survey we conducted, and we plan to use that feedback to make the interface as user friendly as possible.

The TaLC experience was not only focused on academic interactions. The city of Cambridge has much to offer in terms of its beautiful scenery, rich history, and unique culture. The graduate students on our team went on a punting tour and learned the cultural and social history behind the most prominent colleges of Cambridge University and their students. It was a truly memorable experience.

Here’s a video prepared by the organizers showing what you missed if you weren’t able to attend the Teaching and Language Corpora conference.

 

Workshop 3: Reflecting halfway

Midway through our Methodology Workshop for Natural Language Programming series, the Crow team took a step back to reflect on the direction we wanted to take in our work and the role programming and coding would play in our long-term mission. Much of our discussion revolved around our identity as a research group, specifically concerns about creating a self-sustaining organization that can adapt to change.

Our discussion started with questions of achievement:

  • What tasks must we perform to accomplish our current milestones, such as TaLC and the Symposium? What additional milestones do we want to pursue?
  • How do we differentiate between internal resources, tools, and deliverables and external materials that could be shared with our partners?

From there, we generalized into more existential questions, such as our identity as a team and how that identity has changed because of these workshops.

  • What does it mean to be a Crow researcher?
  • What criteria will future Crow members need to meet?
  • Will the coding and programming skills covered during this workshop series be a standard expectation for all team members or only a select few?
  • How do we train incoming members?

To address these issues, we discussed creating personas for the different positions within the Crow team and using them to generate a specific set of criteria for each position. Our discussion then segued into the purpose of existing Crow members and how their personal goals intersect with team goals. Should current members be involved with Crow until the end of their PhD studies? How big a role should fourth-year students play in Crow? Bradley shared his view that all fourth-year students should be applying for fellowships in their specific fields of expertise, and that members who acquire fellowships should be helpfully “fired” and allowed to pursue these new opportunities. However, a great deal will depend on how much a student’s personal objectives dovetail with Crow research.

To that end, further dialogue was devoted to the need for more Crow meetings and greater articulation of member goals. A prime area of concern was balancing the coordination of internal management with the myriad other obligations on the Crow agenda. How do we sustain the succession of new students when current members leave? Financially, how do we continue our work if our grant requests are not successful? (The idea of counterfeiting was briefly considered but unanimously vetoed.) Furthermore, what are the next steps once we have a working corpus? Do we want our research to simply be open source, or should we create a service model where access to Crow is free but users are charged for support services from Crow members?

As the session concluded, we decided that questions regarding preparation for TaLC would be revisited at the summit meeting. Topics of immediate concern, such as how to transition workflow once students graduate and revisiting our one-, two-, and five-year plans, were scheduled for future discussion.

Workshop 2: Thinking Like a Programmer  

In this series of posts, we reflect on the Methodology Workshop for Natural Language Programming, coordinated by Crow developer Mark Fullmer and hosted at Purdue.

In the first session of the workshop, our Crow team gained a basic understanding of how to approach coding. The principles we took away from the session gave us a solid theoretical framework on which to build practical coding skills. We’ll be learning Python, given its simplicity, flexibility, and suitability for the text processing that is a core part of Crow research.

Principle 1: Automate everything. Automating as much of our data processing as possible not only increases our efficiency and accuracy, it also gives us the ability to plug pre-programmed segments into future projects with minimal fuss or extra effort.

Principle 2: Separation of concerns. Like many things in life, programming is easier when broken down into small steps, each one performing a different function, like assembly lines in a factory. Step by step, we worked through an example to consider the process of writing code. First, we separated words into individual entities using a delimiter, such as a comma. Next, we counted the frequency of each word. Lastly, we displayed the list for frequency analysis. Splitting our code into separate “factories” provides two advantages: 1) the code can easily be recycled for future programs, since it is much more efficient to tweak a small segment of code than to rewrite an entire program; and 2) it’s easier to test the accuracy of our program when it’s broken up into short code segments—simply modify your test when you want to reach a different result for that portion of code.
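
Here is a minimal sketch of that example in Python, with each step kept in its own small “factory.” The function names and the comma delimiter are illustrative choices, not prescribed ones.

```python
from collections import Counter

def split_words(text, delimiter=","):
    """Step 1: separate the raw text into individual word entities."""
    return [word.strip() for word in text.split(delimiter) if word.strip()]

def count_frequencies(words):
    """Step 2: tally how often each word occurs."""
    return Counter(words)

def display(frequencies):
    """Step 3: print the frequency list, most common words first."""
    for word, count in frequencies.most_common():
        print(f"{word}\t{count}")

# Each step can be tested, swapped out, or reused on its own.
raw = "corpus,writing,corpus,research,writing,corpus"
display(count_frequencies(split_words(raw)))
```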

Principle 3: Don’t make assumptions. We learned that writing code on the assumption that the results we need now will be the same results we need tomorrow is a crucial mistake. For instance, hardwiring a text processor to remove all apostrophes will make it useless if down the road we need to analyze possessive nouns. Instead, it is better to create an optional “factory” that can be removed or upgraded to obtain the desired result. We also shouldn’t assume that a computer can read a text in its current format. Elements such as capitalization, punctuation, and character encoding are not read by computers the way we read them and must be cleaned up before a text can be analyzed.
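
A short sketch of making such a cleaning step optional rather than hardwired (the function and parameter names are illustrative):

```python
def normalize(text, lowercase=True, strip_apostrophes=False):
    """Optional, configurable cleaning "factory".

    Apostrophe removal is off by default so possessive nouns can still be
    analyzed later if the research question changes.
    """
    if lowercase:
        text = text.lower()
    if strip_apostrophes:
        text = text.replace("'", "")
    return text

print(normalize("The Student's Draft"))                          # "the student's draft"
print(normalize("The Student's Draft", strip_apostrophes=True))  # "the students draft"
```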

Principle 4: Avoid hardcoding. When labeling our different “factories” we should leave room for flexibility or else the name won’t match the function when we make changes. We must maintain a balance between generalization and specificity.

Principle 5: Keep it simple, stupid. A hallmark of good Pythonic code is that the simplest methods are used even if it requires writing more code.

Principles 6 and 7: Convention over configuration, and write your code for the next programmer. Following standardized coding formats will make our code more accessible to other programmers than if we personalize code to our own preferences. To further help other programmers decipher our work, it is helpful to add inline comments that explain difficult aspects of our code and to follow existing syntactic conventions.

Principle 8: Don’t repeat ourselves. The goal with coding is to reuse and recycle. Instead of rewriting a slightly different version of the same code for multiple different programs, we should write the code once then modify it for different uses. Writing code is much like creating a résumé: build one, but tailor multiple drafts to different employers.
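
As a tiny illustration of the résumé analogy, here is one reusable counting function, tailored by its parameters rather than copied and edited for each use; the feature word lists are hypothetical.

```python
def count_feature(tokens, feature_words):
    """One reusable counter, "tailored" to a feature by its parameter
    rather than rewritten for each new feature."""
    return sum(1 for token in tokens if token in feature_words)

tokens = ["the", "big", "red", "dog", "that", "barked"]
adjectives = {"big", "red", "small"}        # hypothetical feature list
relativizers = {"that", "which", "who"}     # hypothetical feature list
print(count_feature(tokens, adjectives))    # 2
print(count_feature(tokens, relativizers))  # 1
```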

Principle 9: We don’t want to write new code unless it’s absolutely necessary. More than likely, someone else has already written the code we need, so there is no point in reinventing something we can borrow. This principle is the programmer’s version of “think smarter not harder.”

We concluded our session by discussing these principles and articulating them in other ways to see how well we understood them. After learning more about the coding process, Crow members felt confident moving forward into writing actual Python code. More on that in our next post!
