Understanding Corpus Linguistics: A report on teaching corpus linguistics for the first time at the Australian National University

Written by Danielle Barth (Australian National University, ARC Centre of Excellence for the Dynamics of Language)

Last semester (S1 2022) I taught Corpus Linguistics, the first time it has been offered at the Australian National University. It ran as a “special topics” course but has now officially been added to our regular offerings. We have probably needed this course for a while, given how much corpus linguistics and corpus development we see at ANU from staff and students, and given the resurgence of corpus linguistics in Australia more generally. I was very happy to teach it and to use my new textbook Understanding Corpus Linguistics, co-written with Stefan Schnell, which came out at the end of 2021. Corpus typology is a relatively new field but an important way of understanding differences between languages. Understanding Corpus Linguistics (see Figure 1) is the first linguistics textbook to include a chapter on corpus typology. Stefan Schnell also has a corpus typology project, MultiCAST, which is described in the textbook.

Figure 1 Understanding Corpus Linguistics (book cover)

It would be a little bit obtuse to write about the excitement of teaching without acknowledging that the past two years have been extremely difficult for students and staff (including me). Lockdowns, uncertain states of the world, increased caring responsibilities, family losses, and getting to grips with new technology have combined to make teaching and learning harder than ever. The pandemic has changed a great deal of how we approach teaching in the university sector. Some of these adaptations are likely to remain for the foreseeable future, and some are certainly beneficial, like increased accessibility and flexibility for students.

In any case, the Corpus Linguistics course saw us move from the zoomasphere to the hybrid/dual-delivery environment at ANU. I was actually a little surprised to find out that students really wanted to be back in the classroom! We had a large group of students who came in person to each class, a smaller group that was always online and a few students who had to move back and forth due to illness. We had a three-hour block (with a nice chit-chatty coffee break) and several online asynchronous activities, such as quizzes for R programming practice and regular discussion forum posts. Students had to complete three projects and present a research article during the semester.

We covered the principles of corpus linguistics using Understanding Corpus Linguistics and investigated many examples of corpus linguistic studies through our readings. I aimed for wide exposure to corpus methods and theoretical areas. This part of the course also gave students practice in critical reading and presentation skills. We also spent a small chunk of time on an increasingly important academic skill: managing expectations and productivity in stressful times. A few students have let me know that they appreciated this. We exchanged some strategies that have been working for us (I learned some new techniques from my students as well) and acknowledged that self-care needs to be part of academia.

Our assessments included comparing corpora and their uses, building and analysing a corpus from subtitle transcripts found on the web, and finally a project based on SCOPIC (Social Cognition Parallax Interview Corpus), a project I co-lead with Prof Nick Evans. This was a very specific kind of research-led teaching: students recorded short videos and then annotated the files in ELAN using specifications from our SCOPIC project (see Figure 2). I think it is really important for students to understand the amount of work and the myriad decisions that go into transcription (and translation), something that seems very basic until you have to do it yourself. Further, it is useful for students to understand that annotation is not a straightforward process, but involves many theoretical decisions. What is great about corpus annotation is that it often requires us to confront and re-evaluate our theories.

Figure 2 ELAN annotation example

Then we exported data out of ELAN and used R to reshape it into a format that we could use to label categories and do some basic descriptive statistics. The ELAN-to-R pipeline is something I have created scripts for, and this was a chance to train students to do the same. This will be valuable for the many students in the class who are interested in doing linguistic fieldwork. Students analysed their own data for their projects, but our last week of class was spent using a class dataset. I combined everyone’s data into one dataset (we had five languages including multiple English recordings) and then as a class we went through some more advanced analysis techniques and some troubleshooting – especially with ggplot graphs! It was also a great chance to discuss metadata, including aspects of metadata and personal data we have to be careful about in different cultural contexts. It felt a little self-indulgent to have students do something as specific as ELAN-to-R corpus typological analyses, but I also think there are many parts of this workflow that will be useful for students. Feedback from students reflected that they liked being able to see the process going from data collection to theoretical outcomes.
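
For readers who have not seen this kind of workflow, here is a minimal sketch of the reshaping and counting step. It is not the course scripts, which were written in R; it uses Python only for illustration. The column layout (tier, speaker, start, end, annotation value), the tier name "category" and the labels in the sample are all hypothetical, and a real ELAN tab-delimited export will differ depending on the export options chosen.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical tab-delimited export: tier, speaker, start (ms), end (ms), value.
# The real column layout depends on the options chosen when exporting from ELAN.
SAMPLE_EXPORT = (
    "utterance\tA\t0\t1500\tshe said hello\n"
    "utterance\tB\t1600\t2400\thello\n"
    "category\tA\t0\t1500\treported_speech\n"
    "category\tB\t1600\t2400\tgreeting\n"
    "category\tA\t2500\t4000\treported_speech\n"
)


def read_annotations(text):
    """Turn tab-delimited export text into a list of row dictionaries."""
    fields = ["tier", "speaker", "start_ms", "end_ms", "value"]
    reader = csv.DictReader(io.StringIO(text), fieldnames=fields, delimiter="\t")
    rows = []
    for row in reader:
        # Times arrive as strings; convert them so durations can be computed.
        row["start_ms"] = int(row["start_ms"])
        row["end_ms"] = int(row["end_ms"])
        rows.append(row)
    return rows


def category_counts(rows, tier="category"):
    """Count how often each label occurs on the given (hypothetical) tier."""
    return Counter(r["value"] for r in rows if r["tier"] == tier)


def mean_duration_ms(rows, tier="category"):
    """Mean annotation duration in milliseconds for each label on the tier."""
    durations = defaultdict(list)
    for r in rows:
        if r["tier"] == tier:
            durations[r["value"]].append(r["end_ms"] - r["start_ms"])
    return {label: sum(d) / len(d) for label, d in durations.items()}


rows = read_annotations(SAMPLE_EXPORT)
counts = category_counts(rows)
durations = mean_duration_ms(rows)
print(counts)
print(durations)
```

Combining several students' recordings into a class dataset would amount to reading each export the same way and adding a column for the recording or language before counting.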

I am very pleased that the School of Culture, History and Language at ANU will now offer a biannual Corpus Linguistics course (although I was told I should probably rethink the name). Corpus linguistics is both a theory and a method, and both aspects have benefits for students who use data. Corpus linguistics provides general data science skills useful for all kinds of jobs. It also shows how many small decisions (what do you do with contractions? how do you label code switching? where did you source the data? what do you do with uneven metadata?) add up and matter for analyses. Corpus linguistics requires a great deal of thoughtfulness that is not always apparent from reading a finished study. It is therefore really important for students to experience this for themselves. Corpus linguistics also requires a great deal of problem solving: how do you go from a question to data and, sometimes, what question can you ask given your data? Corpus linguistics also requires project documentation and accountability. In many cases, there are several decisions that could be made that are all reasonable. It is part of the researcher’s duty to record and motivate their choices. All of these are general skills that are valuable inside and outside academia. For all these reasons, I am happy we had a successful corpus linguistics course that will continue at the ANU.

Understanding Corpus Linguistics was a really good base for the course because it provided so much background that we could use our course time to focus on the more practical aspects of corpus building, analysis and research discussion. Given the number of international students in the course, and in Australia generally, I was really glad that our textbook and our reading selection provided excellent examples of research from a wide range of languages. Based on student feedback and my own reflection on the course, I think the next time around we will spend a little more time on applied corpus linguistics, as this is an area with a lot of exciting, problem-solving research.

A summary of my Corpus Linguistics course reading and activity plan can be found on my website.