Corpora

We have made several corpora available to researchers, teachers, and students for corpus analysis using an online interface (with restricted access to full texts, to avoid copyright violations):

  • Australian Brown corpus: a corpus of written Australian English (1931-2006), ca. 1 million words, compiled by Peter Collins and Xinyue Yao
  • Australian budget speeches: a corpus of Australian budget speeches by Labor and Liberal politicians (1981-2019), ca. 200,000 words, compiled by Annabelle Lukin
  • The Macquarie Laws of War Corpus (MQLWC): a corpus of all documents in the International Committee of the Red Cross International Humanitarian Law database (1856-2019), ca. 392,000 words, compiled by Annabelle Lukin and Rodrigo Araújo e Castro
  • The Diabetes News corpus (DNC): a corpus of Australian newspaper articles on diabetes (2013-2017), ca. 250,000 words, compiled by Monika Bednarek and Georgia Carr
  • The Sydney Corpus of Television Dialogue: a corpus of dialogue from US American fictional television series, ca. 275,000 words, compiled by Monika Bednarek

The corpus search interface is available through CQPweb. To create an account, click here: create account.

We strongly recommend you read the corpus documentation (where available) to better understand the contents of these corpora. The corpora are lemmatised, part-of-speech tagged, and semantically tagged. You can undertake frequency analysis, collocation analysis, keyness analysis, concordancing, etc.
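For readers new to these terms, the following is a minimal Python sketch of what a relative frequency and a log-likelihood keyness score measure. It is an illustration only, not a description of how CQPweb computes its results: the function names are hypothetical, the token lists are invented toy data, and log-likelihood is just one commonly used keyness statistic.

```python
import math
from collections import Counter


def relative_frequency(tokens, word, per=1_000_000):
    """Occurrences of `word` per `per` tokens (hypothetical helper)."""
    return Counter(tokens)[word] / len(tokens) * per


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word: study corpus vs. reference corpus."""
    total = freq_study + freq_ref
    # Expected frequencies if the word were spread evenly across both corpora.
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Toy token lists, invented for illustration only.
study = "the patient has diabetes and the diabetes is managed".split()
reference = "the budget is large and the deficit is small".split()

word = "diabetes"
score = log_likelihood(
    study.count(word), len(study), reference.count(word), len(reference)
)
print(f"{word}: {relative_frequency(study, word, per=100):.1f} per 100 tokens, LL = {score:.2f}")
```

A higher log-likelihood score means the word's frequency differs more strongly between the two corpora than chance alone would suggest; a word used equally often in both scores zero.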

Thanks go to Chao Sun, Andressa Rodrigues Gomide, Andrew Hardie, Prihantoro, and Michael Lynch for help with CQPweb, and to Peter Collins, Xinyue Yao, and Annabelle Lukin for sharing their corpora.

Information about new corpus tools that we have developed can be found under Resources.