Language Data Commons of Australia

Rescuing vulnerable language collections.

The Challenge

Australia is a massively multilingual country, in one of the world’s most linguistically diverse regions. Significant collections of this intangible cultural heritage have been amassed, including collections of Australian Indigenous languages, regional languages of the Pacific, and Australian English.

There are also language collections important for cybersecurity (AusTalk, Australian National Corpus, corpora of regional languages), for gauging popular sentiment (Australian Twitter Corpus), and for emergency communication (languages of the region and some Indigenous languages).

However, much of Australia’s language data is scattered, hard to find, and in danger of being lost. Many collections remain under-used and researchers lack the tools and skills to exploit their research potential.

The Response

We’ve established the Language Data Commons of Australia (LDaCA), an integrated national infrastructure that supports language research. It enables researchers and communities to access and use nationally significant collections of written, spoken, multi-modal and signed text.

The project is:

  • improving researchers’ digital skills and raising awareness of best practice in digital research
  • rendering valuable collections of national significance more findable, accessible, interoperable and reusable (FAIR) while adhering to CARE principles
  • developing the integrated national technical infrastructure to analyse language collections at scale.

It supports researchers to deliver innovative research outcomes, and opens up the social and economic possibilities of Australia’s language data for translational research in the national interest.

LDaCA:

  • addresses the challenge of balancing research needs with respect for community rights over language and cultural collections
  • highlights contributions that language research and HASS disciplines can make to STEM research and non-academic applications
  • positions Australia internationally as a leading contributor of language collections and digital infrastructure.

LDaCA has not only built an integrated national technical infrastructure for language data; by creating foundational infrastructure, it is also contributing to the success and impact of the HASS and Indigenous RDC.

The Australian Text Analytics Platform is also part of the Language Data Commons of Australia.

Who Will Benefit

LDaCA gives researchers more widespread access to Australia’s rich language resources, accelerating the development of language data analysis capability in Australian research and industry.

The Partners

LDaCA is part of the ARDC’s HASS and Indigenous Research Data Commons. It previously received support from the ARDC.

Our partners are:

  • The University of Queensland (lead)
  • ARDC
  • Australian National University
  • Monash University
  • The University of Melbourne
  • The University of Sydney
  • AARNet
  • First Languages Australia
  • Australian Institute for Aboriginal and Torres Strait Islander Studies
  • PARADISEC
  • ARC Centre of Excellence for the Dynamics of Language
  • Digital Observatory (QUT)
  • CLARIN

Outcomes

LDaCA is a sustainable long-term repository for language data collections of national significance. This has implications for the development of Australia’s economy, national security and social and cultural well-being.

The work of LDaCA to date has been focused on the sustainability of data as well as offering tools and training for the collection and analysis of language data. Our achievements towards this goal include:

  • developing policies and governance structures for long-term data storage and access
  • developing a technology stack which enables secure storage and provides a basis for tools and services now and in the future
  • establishing relationships with various communities to encourage sustainable data management and data (re)use practices
  • developing notebooks that enable researchers to learn how to apply text analytics to their own data or collections held in LDaCA.

To date, LDaCA has:

  • given 17 conference presentations
  • presented over 40 workshops, reaching ~1000 people
  • secured 25 datasets and built 24 data migration tools
  • created 75 software repositories, including some public tools, such as an RO-Crate profile, a metadata vocabulary, and Crate-O, a GUI tool for working with those resources
  • engaged with 8 Indigenous communities/organisations in the development process.

Timeframe

Ongoing

Current Phase

In progress

ARDC Co-investment

$3,794,101

Project lead

Professor Michael Haugh, School of Languages and Cultures, The University of Queensland