What I do

Research & Projects

Computational corpus linguistics  ·  Digital discourse  ·  Language data science  ·  AU$16M+ in research funding

LADAL Director LDaCA CI ARDC HASS Digital Champion 2019
Corpus linguistics Digital discourse Language variation & change Social media analysis R / RStudio NLP & machine learning Statistical modeling Acoustic phonetics Reproducibility & open science FAIR data principles Research infrastructure

Research Profile

In my research I have mainly taken a variationist approach to statistically modelling linguistic variation within and across non-standard and post-colonial varieties of English. While I have mostly relied on corpus data and used R and RStudio to investigate linguistic phenomena, I have also engaged in acoustic analyses of audio data using the Bavarian Archive for Speech Signals’s The Munich Automatic Segmentation System (MAUS) and Praat, and have used questionnaire designs and experimental methods including web-based eye-tracking during my time at the Arctic University of Norway. I also have an interest in language documentation, anthropological linguistics, and fieldwork methods.

One basic issue underlying my research is to investigate how, why, and to what extent social, cultural, and psychological factors shape linguistic behaviour — not merely to describe differences between speaker groups, but to elucidate the mechanisms driving language change. I have become particularly interested in how semantically bleached elements such as discourse particles and degree adverbs (intensifiers) pattern across varieties and change over time. An article for which I won the ISLE Richard M. Hogg Prize in 2015 exemplifies this line of work, investigating psycholinguistic and social determinants of discourse eh in New Zealand English.

A major current focus is the analysis of discourse on social media, especially vulgarity and impoliteness in online and public discourse. Another abiding interest is reproducibility and open science in corpus linguistics — I have given talks including a plenary at ICAME 42 and co-edited a special issue of IJCL on this topic.

In 2019, I was a Digital Champion on behalf of the Australian Research Data Commons (ARDC) and I continue to promote digital methods in HASS by offering workshops on computational methods and giving talks about best practices in research data management and methodology.

In my current role at the School of Languages and Cultures, I established the Language Technology and Data Analysis Laboratory (LADAL). LADAL is a free, open-source, collaborative support infrastructure for digital and computational humanities, established in 2019 at the University of Queensland. Its tutorials and interactive notebooks have been accessed by over 500,000 users worldwide and have been viewed over 1.1 million times.


Research Projects

1. Language Data Commons of Australia (LDaCA)

Australia holds large and diverse collections of language data, many stored in short-lived repositories or in siloed institutional archives. The aim of LDaCA is to consolidate these into a nationally integrated research data infrastructure of high strategic importance — establishing governance, access policy, shared technical infrastructure, a discovery portal, and community engagement to secure and make accessible Australia’s linguistic heritage, including Indigenous language materials.

My contribution focuses on producing open-access training and support resources through LADAL and on building the text analytics capacity of the broader HASS research community. The Australian Text Analytics Platform (ATAP), which I co-led as Chair of the User Advisory Group from 2021–2023, was developed under this umbrella before being integrated into LDaCA in 2024. ATAP filled the gap between generic text analytics tools and highly specialised custom code, developing an integrated, notebook-based platform for processing and analysing language data.

  • Scheme: HASS Research Data Commons (RDC) and Indigenous Research Capability Program (ARDC)
  • Role: CI & Steering Committee (Project lead: Michael Haugh)
  • Funding period: 2021–2029 (Phase I: 2021–2023; Phase II: 2024–2029)
  • ATAP scheme: ARDC Platforms Program, 2021–2023

Selected publications: LADAL chapter (Springer, 2021) · Replication crisis chapter (CUP, 2025) · Reproducibility & corpus linguistics, IJCL 2025 · COVID-19 Twittersphere, Big Data & Society 2021 · in prep.: Introducing LADAL (Language Resources and Evaluation)

Selected talks: ICAME 42 Plenary · ICAME 42 LADAL talk · ARDC eResearch Summit 2019 · ARDC Webinar 2019 · LDaCA & LADAL, LiRI Zürich 2024 · LADAL UEF 2020 · Computational Thinking in the Humanities 2022


2. Vulgarity and Bad Language in Online and Public Discourse

This is one of my most active current research lines, combining corpus-linguistic, sociolinguistic, and computational methods to investigate vulgarity, profanity, and impoliteness in digital and public discourse across varieties of English. Using large-scale Twitter/X corpora and spoken data, I examine the frequency, distribution, social patterning, and discourse functions of vulgar language across Australia, the UK, and the USA — and increasingly in spoken interaction across World Englishes more broadly.

The project spans multiple funded sub-projects: a UQ Digital Cultures and Societies Hub project on swearing on Twitter (2022–2023, AU$15,075), a School of Languages and Cultures Research Support project with Svenja Kranich at Universität Bonn (2024, AU$3,000), and contributions to an international collaborative network. I co-edited a special issue of Lingua (Bad Language and Vulgarity Online and in Public Discourse, 2025/26) with Paula Rautionaho and Kate Burridge, and have an article under contract as the editorial introduction to that volume. This research has attracted widespread media attention, including coverage on Channel 10 News, five ABC Radio programmes, The Guardian, Der Spiegel, CNN, and over 96 other media reports reaching an estimated audience of 205 million with an advertising equivalent of AU$1.9 million.

  • Funding: UQ Digital Cultures and Societies Hub (AU$15,075, 2022–2023); SLC Research Support Scheme (AU$3,000, 2024)
  • Collaborators: Kate Burridge (Monash), Michael Haugh (UQ), Mikko Laitinen & Paula Rautionaho (Univ. of Eastern Finland), Sam Hames (UQ), Marissa Takahashi (QUT), Svenja Kranich (Univ. Bonn), Amir Sheikhan (UniSA)

Selected publications: Vulgarity in Online Discourse, Lingua 2025 · Vulgarity intro, Lingua 2026 · submitted: F%$# Twitter (Corpora) · submitted: Swearing online (Digital Scholarship in the Humanities) · submitted: Pop Goes Profanity · The Conversation, 2025

Selected talks: FoEiA 2025 · ICAME 46 2025 · LMU Munich 2025 · UniSA 2025 · Universität Bonn 2024 · Universität Bayreuth 2023 · F%$# Twitter, LDaCA Digital Fellowship 2024


3. Computational Analysis of Digital Discourse and Social Media

Beyond vulgarity, I pursue a broader research line in computational corpus analysis of digital and social media discourse. This encompasses text-analytic approaches to born-digital data including topic modelling, sentiment analysis, network analysis, keyword and collocation analysis, and computational discourse analysis. Key topics include metadiscourse in social media, address terms and politeness on Twitter across national varieties of English, conspiracy discourse on Reddit, and data-driven approaches to discourse structure.

This work is methodologically oriented as well as empirically substantive: I have made contributions to the development and evaluation of text-analytic methods themselves — for instance publishing on seeded topic modelling as a more appropriate alternative to unsupervised LDA for corpus-pragmatic research (Discourse Studies, 2024) and on accountability and meta-reflection in corpus-based discourse analysis (Corpus Linguistics and Linguistic Theory, 2024). A two-book contract with Cambridge University Press and Edinburgh University Press reflects the depth of this research line.

Selected publications: Seeded topic modelling, Discourse Studies 2024 · Corpus-based discourse analysis, CLLT 2024 · COVID-19 Twittersphere, Big Data & Society 2021 · Metadiscourse in Twitter, Routledge Handbook of Metadiscourse, 2024 · NLP meets corpus linguistics, Routledge 2024 · submitted: Conspiracy discourse on Reddit (Applied Corpus Linguistics) · submitted: Address terms on Twitter (World Englishes) · in prep.: Data-Intensive Pragmatics (Cambridge UP)

Selected talks: ALS 2020 COVID-19 Twitter · Data Science MeetUp Brisbane 2020 · FoEiA 2025 corpus standardisation


4. Adjective Amplification in English

This long-running research project investigates the amplifier (or intensifier) systems of varieties of English — how degree adverbs such as very, really, absolutely, and totally combine with gradable adjectives, how their frequencies and collocational preferences vary across national varieties and social groups, and how innovations spread through speech communities over time. The project spans diachronic corpus analysis, sociolinguistic modelling, learner corpus research, and L1-acquisition studies, and has produced some of my most widely cited work including the 2015 ISLE Richard M. Hogg Prize-winning article on discourse eh in New Zealand English.

Core questions include: what determines whether a new intensifier succeeds or fails in the lexical competition (totally vs awfully); whether psycholinguistic priming effects drive variability and change; how L2 learners acquire the variable use of amplifiers alongside native-speaker norms; and whether the same sociolinguistic constraints operate across Irish, Australian, New Zealand, Hong Kong, Indian, and Philippine English.

  • Funding: School of Languages and Cultures Targeted Research Support Scheme (AU$5,440, 2020–2021); ZFF Universität Kassel (€3,000, 2018–ongoing)

Selected publications: Adjective amplification, World Englishes 2024 · Australian English amplifiers, AJL 2021 · Irish English intensifiers, Anglistik 2021 · very among L2 learners, IJLCR 2020 · AmE fiction, JB 2020 · Modeling amplification, JB 2022 · Waning forms, JB 2021 · under revision: Priming in NZE (ELL) · in prep.: L1-acquisition of amplifier variation

Selected talks: JAECS 2020 HK/Indian/Philippine · ISLE 6 2021 waning forms · ICAME 40 2019 · ICAME 38 2017 · ISLE 4 2016 · ALS 2018 · AcqVA Aurora 2021 · Universität Innsbruck 2023


5. Discourse-Pragmatic Variation and L1 Acquisition

This project continues the research that grew out of my PhD dissertation on the discourse marker like in varieties of English. The overarching questions concern how pragmatic innovations — particularly discourse particles and markers — diffuse through speech communities, and at what age and in which order distinct functional variants are acquired by L1 speakers. I am investigating whether extra-linguistic social factors are acquired alongside the linguistic constructions themselves or whether the pragmatic structures are acquired first and sociolinguistic variation is added post-hoc. The project draws on corpus data from Irish, New Zealand, and British English and combines quantitative sociolinguistic modelling with developmental perspectives.

More broadly, this line of work connects to corpus pragmatics and the analysis of socio-pragmatic variation in Ireland and Scotland, documented in the edited volume I co-edited with Patricia Ronan (De Gruyter, 2024).

Selected publications: L1-acquisition of like, Functions of Language 2023 · like variability & change, IJGL 2019 · Speech-unit final like, EWW 2020 · like in Irish English, JB 2012 · Socio-pragmatic variation in Ireland & Scotland, De Gruyter 2024 · PhD dissertation, 2014

Selected talks: ICAME 33 Leuven 2012 · LCTG3 Greifswald 2011 · ICAME 31 Gießen 2010 · Leuphana Universität Lüneburg 2012


6. Language Documentation of Southern Low German

This project aims to create a speech corpus of one of the southernmost, as yet undocumented, varieties of Low German — Oberweserplatt, spoken in the region around the upper Weser river — to enable its documentation shortly before extinction. Oberweserplatt is endangered and has received no systematic linguistic documentation to date, making corpus creation urgent. The corpus is being developed through fieldwork using semi-automated transcription methods and will capture multiple registers and text types, with rich metadata about speakers, their social networks, and recording contexts. Beyond its archival and documentary value, the corpus will serve as a resource for variationist, contact-linguistic, and typological research on Low German.


7. Learner Language and Applied Linguistics

This project brings together my empirical and pedagogically-oriented work on how L2 learners of English develop their linguistic competence across multiple domains, including grammar, pragmatics, and discourse. A central concern is the relationship between learner corpus research and language teaching: by examining how learners pattern relative to native-speaker norms in large-scale corpora, this line of work generates actionable insights for curriculum design, written corrective feedback, and classroom pedagogy.

Key contributions include studies on how L1 background shapes the acquisition of amplifier variation (connecting directly to the Adjective Amplification project), an analysis of written corrective feedback and L2 error resolution (Journal of Second Language Writing, 2020), an investigation of Indonesian teacher trainees’ perceptions of data-driven learning (Applied Corpus Linguistics, 2021), a study of EAP students’ genre awareness in blended learning environments (Journal of English for Academic Purposes, 2021), and work on the use of corpus tools for L2 teaching and learning at undergraduate level. A broader methodological contribution is the examination of how learner corpus research and instructed second language acquisition can be combined as a productive methodological synergy (John Benjamins, 2025). Most recently, this line of work extends into intelligibility and pronunciation research, including a submitted paper on listener perceptions of intelligibility in multilingual contexts.

  • Collaborators: Peter Crosthwaite (UQ), Naomi Storch (Univ. Melbourne), Yon Visioni (UQ), Yuki Komiya

Selected publications: Corpora and ISLA, JB 2025 · Written corrective feedback, JSLW 2020 · Indonesian teacher trainees, Applied Corpus Linguistics 2021 · Genre awareness blended learning, JEAP 2021 · Learner corpus research & teaching, ARAL 2020 · very among L2 learners, IJLCR 2020 · submitted: Intelligibility & pronunciation (ARAL)

Selected talks: Corpus technology for language learning 2021 · Corpus technology invited talk 2021 · CALL Research Conference 2025


8. VowelChartProject

The VowelChartProject developed a pedagogically-oriented tool for acoustic phonetics in the L2 classroom, combining corpus-based acoustic analysis with individualised student feedback. The project extracted vowel formants (F1 and F2) from recordings of German, Russian, and Spanish learners of English using Praat and R, measuring the degree of target-language proximity for each vowel and for word-final devoicing. Students produced word lists and a short story, and subsequently received a personalised vowel chart plotting their formant values alongside native-speaker norms for their L1 group — giving them concrete visual evidence of where their vowel production diverges from the target.

The project ran across three funding phases (€77,860 total from BMBF via L3Prof, Universität Hamburg, 2016–2018) and has since evolved into a broader research line on computational acoustic analysis of learner speech. More recent work extends the methods to L1-Japanese and L1-Chinese learners of English and deploys machine-learning classifiers to distinguish learner from native-speaker vowel production.

  • Funding: BMBF via L3Prof, UHH (€77,860 across three phases, 2016–2018)

Example vowel chart showing F1 and F2 frequencies for learners and native speakers

Example vowel chart from the VowelChartProject

Selected publications: Komiya & Schweinberger, JB 2024 · Komiya & Schweinberger, SST 2022

Selected talks: ISLE 7 2023 L1-Chinese · ICAME 44 2023 · ICAME 43 2022 L1-Japanese · LiRI Zürich 2024 invited · Universität Kiel 2022 · Universität Freiburg 2023 · CuTLi 2017 VowelChartProject

(last updated 2026/05)

Back to top