Open-access materials

Resources

Free tutorials, interactive tools, R scripts, data sets, and course materials · All open-access via LADAL and GitHub

ladal.edu.au GitHub

Tutorials

Below are links to tutorials I created for the Language Technology and Data Analysis Laboratory (LADAL).

DATA SCIENCE BASICS

Working with Computers — setting up a research-ready digital environment; file and folder organisation, data storage, and digital workflows (doi: 10.5281/zenodo.19332989)
Introduction to Data Management — basic data management techniques: folder organisation, file naming conventions, and data documentation practices (doi: 10.5281/zenodo.19332868)
Reproducible Research — principles of reproducibility, version control basics, documentation strategies, and reproducible workflows (doi: 10.5281/zenodo.19332935)
Introduction to Quantitative Reasoning — the scientific method, empirical reasoning, and why quantitative methods matter for humanities research (doi: 10.5281/zenodo.19329151)
Basic Concepts in Quantitative Research — variables, observations, measurements, and the distinction between descriptive and inferential statistics (doi: 10.5281/zenodo.19329153)

R BASICS

Why R? — reasons for choosing R for language research
Getting Started with R — installing R and RStudio, basic R syntax, variables, functions, and your first R script (doi: 10.5281/zenodo.19332947)
Loading and Saving Data in R — reading CSV, Excel, plain text, and other file formats; writing data back to disk (doi: 10.5281/zenodo.19332909)
String Processing in R — text manipulation with stringr: concatenation, splitting, replacement, and pattern matching (doi: 10.5281/zenodo.19332969)
Regular Expressions in R — character classes, quantifiers, anchors, capture groups, and lookahead/lookbehind (doi: 10.5281/zenodo.19332943)
Handling Tables in R — creating and manipulating data frames, subsetting, reshaping, merging, and tabulating data with tidyverse tools (doi: 10.5281/zenodo.19332975)
Working with R: Control Flow, Functions, and Programming — loops, conditionals, custom functions, and programming patterns in R
Reproducibility with R — R Markdown, Quarto, version control with Git, and R Projects for organised workflows (doi: 10.5281/zenodo.19332937)

How-to guides:

Creating R Notebooks with R Markdown and Quarto — reproducible analysis documents with embedded code, output, and narrative text (doi: 10.5281/zenodo.19332919)
Publishing with Bookdown and Quarto — creating and publishing long-form documents and websites from R
Creating Interactive Jupyter Notebooks — building interactive teaching materials and reproducible research notebooks (doi: 10.5281/zenodo.19332892)

DATA COLLECTION AND ACQUISITION

Compiling a Corpus: From Texts to Analysis-Ready Data — principles and practice of corpus design, data collection, cleaning, and metadata organisation (doi: 10.5281/zenodo.19332645)

How-to guides:

Downloading Texts from Project Gutenberg — batch downloading and cleaning public domain texts via the gutenbergr package (doi: 10.5281/zenodo.19332882)
Web Scraping with R — extracting text and data from websites using rvest
Simulating Data with R — generating synthetic datasets for method testing and teaching
Converting PDFs to Text — extracting text from PDF files for further processing

DATA VISUALIZATION

Introduction to Data Visualization in R — principles of effective visualisation; scatter plots, bar charts, line plots, and box plots with ggplot2 (doi: 10.5281/zenodo.19332890)
Mastering Data Visualization with R — advanced ggplot2: faceting, small multiples, combining plots, and interactive visualisations (doi: 10.5281/zenodo.19332872)
Interactive Visualizations in R — animated and interactive graphics with plotly and gganimate
Conceptual Maps — spring-layout visualisations of semantic similarity using igraph and ggraph
Geo-spatial Data Visualization with R — creating typological and distribution maps with leaflet

Showcase tutorials:

Comparing Methods for Conceptual Maps — comparing word co-occurrence (PPMI), TF-IDF, and GloVe as inputs for conceptual maps (doi: 10.5281/zenodo.19332087)
Creating Vowel Charts in R — extracting formant values in Praat and plotting vowel charts in R

STATISTICS

Descriptive Statistics with R — mean, median, variance, standard deviation, distributions, and outlier detection (doi: 10.5281/zenodo.19332864)
Basic Inferential Statistics using R — t-tests, chi-square tests, correlation, and interpretation of p-values (doi: 10.5281/zenodo.19329155)
ANOVA, MANOVA, and ANCOVA using R — analysis of variance for one or more outcome variables and covariates (doi: 10.5281/zenodo.19329144)
Regression Concepts — the conceptual foundations of regression modelling
Regression Analysis in R — linear, logistic, and ordinal regression with model diagnostics and reporting (doi: 10.5281/zenodo.19332945)
Mixed-Effects Models in R — random effects, contrast coding, model fitting with lme4, and interpretation (doi: 10.5281/zenodo.19332913)
Structural Equation Modelling — latent variable models and path analysis with lavaan
Tree-Based Models in R — decision trees, random forests, and ensemble methods for linguistic data (doi: 10.5281/zenodo.19242479)
Cluster and Correspondence Analysis in R — hierarchical clustering, k-means, and correspondence analysis (doi: 10.5281/zenodo.19242479)
Introduction to Lexical Similarity — measuring lexical overlap and distance between texts or varieties
Semantic Vector Space Models in R — PPMI matrices, LSA via SVD, and GloVe embeddings for semantic analysis (doi: 10.5281/zenodo.19332955)
Dimension Reduction Methods — PCA, MDS, t-SNE, and UMAP for high-dimensional linguistic data
Power Analysis — sample size planning and effect size estimation for linguistic studies

Showcase tutorials:

Practical Phylogenetic Methods for Linguistic Typology — genealogically sensitive methods using glottoTrees; by Erich Round & Martin Schweinberger
Reinforcement Learning and Text Summarisation — applying reinforcement learning (RL) to NLP, focusing on text summarisation
Designing and Analyzing Survey and Questionnaire Data — survey design, Likert scales, and statistical analysis of questionnaire data
Eye-Tracking Data Analysis in R — preprocessing and analysing eye-tracking data for psycholinguistic research

TEXT ANALYTICS / TEXT MINING / CORPUS LINGUISTICS

Introduction to Text Analysis: Concepts and Foundations — key concepts, core methods, and research design considerations for corpus linguistics and digital humanities
Introduction to Text Analysis: Practical Implementation in R — concordancing, word frequency, collocations, keywords, POS tagging, NER, and dependency parsing (doi: 10.5281/zenodo.19332976)
Finding Words in Text: Concordancing with R — KWIC displays, search patterns, regular expressions, and filtering concordances (doi: 10.5281/zenodo.19332093)
Collocation and N-gram Analysis in R — measuring collocation strength with MI, t-score, log-likelihood, and other association measures
Keyness and Keyword Analysis in R — identifying vocabulary over- or under-represented in a target corpus relative to a reference corpus
Tagging and Parsing with R — part-of-speech tagging and dependency parsing in 60+ languages using UDPipe
Network Analysis using R — building and visualising linguistic networks; igraph, network metrics, community detection (doi: 10.5281/zenodo.19332917)
Topic Modelling with R — Latent Dirichlet Allocation (LDA): fitting, tuning, interpreting, and visualising topic models (doi: 10.5281/zenodo.19332979)
Sentiment Analysis in R — polarity scoring and eight basic emotion categories using the NRC lexicon (doi: 10.5281/zenodo.19332959)
Automated Text Summarisation with R — extractive and abstractive text summarisation methods
Spell Checking with R — spell checkers, OCR error correction, custom dictionaries, and batch processing (doi: 10.5281/zenodo.19332967)
Readability Analysis in R — Flesch, Gunning Fog, SMOG, Coleman-Liau, and Dale-Chall readability measures (doi: 10.5281/zenodo.19659678)
Word Embeddings and Vector Semantics — word2vec, fastText, and GloVe: training, loading, and applying pre-trained embeddings
BERT and RoBERTa in R: Transformer-Based NLP — fine-tuning and applying transformer models for text classification and NER
Deep Learning with R: Recurrent Neural Networks and TensorFlow — LSTMs and GRUs for sequence modelling in linguistics
Local Large Language Models in R with Ollama — running LLMs locally; text generation, classification, NER, summarisation, and embeddings (doi: 10.5281/zenodo.19332921)
Privacy-Preserving Analysis with Local LLMs — analysing sensitive data without cloud APIs using locally-hosted language models

Showcase tutorials:

Classifying American Political Speeches — document classification using machine learning and text features
Topic Modelling of Charles Dickens’ Novels — iterative STM workflow; interpreting topics for social criticism and literary realism; by Gerold Schneider, Max Lauber & Martin Schweinberger
Corpus Linguistics with R — a complete corpus-based analysis, exemplified by gender and age differences in swearing in Irish English
Analysing Learner Language — computational approaches to second language acquisition research
Computational Literary Stylistics with R — concordancing, keyword analysis, stylometry, and character network visualisation on literary texts (doi: 10.5281/zenodo.19332905)
Lexicography and Creating Dictionaries with R — computational lexicography and dictionary construction

Software Development / Programming / Tools

Below are links to interactive browser-based tools I created for LADAL. Each tool is a Jupyter notebook that runs in your browser, so no installation is required.

Concordancing Tool — generates KWIC (keyword-in-context) displays of words or phrases across uploaded texts; results downloadable as Excel or CSV
Collocation Tool — calculates association measures (MI, t-score, log-likelihood) to identify phraseological patterns in uploaded texts
Keyword Tool — identifies vocabulary that is statistically over- or under-represented in your texts relative to a reference corpus, using G², chi-squared, and log-ratio
POS Tagging Tool — adds part-of-speech tags to texts in 60+ languages using UDPipe; tagged output downloadable
Corpus Cleaning Tool — removes or replaces words, XML/HTML tags, URLs, and other patterns across uploaded text files
Network Analysis Tool — builds and visualises word co-occurrence networks from uploaded texts; networks downloadable as PNG, tables as Excel or CSV
Topic Modelling Tool — generates topic models using LDA; results downloadable as an Excel spreadsheet
Sentiment Analysis Tool — scores texts for positive/negative polarity and eight basic emotion categories; results downloadable

For Students

General Notes for Students attending my Courses (Merkblatt für Seminare)
You will find a document with general information about my seminars here. Please read this document if you are attending or plan to attend one of my seminars! (last updated 2015/02/16)

Model term paper
You will find a model term paper here. It shows the expected structure, content, and formatting of term papers, and you can also use it as a template. (last updated 2015/04/08)