Open-access materials
Resources
Free tutorials, interactive tools, R scripts, data sets, and course materials · All open-access via LADAL and GitHub
Tutorials
Below are links to tutorials I created for the Language Technology and Data Analysis Laboratory (LADAL).
DATA SCIENCE BASICS
- Working with Computers — setting up a research-ready digital environment; file and folder organisation, data storage, and digital workflows (doi: 10.5281/zenodo.19332989)
- Introduction to Data Management — basic data management techniques: folder organisation, file naming conventions, and data documentation practices (doi: 10.5281/zenodo.19332868)
- Reproducible Research — principles of reproducibility, version control basics, documentation strategies, and reproducible workflows (doi: 10.5281/zenodo.19332935)
- Introduction to Quantitative Reasoning — the scientific method, empirical reasoning, and why quantitative methods matter for humanities research (doi: 10.5281/zenodo.19329151)
- Basic Concepts in Quantitative Research — variables, observations, measurements, and the distinction between descriptive and inferential statistics (doi: 10.5281/zenodo.19329153)
R BASICS
- Why R? — reasons for choosing R for language research
- Getting Started with R — installing R and RStudio, basic R syntax, variables, functions, and your first R script (doi: 10.5281/zenodo.19332947)
- Loading and Saving Data in R — reading CSV, Excel, plain text, and other file formats; writing data back to disk (doi: 10.5281/zenodo.19332909)
- String Processing in R — text manipulation with stringr: concatenation, splitting, replacement, and pattern matching (doi: 10.5281/zenodo.19332969)
- Regular Expressions in R — character classes, quantifiers, anchors, capture groups, and lookahead/lookbehind (doi: 10.5281/zenodo.19332943)
- Handling Tables in R — creating and manipulating data frames, subsetting, reshaping, merging, and tabulating data with tidyverse tools (doi: 10.5281/zenodo.19332975)
- Working with R: Control Flow, Functions, and Programming — loops, conditionals, custom functions, and programming patterns in R
- Reproducibility with R — R Markdown, Quarto, version control with Git, and R Projects for organised workflows (doi: 10.5281/zenodo.19332937)
How-to guides:
- Creating R Notebooks with R Markdown and Quarto — reproducible analysis documents with embedded code, output, and narrative text (doi: 10.5281/zenodo.19332919)
- Publishing with Bookdown and Quarto — creating and publishing long-form documents and websites from R
- Creating Interactive Jupyter Notebooks — building interactive teaching materials and reproducible research notebooks (doi: 10.5281/zenodo.19332892)
DATA COLLECTION AND ACQUISITION
- Compiling a Corpus: From Texts to Analysis-Ready Data — principles and practice of corpus design, data collection, cleaning, and metadata organisation (doi: 10.5281/zenodo.19332645)
How-to guides:
- Downloading Texts from Project Gutenberg — batch downloading and cleaning public domain texts via the gutenbergr package (doi: 10.5281/zenodo.19332882)
- Web Scraping with R — extracting text and data from websites using rvest
- Simulating Data with R — generating synthetic datasets for method testing and teaching
- Converting PDFs to Text — extracting text from PDF files for further processing
DATA VISUALIZATION
- Introduction to Data Visualization in R — principles of effective visualisation; scatter plots, bar charts, line plots, and box plots with ggplot2 (doi: 10.5281/zenodo.19332890)
- Mastering Data Visualization with R — advanced ggplot2: faceting, small multiples, combining plots, and interactive visualisations (doi: 10.5281/zenodo.19332872)
- Interactive Visualizations in R — animated and interactive graphics with plotly and gganimate
- Conceptual Maps — spring-layout visualisations of semantic similarity using igraph and ggraph
- Geo-spatial Data Visualization with R — creating typological and distribution maps with leaflet
Showcase tutorials:
- Comparing Methods for Conceptual Maps — comparing word co-occurrence (PPMI), TF-IDF, and GloVe as inputs for conceptual maps (doi: 10.5281/zenodo.19332087)
- Creating Vowel Charts in R — extracting formant values in Praat and plotting vowel charts in R
STATISTICS
- Descriptive Statistics with R — mean, median, variance, standard deviation, distributions, and outlier detection (doi: 10.5281/zenodo.19332864)
- Basic Inferential Statistics using R — t-tests, chi-square tests, correlation, and interpretation of p-values (doi: 10.5281/zenodo.19329155)
- ANOVA, MANOVA, and ANCOVA using R — analysis of variance for one or more outcome variables and covariates (doi: 10.5281/zenodo.19329144)
- Regression Concepts — the conceptual foundations of regression modelling
- Regression Analysis in R — linear, logistic, and ordinal regression with model diagnostics and reporting (doi: 10.5281/zenodo.19332945)
- Mixed-Effects Models in R — random effects, contrast coding, model fitting with lme4, and interpretation (doi: 10.5281/zenodo.19332913)
- Structural Equation Modelling — latent variable models and path analysis with lavaan
- Tree-Based Models in R — decision trees, random forests, and ensemble methods for linguistic data (doi: 10.5281/zenodo.19242479)
- Cluster and Correspondence Analysis in R — hierarchical clustering, k-means, and correspondence analysis (doi: 10.5281/zenodo.19242479)
- Introduction to Lexical Similarity — measuring lexical overlap and distance between texts or varieties
- Semantic Vector Space Models in R — PPMI matrices, LSA via SVD, and GloVe embeddings for semantic analysis (doi: 10.5281/zenodo.19332955)
- Dimension Reduction Methods — PCA, MDS, t-SNE, and UMAP for high-dimensional linguistic data
- Power Analysis — sample size planning and effect size estimation for linguistic studies
Showcase tutorials:
- Practical Phylogenetic Methods for Linguistic Typology — genealogically-sensitive methods using glottoTrees; by Erich Round & Martin Schweinberger
- Reinforcement Learning and Text Summarisation — applying reinforcement learning (RL) to text summarisation as an NLP task
- Designing and Analyzing Survey and Questionnaire Data — survey design, Likert scales, and statistical analysis of questionnaire data
- Eye-Tracking Data Analysis in R — preprocessing and analysing eye-tracking data for psycholinguistic research
TEXT ANALYTICS / TEXT MINING / CORPUS LINGUISTICS
- Introduction to Text Analysis: Concepts and Foundations — key concepts, core methods, and research design considerations for corpus linguistics and digital humanities
- Introduction to Text Analysis: Practical Implementation in R — concordancing, word frequency, collocations, keywords, POS tagging, NER, and dependency parsing (doi: 10.5281/zenodo.19332976)
- Finding Words in Text: Concordancing with R — KWIC displays, search patterns, regular expressions, and filtering concordances (doi: 10.5281/zenodo.19332093)
- Collocation and N-gram Analysis in R — measuring collocation strength with MI, t-score, log-likelihood, and other association measures
- Keyness and Keyword Analysis in R — identifying vocabulary over- or under-represented in a target corpus relative to a reference corpus
- Tagging and Parsing with R — part-of-speech tagging and dependency parsing in 60+ languages using UDPipe
- Network Analysis using R — building and visualising linguistic networks; igraph, network metrics, community detection (doi: 10.5281/zenodo.19332917)
- Topic Modelling with R — Latent Dirichlet Allocation (LDA): fitting, tuning, interpreting, and visualising topic models (doi: 10.5281/zenodo.19332979)
- Sentiment Analysis in R — polarity scoring and eight basic emotion categories using the NRC lexicon (doi: 10.5281/zenodo.19332959)
- Automated Text Summarisation with R — extractive and abstractive text summarisation methods
- Spell Checking with R — spell checkers, OCR error correction, custom dictionaries, and batch processing (doi: 10.5281/zenodo.19332967)
- Readability Analysis in R — Flesch, Gunning Fog, SMOG, Coleman-Liau, and Dale-Chall readability measures (doi: 10.5281/zenodo.19659678)
- Word Embeddings and Vector Semantics — word2vec, fastText, and GloVe: training, loading, and applying pre-trained embeddings
- BERT and RoBERTa in R: Transformer-Based NLP — fine-tuning and applying transformer models for text classification and NER
- Deep Learning with R: Recurrent Neural Networks and TensorFlow — LSTMs and GRUs for sequence modelling in linguistics
- Local Large Language Models in R with Ollama — running LLMs locally; text generation, classification, NER, summarisation, and embeddings (doi: 10.5281/zenodo.19332921)
- Privacy-Preserving Analysis with Local LLMs — analysing sensitive data without cloud APIs using locally-hosted language models
Showcase tutorials:
- Classifying American Political Speeches — document classification using machine learning and text features
- Topic Modelling of Charles Dickens’ Novels — iterative STM workflow; interpreting topics for social criticism and literary realism; by Gerold Schneider, Max Lauber & Martin Schweinberger
- Corpus Linguistics with R — a complete corpus-based analysis of gender and age differences in swearing in Irish English
- Analysing Learner Language — computational approaches to second language acquisition research
- Computational Literary Stylistics with R — concordancing, keyword analysis, stylometry, and character network visualisation on literary texts (doi: 10.5281/zenodo.19332905)
- Lexicography and Creating Dictionaries with R — computational lexicography and dictionary construction
Software Development / Programming / Tools
Below are links to interactive browser-based tools I created for LADAL. Each tool is a Jupyter notebook that runs in your browser — no installation required.
- Concordancing Tool — generates KWIC (keyword-in-context) displays of words or phrases across uploaded texts; results downloadable as Excel or CSV
- Collocation Tool — calculates association measures (MI, t-score, log-likelihood) to identify phraseological patterns in uploaded texts
- Keyword Tool — identifies vocabulary that is statistically over- or under-represented in your texts compared to a reference corpus using G², chi-squared, and log-ratio
- POS Tagging Tool — adds part-of-speech tags to texts in 60+ languages using UDPipe; tagged output downloadable
- Corpus Cleaning Tool — removes or replaces words, XML/HTML tags, URLs, and other patterns across uploaded text files
- Network Analysis Tool — builds and visualises word co-occurrence networks from uploaded texts; downloads networks as PNG and tables as Excel or CSV
- Topic Modelling Tool — generates topic models using LDA and downloads results as an Excel spreadsheet
- Sentiment Analysis Tool — scores texts for positive/negative polarity and eight basic emotion categories; results downloadable
For Students
General Notes for Students Attending My Courses (Merkblatt für Seminare, i.e. information sheet for seminars)
You will find a document with general information about my seminars here. Please read this document if you are attending or plan to attend one of my seminars! (last updated 2015/02/16)
Model term paper
You will find a model term paper here. It shows the expected structure, content, and formatting of term papers, and you can use it as a template and adopt its formatting. (last updated 2015/04/08)