Registration for this course is closed

From Digital Content to Discovery

Semester

Semester 2, 2025-2026

Type of course

Methodological and Practical Courses

Date

April 9, 2026

Location

University of Groningen

Duration

1 day

Maximum number of participants

ECTS

0.5 EC will be awarded for participation in the complete course

Staff

Claudia C. Kitz

From Digital Content to Discovery: Web-Scraping and Unsupervised Machine Learning for Social Science Research

Content

As social life increasingly takes place online, web-scraping and unsupervised machine learning provide social scientists with powerful tools to access and analyze large-scale digital data—transforming words into numbers to reveal new perspectives on social processes. This course introduces participants to the core concepts, tools, and practical applications of web-scraping and unsupervised machine learning, with a focus on how they can be used to support original research in the social sciences.

The day familiarizes participants with web-scraping and unsupervised machine learning methods and shows how these approaches can be applied to research in the social sciences. In the first part of the day, we will discuss the kinds of research questions that can be addressed using online data and unsupervised techniques. We will focus on the practical challenges and benefits of using web-scraping and unsupervised learning in social research, including data processing, analysis techniques, and model interpretation. We will illustrate these approaches by discussing examples of research that uses large-scale digital trace data (e.g., from online platforms or forums). We will also share practical tips and tools for building scalable scraping pipelines and for choosing appropriate unsupervised methods to answer different research questions (e.g., (structural) topic modeling, sentiment analysis).

At the end of the day, participants will form small groups to discuss how web-scraping and unsupervised machine learning could be used to answer research questions relevant to their PhD projects. During this time, each group will work on a short pitch for a research study that incorporates digital data and unsupervised methods.

Time schedule

10:15 – 10:30: Arrival and Welcome

10:30 – 12:30: Introduction to web scraping and unsupervised machine learning

12:30 – 13:00: Lunch Break

13:00 – 14:00: Applied Examples

14:00 – 14:15: Coffee Break

14:30 – 16:00: Develop your research idea in groups

16:00 – 16:45: Pitch your ideas

16:45 – 17:00: Wrap-up

17:00: Goodbye and Drinks (optional)

Learning goals

By the end of the workshop, participants will:

Understand the ethical and methodological foundations of web-scraping in social research
Gain hands-on experience with tools for collecting and cleaning online text data
Learn the principles behind common unsupervised machine learning techniques such as clustering and topic modeling
Be able to critically assess the strengths and limitations of these methods for different types of research questions
Develop a basic prototype or research idea that integrates digital data and unsupervised learning into their own work

Literature

Compulsory:

Campion, E. D., & Campion, M. A. (2025). A review of text analysis in human resource management research: Methodological diversity, constructs identified, and validation best practices. Human Resource Management Review, 35(2), 101078.
Speer, A. B., Perrotta, J., Tenbrink, A. P., Wegmeyer, L. J., Delacruz, A. Y., & Bowker, J. (2023). Turning words into numbers: Assessing work attitudes using natural language processing. Journal of Applied Psychology, 108(6), 1027.

If more PhD students are interested in participating than there are places available, places for this course will be allocated by juniority. This means that we will consider who became a KLI member more recently.