Big Data Collection and Management in R

Matt W. Loftis

Aarhus Universitet

Matt is an associate professor of political science at Aarhus University in Denmark where he teaches courses in data science, quantitative methods, and political control of bureaucracy.

His research addresses various dimensions of democratic political accountability, from delegation and blame avoidance to politicians' responsiveness to societal problems.

Twitter: @mattwloftis

Course Dates and Times

Monday 3 to Friday 7 August 2020
2 hours of live teaching per day
Courses will be either morning or afternoon to suit participants’ requirements

Prerequisite Knowledge

Experience with R is a major plus for this course but is not required if you’re willing to put in some extra time to get up to speed. 

You are strongly encouraged to join the course with a data-collection goal in mind: a website from which you want to scrape data, a web service you want to access, or some data processing problems you want to solve or automate.

Short Outline

This course provides a highly interactive online teaching and learning environment, using state-of-the-art online pedagogical tools. It is designed for a demanding audience (researchers, professional analysts, advanced students) and capped at 16 participants so that the teaching team (the Instructor plus one highly qualified Teaching Assistant) can cater to the specific needs of each individual.

Purpose of the course

This course will teach you the principles, and give you the tools, for collecting and managing large amounts of data from a wide variety of online sources. 

You will learn to use R for the full process of collecting, managing, and cleaning a large data archive. This includes pulling structured or unstructured data from various types of websites and web-based application programming interfaces (APIs), creating a well-organised archive that can be managed automatically from R, and then drawing from that archive to clean and process raw data into a finished data set ready for analysis.

Beyond these tools, we will discuss and apply principles and ethics for collecting data online and how to manage data in a way consistent with transparent and reproducible open science practices.

ECTS Credits

3 credits: Engage fully with class activities
4 credits: Complete a post-class assignment

Long Course Outline

Key topics for this course include web technologies like HTML, AJAX, and various types and subtypes of APIs (REST, CKAN, etc.). We will review each of these at varying levels of depth.

We will, of course, cover a suite of tools implemented in the R language for collecting and processing data, as well as for automating and easing large coding projects. 

As we will be writing software that automatically accesses resources on the internet, an additional key topic is the ethics and best practices of operating bots and collecting internet data.

How the course will work online

In advance of the course, I will distribute a variety of reading materials and guides preparing you, and your computer, to do the coursework.

During the course, you will start each day by reviewing written or video guides that introduce the day’s new concepts and tools. We’ll then meet for a live discussion of the day’s course material. After that, you will work in R on your own to do the daily exercises, using a pre-configured server running the RStudio integrated development environment, accessed through the RStudio Cloud service.

You can choose to stay connected to the group or to work individually on daily practice exercises. We will close each day’s meetings with a live Q&A or workshop-style session to discuss the material and assignments, and to prepare for the following day. I will also be available for individual questions or requests outside live course meetings, for troubleshooting or additional support. 

Each day, I’ll set a homework problem that stretches your ability to apply that day’s lessons. Group work is encouraged.

Additional Information


This course description may be subject to subsequent adaptations (e.g. to take into account new developments in the field, participant demands, or group size). Registered participants will be informed when any change is made.

By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, please contact us before registering.