Monday 6 – Friday 10 February 2023
Minimum 2 hours of live teaching per day
14:00 – 16:00 CET
This course provides a highly interactive online teaching and learning environment, using state-of-the-art online pedagogical tools. It is designed for a demanding audience (researchers, professional analysts, advanced students) and capped at a maximum of 16 participants so the teaching team can cater to the specific needs of each individual.
You will learn how to work with data of various sizes in R, drawing on the power of the tidyverse.
Through coding tutorials, you'll also learn the basic principles of machine learning and how to apply them.
3 credits: Engage fully with class activities
4 credits: Complete a post-class assignment
Akitaka is a Postdoctoral Research Fellow at the Institute for Analytics and Data Science (IADS). Before joining IADS, he was a Research Fellow in Data Science in LSE's Department of Methodology. He earned his PhD in political science at Rice University in Houston.
His research interests lie in data science and politics, in particular in the statistical methodology for scaling political behaviour, and natural language processing of political texts.
The course has two core topics: big data and machine learning.
For the first topic, big data, we start by asking: What is big data? Why is it difficult to work with? We then learn the best solutions in R for working with big data depending on its size, from locally stored data objects to databases hosted in the cloud.
For the second topic, machine learning, we will learn basic concepts such as:
In statistical analysis, social science has traditionally emphasised explaining outcomes in terms of inputs. Machine learning has an overlapping but clearly different orientation.
By contrasting the inference-based approach with the prediction-focused approach, you will come to understand the fundamental ideas of machine learning.
After theoretical discussions, we will apply machine learning techniques to various analytical tasks in social sciences.
Data Management in R
We discuss big data: what it is and why it's difficult to work with. We cover setting up computation environments in the cloud, introduce cloud computing concepts, and explain why the migration to cloud computing is occurring. You will work with large data using tidyverse, which provides a unified interface for various data solutions, from data frames to databases.
Databases and Machine Learning Basics
We learn methods for handling big data using various types of databases, which offload the data from R's working space while keeping it highly accessible. We explore the basics of relational databases, such as concepts, rules, and design, followed by the syntax of SQL queries. We then discuss the general logic of machine learning to understand the basics.
Regression Problem
Revisiting linear regression, we learn how to interpret the regression problem in the machine learning framework, covering cases where the outcome is a continuous quantity. We explore the topic of variable selection, especially when there are numerous input features, and learn about widely used methods for regularised regressions such as Ridge regression and LASSO. We also introduce the concept of (hyper-)parameter tuning through cross-validation.
Classification Methods 1
Moving to cases where the outcome is categorical, we discuss supervised classification, in which we apply machine learning algorithms to predict known values of outputs. Using logistic regression, a classification method familiar to social scientists, you will learn how to evaluate models in a machine learning framework. We then learn the concept and implementation of support vector machines (SVMs).
Classification Methods 2
Continuing from the previous day, we focus on tree-based methods. Starting with a simple tree, we'll move on to more sophisticated methods such as random forests and boosting. To finish, we will revisit the issue of data size and get an overview of the methodology for distributed computation.
The course includes approximately five hours of pre-recorded lectures, plus an online forum on Slack where the Instructor and students can freely discuss the lecture materials and coding.
Approximately two hours of each day will be an online seminar, during which we will learn how to apply the concepts and knowledge gained from pre-course lecture materials through Q&A sessions and live lab work.
In the live lab, you will be given several coding tasks and asked to code along with the Instructor. Some tasks are left as homework, which we will discuss on the online forum and during the following day’s live lab. You will also have the opportunity to meet the Instructor virtually during daily office hours.
You will learn how to work in a cloud computing workspace using R as the primary statistical software, with RStudio Server. You will set up a cloud computing environment on Amazon Web Services by yourself. We will distribute example scripts and assignments through GitHub so you can learn how to use online version control systems for collaboration and research accountability.
The course assumes you have some familiarity with the R statistical language and can conduct basic data handling in R (opening .csv files, working with data frames). If you don’t have this, we suggest you take the Introduction to R course.
You should also have basic knowledge of standard statistical analysis in social science, such as linear regression and hypothesis testing. These are covered in the Introduction to Inferential Statistics course.
Before Day 1, please complete the following:
Each course includes pre-course assignments, including readings and pre-recorded videos, as well as daily live lectures totalling at least three hours. The instructor will conduct live Q&A sessions and offer designated office hours for one-to-one consultations.
Please check your course format before registering.
Live classes will be held daily for three hours on a video meeting platform, allowing you to interact with both the instructor and other participants in real time. To avoid online fatigue, the course employs a pedagogy that includes small-group work, short and focused tasks, and troubleshooting exercises that use a variety of online applications to facilitate collaboration and engagement with the course content.
In-person courses will consist of daily three-hour classroom sessions, featuring a range of interactive in-class activities including short lectures, peer feedback, group exercises, and presentations.
This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc.). Registered participants will be informed at the time of change.
By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, please contact us before registering.