Advanced Text-As-Data: Word Embeddings, Deep Learning, and Large Language Models

Course taught in English

In recent years, the surge in the availability of textual data, ranging from digitized archival documents and political speeches to social media posts and online news articles, has led to a growing demand for statistical analysis of large corpora. Once dominated by sparse bag-of-words models, the field of Natural Language Processing (NLP) is now increasingly driven by dense vector representations, deep learning, and the rapid evolution of Large Language Models (LLMs). This course offers a hands-on introduction to this new generation of models for computational text analysis, with a focus on political science research and applications.

The class will cover four broad topics. We start with an overview of how to represent text as data, moving from sparse representations via bag-of-words models to dense representations using word embeddings. We then discuss the use of deep learning models for text representation and downstream classification tasks. Next, we cover transformer models, the foundation of state-of-the-art machine learning models for text analysis. Lastly, we discuss several applications of Large Language Models to social science tasks.

The course will consist of lectures and hands-on coding in class. Lectures will be conducted in English, but students are free to ask questions in Portuguese. Students will have time in the mornings to practice the code seen in class, and we will suggest additional coding exercises. We assume students attending this class have taken, at a minimum, an introductory course in statistics and have basic knowledge of probability distributions, calculus, hypothesis testing, and linear models. The course will use a mix of R and Python, two programming languages students should ideally be familiar with. That said, students should be able to follow the course even if they are just starting with either language.

Tiago Ventura

Assistant Professor of Computational Social Science at the McCourt School of Public Policy at Georgetown University. Before joining Georgetown, he was a Postdoctoral Researcher at the Center for Social Media and Politics at NYU. He received his Ph.D. in Political Science from the University of Maryland, College Park, where he remains a faculty affiliate of iLCSS. After completing his doctorate, he gained industry experience as a Disinformation Researcher at Twitter.

Sebastian Vallejo Vera

Assistant Professor in the Department of Political Science at the University of Western Ontario. He is also director of the Interdisciplinary Laboratory for Computational Social Science - Mexico (iLCSS) and a Research Associate at the Hewlett Packard Enterprise Data Science Institute (University of Houston). Previously, he was a Postdoctoral Researcher in the Department of Political Science at the University of Houston and an Assistant Professor at Tecnológico de Monterrey, Mexico. He received his Ph.D. in Political Science from the University of Maryland, College Park.

Format: Remote (online)

Date and time: July 7 to 10, from 2:00 p.m. to 5:00 p.m.