STAT 410/710 Data Collection & Acquisition (Strategies and Platforms)
Version 1.6, 12/20/2021
This pipeline typically includes components such as data collection, storage, cleaning,
reshaping, exploratory analysis, modeling, and reporting. The start of this pipeline, the data
acquisition activity, is given little attention in most analytics-related courses.
There are multiple courses available to students that focus on data science programming
applications (specifically Python and R) and there are courses focused on analytics which
usually include statistical and machine learning tools. In these courses the data are
typically provided to students directly. Consequently, a large part of the practical side of
the pipeline is ignored; specifically, how should you collect your data?
From a business perspective, knowledge about how to thoughtfully collect data translates
to improved efficiency and reliability. A suitable data collection strategy can save an
organization both time and money.
The primary goal of this course is to fill the data acquisition gap and thus enhance the
student’s understanding of the complete data science pipeline. At the same time, important
current ideas such as data confidentiality and ethical considerations will be included.
The course is structured in two parts. There is a “Strategies” component that addresses
different data collection strategies. It will discuss sample designs, experimentation, and
observational studies. The focus is not on a deep dive into the methodological analyses of
these data but rather on recognizing the different approaches and weighing their pros and
cons. When and why should you use each one?
The second part of the course is about “Platforms” and covers the practical implementation
of the different strategies. Given the data science perspective of this course, this part
focuses on web-enabled approaches. Familiarity with R and/or Python is a prerequisite for
this course, so we will leverage these skills.
GOALS
At the end of the course, students will have a solid grasp of different data collection
strategies and when and how they can be applied in practice.
They will have:
(i) Designed and fielded an online sample survey.
(ii) Designed and fielded an online experiment (A/B test).
(iii) Collected data through web scraping activities and/or using an API.
(iv) Summarized their collected data and subsequent inferences, culminating in an
in-class presentation.
PREREQUISITES
Familiarity with either R or Python is expected, specifically with the RStudio or Jupyter
Notebook platforms. Courses such as Stat 477/777 or 405/705 would meet this