ETL Process
Contents
ETL Process¶
The Extraction-Transformation-Loading process consists of gathering and preparing data for data modelling, analysis, and visualizations. It consists of 3 main stages:
Extraction: extracting the raw data from one or more sources (via an API or a simple bulk download of tabular data, for instance);
Transformation: transforming the data through feature selection, calculations, field conversions, joins or concatenation to other pertinent datasets etc. This steps usually takes the most time, as this is the stage during which analysts gather domain knowledge of the data’s scheme and properties (and usually identify the shortcomings or limitations of the dataset).
Loading: transferring the transformed dataset onto an online warehouse system (such as a Cloud storage application like Amazon Web Services, or a code repository like GitHub) which can be accessed, iterated upon, and updated.
1. Extraction¶
Lyft’s Bay Wheels historical ridership dataset represents the bulk of the raw data. The data is stored within zipped csv files for each month of a given year, so I clicked and downloaded all the files within my date range of interest: i.e. January 2019 - August 2021.
I unzipped the files and saved each csv into a folder corresponding to the year the data was created (thus, I had three folders for 2019, 2020, and 2021). I then created a helper function to combine the monthly dataset into a single csv for each given year.
You might think why not just combine all of the yearly dataset into one csv file? which is not an unreasonable thought. Wrangling data, in my experience, is an iterative process where incongruities in the raw dataset are uncovered throughout a series of trials and errors during the Transformation process. With these particular sets of data, I observed that their schema were not consistent between years (2019’s data schema was changed due to rebranding). So dealing with each yearly set at once was less strenuous down the line.
What could be improved:
Writing a batch script to download the csv files directly from Lyft’s website;
Issue: Lyft does not provide an API to call upon their historical dataset. Furthermore, due to the rebranding of Bay Wheels from Ford Go Bike in mid-2019, which is reflected in the filepath convention to the files, that added a layer of difficulty to work around with.
Writing a batch script to unzip and store the zip files locally, without having to manually define the filepaths;
Issue: I tried! But the zip files each contained a folder dedicated with a csv file formatted for MacOS, which I did not need to extract. Dealing with these issues cumulatively were starting to creep into the time I had dedicated for ETL; programmatic solutions here would have saved me 5-10 minutes at most.