Citi Bike Ridership & Public Safety During COVID-19

Tania Arya and Mausam Patel


Introduction

In the face of our nation's ongoing battle with the COVID-19 public health crisis, the Citi Bike rental bike stations in New York City, NY are still in business. These bikes are points of human contact and unfortunately can serve as vectors for the spread of germs and viruses such as COVID-19. Citi Bike is privately owned by Lyft, but operates as a public bike sharing system that leverages mobile technology to serve regions in the Bronx, Manhattan, Queens, and Jersey City, NJ. The organization's technical capabilities extend beyond its app and into the realm of data collection, as the company has stored detailed information on every bike ride since 2013. This faithful data collection offers a rich dataset for further analysis, covering everything from ride coordinates and station locations to rider demographics and trip durations.

However, for the purposes of our analysis, we would like to answer the question: How has Citi Bike ridership behavior changed since the start of the COVID-19 pandemic? More generally, we would like to use insights gained from our data to see if we can increase the safety of Citi Bike locations in terms of minimizing the spread of the virus. We hypothesize that there will be differences in the frequency of rides taken, in the most popular stations, and in ridership patterns with respect to the trips taken and their timing.


Libraries Used

We will be using a variety of different Python libraries throughout this tutorial. We have outlined the core libraries below:

  1. Pandas - This is a very popular data analysis tool that allows you to easily work with tabular data in Python. The heart of pandas lies in a DataFrame object, which is essentially the table holding all of your data. Pandas comes with a wide variety of built-in functions that allow you to perform many different operations on the data. We will make use of several of these functions throughout the tutorial.

  2. Plotly.py - This is an open source graphing library that allows you to create interactive visualizations such as line charts, bar charts, and histograms. All Plotly graphs can be hovered over and clicked on to show additional information. They can also be zoomed in and out based on the user's preference. Since the visualizations we will be making are relatively simple, we will be using Plotly Express, which is a shorter and simpler high-level version of Plotly.

  3. Folium - This is a powerful data visualization library that lets us plot coordinates on an interactive Leaflet map. These maps contain all the streets and landmarks you would be able to find on Google Maps, but with an added level of interactivity. Most of these maps require you to use latitude and longitude to annotate and add onto the base map.

  4. Scikit-Learn - This is an extremely popular machine learning and analysis library that provides simple and efficient tools used for predictive data analysis. It is open source, and is built upon several other popular Python libraries like NumPy, SciPy, and matplotlib.

All of the code and datasets used for this tutorial can be found at this repository.


Data Collection

In this data collection phase, we will focus on collecting and compiling our data in one place so that it is usable for our analysis.

We want to obtain information on all of the rides that have occurred within our designated time period using Citi Bike’s publicly available trip data. This data includes the trip duration, start and stop times, start and end station names and IDs, station latitudes and longitudes, bike ID, user type (customer or subscriber), gender, and birth year.

We have downloaded the csv files for every month from Jan 2019 - Oct 2020 (found here). To be able to proceed with our analysis, we need to combine these individual months into a single structure. Since each of these files can contain up to 50,000 entries, combining all of this data directly would not be possible due to memory constraints. To mitigate this issue, we randomly sample 50% of the data from each month when appending to our DataFrame.

We have included the code used to create our final DataFrame below. Since its construction is based on random sampling, we executed this code once to create our final dataset (citibike_compiled_data.csv). For the purpose of the tutorial, we will simply load the pre-created csv file into our DataFrame.

import os
import random

import pandas as pd

# extract all data files in the "datasets" folder (contains the downloaded .csvs for all months)
files = os.listdir("datasets")

# we will sample 50% of entries from each month
p = 0.50

# read each monthly file, randomly keeping each data row with probability p
# (the skiprows callable skips a row whenever the random draw exceeds p)
monthly = [
    pd.read_csv(f"datasets/{file}", skiprows=lambda i: i > 0 and random.random() > p)
    for file in files
]

# combine the sampled months into one master dataframe
# (DataFrame.append is deprecated, so we use pd.concat)
df = pd.concat(monthly, ignore_index=True)

# save the compiled dataframe as a csv
df.to_csv("citibike_compiled_data.csv", index=False)

The DataFrame "rides" now holds all of our ride data that we will continue to use throughout the rest of the tutorial. We can see that this dataset contains 354,305 rides ranging from 1/1/2019 - 10/31/2020.


Data Cleaning

Before we go on to analyze the data, we need to “fix” the organization and structure of the dataset. This process is known as “data tidying” or “data wrangling.”

Adjusting Data Types

Since we are dealing with data over time, we want to make sure that the time and date columns are easy to work with. We will make use of Python datetime objects to accomplish this.
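
A minimal sketch of this conversion using pandas' to_datetime, assuming the raw columns keep Citi Bike's starttime and stoptime names:

# parse the string timestamps into proper datetime objects
rides["starttime"] = pd.to_datetime(rides["starttime"])
rides["stoptime"] = pd.to_datetime(rides["stoptime"])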

We can now see that the starttime and stoptime columns are datetime objects.

Converting Trip Duration to Minutes

To obtain more useful and clear analysis and visualizations, we will convert the trip duration, which is currently given in seconds, into minutes. Since many trips are longer than 2 minutes, durations expressed in seconds quickly become hard to read and compare.
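
This conversion is a single division, assuming the duration column keeps its raw name tripduration:

# tripduration is reported in seconds; convert it to minutes
rides["tripduration"] = rides["tripduration"] / 60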

Removing Extraneous Trips

Another factor we want to consider is trip duration. Since we are trying to find trends associated with the trips, we want to remove, under reasonable assumptions, any trips that are likely outliers. For example, we will remove any trips under 2 minutes, as we assume these are mistakenly rented bikes, user error in the checkout process, or a user simply testing the bike. We will also remove any trips over 5 hours (300 minutes) in length, as we assume the bike was improperly checked in, stolen, lost, etc.

As we've seen in the previous section, the data given already contains a column for trip duration, so we do not need to calculate this value ourselves.
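
The filtering itself reduces to one boolean mask over the minute-based duration column (a sketch under the column-name assumptions above):

# keep only trips between 2 minutes and 5 hours (300 minutes) long
rides = rides[(rides["tripduration"] >= 2) & (rides["tripduration"] <= 300)]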

By filtering out these extraneous durations, we have eliminated 3,951 trips for a new dataset size of 350,354 rides.

Adding Additional Columns

Since we will be analyzing many of these rides grouped by their date and time, it is beneficial to add these columns as part of the data cleaning phase.

First, we will create a column for the date of the trip. We can reasonably assume that both the start time and stop time have the same base date attached to them as it is unlikely that there will be a trip spanning two days. Therefore, in order to make this date column we can use the start time column and extract the date from the datetime object we created earlier.

We will now use a very similar procedure to extract the month, year, and time of day, once again using properties of the datetime object.
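
A sketch of both steps is below; the exact time-of-day cutoffs are our own assumption, since any reasonable binning works:

# date of the trip (we assume start and stop fall on the same date)
rides["date"] = rides["starttime"].dt.date

# month and year, extracted from the same datetime object
rides["month"] = rides["starttime"].dt.month_name()
rides["year"] = rides["starttime"].dt.year

# bin the starting hour into a coarse "time of day" label (our own cutoffs)
def time_of_day(hour):
    if hour < 6:
        return "night"
    elif hour < 12:
        return "morning"
    elif hour < 18:
        return "afternoon"
    return "evening"

rides["time_of_day"] = rides["starttime"].dt.hour.map(time_of_day)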

As we can see in the resulting dataframe, we have successfully updated our data to include these columns.

Removing Extraneous Columns

We will now drop the user type (customer vs. subscriber) and bike id columns as they are not relevant to our analysis. In general, it is a good idea to make sure the working data is as concise as possible to minimize confusion and add clarity.
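
Assuming the raw column names usertype and bikeid, the drop looks like this:

# remove columns that are not relevant to our analysis
rides = rides.drop(columns=["usertype", "bikeid"])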

Creating Helper DataFrame of Station Mappings

Since our dataset contains both station ID and station name, it can be helpful to create a dataframe that will map a given station ID to its respective name. This will help us in the future when we need to use numeric data for our analysis.
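
One way to build this mapping (a sketch using the raw station columns):

# one row per station: maps a station id to its human-readable name
station_names = (
    rides[["start station id", "start station name"]]
    .drop_duplicates("start station id")
    .reset_index(drop=True)
)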

Dividing Dataframe by Year

In order to simplify further analysis, we will create two additional dataframes - one for all rides in 2019 and one for all rides in 2020.
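
Since we added a year column during cleaning, this is just two filters:

# split the cleaned data by year for side-by-side analysis
rides_2019 = rides[rides["year"] == 2019]
rides_2020 = rides[rides["year"] == 2020]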


Data Exploration

We can now progress to the "data exploration" phase of our analysis. This stage will offer insights into underlying trends in our data, as well as a starting point for further exploration and analysis. We seek to better understand our data and reassess any assumptions we may have implicitly made about the dataset, and to clarify any misunderstandings before our data analysis section.

Descriptive Statistics

Descriptive statistics will make the data visualization process simpler and more effective. Here we seek to understand the data in a meaningful way that will allow for a simplified interpretation of the Citi Bike dataset overall. We also want to make sure that we do not distort our original data or overshadow any important details. Luckily, pandas has a describe function built into its library that we can make use of to get these statistics.
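
For example, calling describe on each year's rides summarizes every numeric column at once:

# count, mean, std, and quartiles for trip duration, birth year, etc.
print(rides_2019.describe())
print(rides_2020.describe())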

From the 2019 data output, we can see that the mean trip duration is 9.45 minutes. This is much lower than in the 2020 data, which has a mean of 19.84 minutes. However, it is important to note that the standard deviations for these groups are 13.55 and 26.05 minutes respectively. This means that the data is very spread out rather than clustered around the mean, so using the mean as a statistic to represent the entire dataset may be misleading. Visit this resource to find out more about the role standard deviation plays when analyzing data.

When taking into consideration the year the customers were born, we can see that the range is 1888 - 2004. If a customer was truly born in 1888, they would currently be 132 years old. This is interesting to note, as it shows that users may enter incorrect birth dates. From a data science perspective, this also shows that one cannot rely on user-inputted data to be accurate, and must be careful about making assumptions on data that may not be correct to begin with. Thus, for our purposes, we will not be further exploring birth year or using it in our analysis.

People Analysis

Here we would like to gain some insights into the demographics that make up our population. As mentioned above, we will not be looking into the ages of our users due to inaccurate birth dates. However, we can take a look at the gender makeup of the user population.

We can use a grouped bar chart from Plotly to see the breakdown by gender for both 2019 and 2020. Coloring by year and grouping the bars lets us see all of the data on a single graph, as opposed to spreading the counts across multiple graphs.
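
A sketch of this chart with Plotly Express, assuming the gender column keeps Citi Bike's 0/1/2 coding:

import plotly.express as px

# count rides by gender within each year
gender_counts = rides.groupby(["year", "gender"]).size().reset_index(name="count")

# cast year to string so Plotly treats it as a discrete color group
gender_counts["year"] = gender_counts["year"].astype(str)

fig = px.bar(gender_counts, x="gender", y="count", color="year", barmode="group")
fig.show()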

With 1 representing males, 2 representing females, and 0 representing unknown genders, it appears that in both years there is a higher proportion of male riders than female or unknown-gender riders.

Mapping Stations

Since we will be conducting some analysis into the popularity of different stations, and since this data revolves around stations and Citi Bike trips, let’s map out where these stations are.

We'll start by creating a dataframe of all of the unique start stations, and all of the unique end stations.

From here, we can extract the respective latitudes and longitudes to position our markers, and then use Folium to display it on a map. Each marker is clickable to reveal the name of the station it represents.
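
A sketch of the marker map for start stations (end stations work identically); the map center is our own choice:

import folium

# unique start stations with their coordinates
stations = rides[["start station name", "start station latitude",
                  "start station longitude"]].drop_duplicates()

# center the map between Jersey City and lower Manhattan
m = folium.Map(location=[40.72, -74.04], zoom_start=13)

# one clickable marker per station, with its name as the popup
for _, row in stations.iterrows():
    folium.Marker(
        location=[row["start station latitude"], row["start station longitude"]],
        popup=row["start station name"],
    ).add_to(m)

m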

From this map, we can see that there is a large cluster of stations just under Hoboken, and several stations spread out over New York City. Based on this, we can probably expect to see a lot of activity in those areas, since stations are presumably placed where ridership is high.

Trip Durations

From our descriptive statistics, we know that there is a high standard deviation for the trip durations in both years. In order to see why this is the case, let's look at the distribution of trip durations in a histogram. Histograms are useful for summarizing numerical data by showing how many points fall within a specific range of values (learn more here).

Once again, we will use Plotly to show the histograms for each year overlaid on each other so we can compare both years with a mutual axis.
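
A sketch of the overlaid histograms; the bin count and opacity are arbitrary choices:

# overlay the trip duration distributions of the two years
fig = px.histogram(rides, x="tripduration", color=rides["year"].astype(str),
                   barmode="overlay", nbins=75, opacity=0.65)
fig.show()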

We can see that trip durations spread from 0 - 300 minutes, with most trips falling under 1 hour in length. The most common trip duration range was 5 - 9 minutes, which makes sense since getting from one station to another by bike should not take too long. The reason our standard deviation was so high is that although most trips are concentrated under an hour, there are still many trips falling between 1 - 5 hours.

The output above also yields some interesting insight on the difference in duration between the two years. Because there were fewer rides overall in 2020, we would expect the counts for the red histogram (2020) to always lie under the purple histogram. However, for trip durations ranging from roughly 15 to 120 minutes, the count for 2020 is actually higher than the count for 2019. This shows us that in 2020, people tended to take longer rides than in 2019.

Trip Analysis

The last part of our exploratory analysis will consist of figuring out what the top 10 trips for each year are. Since each ride has a start and stop station, we define a trip as the ride from Station A to Station B. Some of these trips can be loops, indicating that the user started and returned to the same station in their trip.

We can find the top trips using pandas' groupby, which allows you to group entries in a DataFrame by certain values and perform operations on the groups.
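
For example, the 2019 counts can be computed as follows (2020 is identical with rides_2020):

# count rides for each (start, end) station pair and keep the 10 largest
top_trips_2019 = (
    rides_2019.groupby(["start station name", "end station name"])
    .size()
    .sort_values(ascending=False)
    .head(10)
)
print(top_trips_2019)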

It looks like the station "Grove St PATH" was extremely popular in 2019, but not so much in 2020. It's possible that this station is in a very popular area that was much less populated during the pandemic. It is also interesting to note that almost all of the top trips in 2020 are loops. This could be because riders in 2019 used Citi Bike to get from one place to another, but riders in 2020 were more along the lines of "joyriders" who went for a quick ride and returned to where they started. This could be due to the fact that office buildings were closed with the lockdown, so more users were visitors to NYC who fell under a "tourist" label.


Data Analysis

Now that we have a much better understanding of what our data contains, we can go ahead and begin the "data analysis" stage. This is the core stage in the data science pipeline; it involves analyzing and modeling the data so that we can evaluate our original hypothesis.

Timeseries

Timeseries visualizations are a good way to identify trends in your data as time goes on. They allow us to see the direction or movement of certain variables in our data, and offer insights into areas to focus our analysis on. It is important to note that a time series with many dates can have a lot of fluctuation, but it is still useful for spotting trends.

Let's make a timeseries for our trip data. We will once again use groupby to get our counts per date, and we'll use Plotly to visualize the graph.
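
A sketch of the plot, reusing the date column added during cleaning:

# number of rides per calendar date
daily_counts = rides.groupby("date").size().reset_index(name="rides")

fig = px.line(daily_counts, x="date", y="rides")
fig.show()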

From our timeseries plot, we can make a few key observations. First, in 2019, we can see that there is a spike in ridership in the summer months, possibly due to good weather. Then, there is a dip on September 2nd, or Labor Day. Here we might hypothesize that this is due to people being at home with family. In terms of seasonality, we see that there is a downward trend in ridership as the weather gets colder in December.

In 2020, we can see some interesting deviations from our 2019 trends, most of which can be mapped to different phases of NYC's plan to halt the spread of the COVID-19 virus (learn more here). March 23rd was NYC’s first official day of lockdown, and we can see in our plot that this was the lowest point in the time series with only 24 rides. The fact that there are any rides at all may be due to essential workers. The months of March and April were uncharacteristically low, followed by a spike in June. This spike can be tied to phase one of NYC's reopening plan, which began on June 8th.

One of the most interesting parts of this visualization is the very large spike in the beginning of October 2020. These few days had the highest count of trips even when compared to 2019. However, when looking at NYC's reopening plan, we can see that phase four of the plan commenced on October 8th, marking the opening of museums, botanical gardens, and gyms. With thousands of people stuck at home for months beforehand, this reopening could have led to a very large spike in visits to NYC, thus increasing the ridership significantly.

Now that we have established the raw count differences over time, let's see if we can break them down by location and analyze the popularity of different trips and stations more in depth. In the exploration phase, we did some elementary counts on the most popular stations and trips, but in order to see where ridership is physically concentrated and how it changes over time, let’s create some maps using Folium.

Even though we listed the 10 most popular trips in the exploratory phase, it can be extremely helpful to be able to visualize their locations. We'll be using Folium's PolyLine feature to connect two station markers.
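
A sketch of the 2019 version, reusing the unique-stations dataframe from the exploration phase; the coordinate lookup dictionary is our own helper:

# station name -> (lat, lon)
station_coords = {
    row["start station name"]: (row["start station latitude"],
                                row["start station longitude"])
    for _, row in stations.iterrows()
}

m = folium.Map(location=[40.72, -74.04], zoom_start=13)

# connect the endpoints of each top trip (loops collapse to a single point)
for start, end in top_trips_2019.index:
    if start in station_coords and end in station_coords:
        folium.PolyLine([station_coords[start], station_coords[end]],
                        weight=4).add_to(m)

m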

As we had seen in the chart of trips from our exploratory analysis, the top 5 paths all pass through Grove St PATH. From the map we can see that this station is in a very popular and central location. It is likely that this station is located near a lot of popular spots in the city, which leads to a higher ridership.

Unlike 2019, 2020 does not have any trips going through Grove St PATH. All of the popular trips are loops, and we can further justify our claim that the riders in 2020 are more likely to be "joyriders." If we look at where some of the stations are located, we can see that the stations are right next to Lincoln Park, Liberty State Park, Alexander F Santora Park, St Peters Field, and Elephant Park. These proximities lead us to believe that the users using these bikes are most likely taking a ride around the park to get some fresh air during lockdown.

Overall, this visualization helps us see the difference in the types of users based on their potential motives for using Citi Bike's services.

Now that we've seen the popular trips, let's break it down even further by station. First, we will create a heatmap for each year showing the concentration of ridership in the NYC area. Heatmaps are a great way to see clustering and magnitude of data using color. You can learn more about heatmaps here.
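
Folium's HeatMap plugin takes a list of [lat, lon] points; a sketch for 2020 (2019 is analogous):

from folium.plugins import HeatMap

m = folium.Map(location=[40.72, -74.04], zoom_start=13)

# every ride start contributes one point, so busier stations glow hotter
points = rides_2020[["start station latitude",
                     "start station longitude"]].values.tolist()
HeatMap(points).add_to(m)

m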

To culminate our final analysis, we will now try to see if we can actually predict what the most popular station will be based on the time of day and month. This is the "Machine Learning" stage in our data science process, which refers to using artificial intelligence (AI) to give systems the ability to learn and improve from training data without being explicitly programmed. The machine is essentially “learning” from the training data provided. This will enable us to see if we can apply certain tools to get the computer to accurately predict some variable.

For our prediction analysis, we are attempting to answer: Can we predict the most popular start station for a given time of day and month of year? Thus, our predictors are the time of day and the month of year, and our response variable is the most popular start station.

For this tutorial, we will be making use of the following tools from scikit-learn: LabelEncoder and train_test_split to prepare the data, and MLPClassifier and accuracy_score to model and score the predictions. The purpose of each of these tools will be explained throughout this section.

In order to input our data into a predictive model, we first need to format and prepare it. This is also known as “data preprocessing.” First, we create a column for the numerical form of the month, since most models do not accept categorical data in string format. From there, we want to create a DataFrame where each row has the following columns: month, time of day, and the most popular station for that month and time of day. After dropping all null values (created when no station fits a given month and time combination), we use LabelEncoder to standardize the data, which brings the four-digit station IDs down to small consecutive integers starting from 0. After following these steps, we have a DataFrame ready to be fed into our predictive model.
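
A sketch of this preprocessing as a reusable function; the helper name, the reliance on our time_of_day binning from the cleaning phase, and the decision to also encode the time-of-day labels are our own choices:

from sklearn.preprocessing import LabelEncoder

def preprocess(rides_df):
    # numerical month, since the model cannot take month names directly
    rides_df = rides_df.assign(month_num=rides_df["starttime"].dt.month)

    # most popular start station for every (month, time of day) combination
    data = (
        rides_df.groupby(["month_num", "time_of_day"])["start station id"]
        .agg(lambda s: s.value_counts().idxmax())
        .reset_index(name="popular_station")
        .dropna()
    )

    # encode labels as small consecutive integers starting from 0
    data["time_of_day"] = LabelEncoder().fit_transform(data["time_of_day"])
    station_le = LabelEncoder()
    data["popular_station"] = station_le.fit_transform(data["popular_station"])
    return data, station_le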

We are now ready to start inputting our data into a model. A standard practice in machine learning is to split the data into a “training” and a “testing” set. This allows us to set aside a small portion (in our case 20%) of the data for evaluation purposes and use the remaining 80% to actually train the model, which prevents us from testing the model on the very data it was trained on. The train_test_split function in scikit-learn helps us easily create these partitions. Note that we pass in test_size = 0.2 to indicate an 80/20 split.
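
A sketch of the split, wrapped in a small helper of our own (the fixed random_state simply makes the split repeatable):

from sklearn.model_selection import train_test_split

def split_data(data):
    # predictors: month and time of day; response: encoded popular station
    X = data[["month_num", "time_of_day"]]
    y = data["popular_station"]
    return train_test_split(X, y, test_size=0.2, random_state=42)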

For our model itself, we will be using an MLPClassifier, which stands for Multi-layer Perceptron classifier, a type of feedforward neural network that performs classification. Training is done via backpropagation; if you are interested in learning more, click here. We will pass in our training X and y data in order to train (or “fit”) the model, and then predict using the testing X. After the predictions are made, we compare these values to our testing y values using scikit-learn’s accuracy_score.
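
A sketch of the fit-and-score step; the max_iter value is our own choice to help the optimizer converge on this small dataset:

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def fit_and_score(X_train, X_test, y_train, y_test):
    # train the multi-layer perceptron on the training partition
    model = MLPClassifier(max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    # score the predictions against the held-out labels
    return model, accuracy_score(y_test, model.predict(X_test))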

Since the output of the actual prediction is an encoded label representing the station ID, it can be helpful to transform this value into the actual station name to better understand our results. We have created a function to output this transformed dataframe so that we can see, for a particular month and time of day, which station our model predicted to be the most popular.
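
A sketch of such a helper, combining the fitted LabelEncoder with the id-to-name mapping dataframe built during cleaning:

def decode_predictions(model, X, station_le, station_names):
    # undo the label encoding to recover the original station ids
    station_ids = station_le.inverse_transform(model.predict(X))

    # look up each id's human-readable name
    names = station_names.set_index("start station id")["start station name"]

    out = X.copy()
    out["predicted_station"] = names.reindex(station_ids).values
    return out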

Now that our functions for this prediction process have been defined, let's actually run them on both the 2019 and 2020 datasets.

As explained above, we begin by preprocessing the data.
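
Using the sketch helpers defined above:

# build the (month, time of day) -> popular station tables for each year
data_2019, le_2019 = preprocess(rides_2019)
data_2020, le_2020 = preprocess(rides_2020)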

With our prepared datasets, we can now create our model and check its accuracy.
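
Again using our sketch helpers:

# split, train, and score one model per year
model_2019, acc_2019 = fit_and_score(*split_data(data_2019))
model_2020, acc_2020 = fit_and_score(*split_data(data_2020))
print(acc_2019, acc_2020)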

The accuracy for the MLPClassifier on the 2019 dataset is 79%, whereas it is only 45% for the 2020 dataset. This may be due to the fewer rides taken and fewer months covered (no November or December data) in the 2020 dataset, which results in less training data for the model. Another possible reason could be that time of day and month no longer explain the variance in the most popular station in 2020: people are no longer on a regular schedule for commuting to work or school, nor are their rides restricted to hours outside of work or school, as schedules have become more blurred. In 2020, as suggested by our analysis in the maps above, it may be the case that bike rides are "joyrides," and so the time of day does not impact the most popular station as consistently as it once did. Regardless, we can still see which stations consistently appear in our predictions for the most popular station for a given month and time.

Our predicted stations for 2019 seem to show that in general, Grove St PATH is the most popular station. This matches our analysis from the previous sections.

Although the model for the 2020 data is not as accurate as the one for 2019, we can see that Liberty Light Rail appears very frequently in this output. This matches our previous analysis of the generally popular stations.

This predictive model helps us expand upon our previous analysis and actually see what times of day and which months the stations are more popular. It helps us gain a much better understanding of the data for each year, and together with our prior analysis helps us to understand the difference between ridership in 2019 and 2020.


Conclusion

From our analysis, we have uncovered a few changes in Citi Bike ridership between 2019 and 2020. First, we noticed that ridership dipped after the statewide shutdown in New York, and rose again after the reopening phases. We were also able to visualize the changing concentrations of station traffic from 2019 to 2020 with our timed heatmap. From other map plots, we were able to see how in 2020 the most popular stations were closer to parks and the most popular trips consisted of many more loops, which may allude to more "joyrides," or rides taken for leisure or fun.

Most importantly, we saw some differences in our predictions of the most popular start station at a given time of day and month of year. This information can be leveraged, especially when it comes to public safety. More hand sanitizing stations can be placed at the most popular stations, and bike sanitization times for each station can be scheduled around the times when that station is most popular. This can decrease the spread of germs and reduce the number of people exposed to previous riders' germs. Other uses of these predictions include redistributing bikes so that the most popular stations have sufficient bikes at the times of day they are needed most, or removing bikes from the least popular stations at a given month and time. The applications of this analysis truly lie in understanding the changing dynamics of bike ridership, resource allocation, bike sanitization, and public safety. In the face of COVID-19, public points of contact are possible sources of spread of the virus, and this data can help minimize the risk associated with those sources.