Amid our nation's ongoing battle with the COVID-19 public health crisis, the Citi Bike electric bike rental stations in New York City, NY remain in business. These bikes are points of human contact and can unfortunately serve as vectors for the spread of germs and viruses such as COVID-19. Citi Bike is privately owned by Lyft but operates as a public bike-sharing system that leverages mobile technology to serve the Bronx, Manhattan, Queens, and Jersey City, NJ. The organization's technical capabilities extend beyond its app and into data collection: the company has stored detailed information on all bike rides since 2013. This faithful record-keeping offers a rich dataset for further analysis, yielding insights on everything from station coordinates and locations to rider gender and trip durations.
However, for the purposes of our analysis, we would like to answer the question: how has Citi Bike ridership behavior changed since the start of the COVID-19 pandemic? More generally, we would like to use insights gained from our data to see whether we can make Citi Bike locations safer by minimizing the spread of the virus. We hypothesize that there will be differences in the frequency of rides taken, the most popular stations, and ridership patterns with respect to the trips taken and their timing.
We will be using a variety of different Python libraries throughout this tutorial. We have outlined the core libraries below:
Pandas - This is a very popular data analysis tool that allows you to easily work with tabular data in Python. The heart of pandas lies in a DataFrame object, which is essentially the table holding all of your data. Pandas comes with a wide variety of built-in functions that allow you to perform many different operations on the data. We will make use of several of these functions throughout the tutorial.
Plotly.py - This is an open source graphing library that allows you to create interactive visualizations such as line charts, bar charts, and histograms. All Plotly graphs can be hovered over and clicked on to show additional information, and can be zoomed in and out at the user's preference. Since the visualizations we will be making are relatively simple, we will use Plotly Express, a shorter and simpler high-level interface to Plotly.
Folium - This is a powerful data visualization library that lets us plot coordinates on an interactive Leaflet map. These maps contain all the streets and landmarks you would be able to find on Google Maps, but with an added level of interactivity. Most of these maps require you to use latitude and longitude to annotate and add onto the base map.
Scikit-Learn - This is an extremely popular machine learning and analysis library that provides simple and efficient tools used for predictive data analysis. It is open source, and is built upon several other popular Python libraries like NumPy, SciPy, and matplotlib.
All of the code and datasets used for this tutorial can be found at this repository.
# importing required libraries
import pandas as pd
import numpy as np
import plotly.express as px
import datetime as dt
import folium
# silencing warnings to prevent clutter in output
import warnings
warnings.filterwarnings('ignore')
In this data collection phase, we will focus on collecting and compiling our data in one place so that it is usable for our analysis.
We want to obtain information on all of the rides that have occurred within our designated time period using Citi Bike’s publicly available trip data. This data includes:
We have downloaded the CSV files for every month from Jan 2019 - Oct 2020 (found here). To proceed with our analysis, we need to combine these individual months into a single structure. Since each of these files can contain up to 50,000 entries, combining all of this data directly would not be feasible due to memory constraints. To work within these constraints, we randomly sample 50% of the data from each month when appending to our DataFrame.
We have included the code used to create our final DataFrame below. Since its construction is based on random sampling, we ran this code once to create our final dataset (citibike_compiled_data.csv). For the purposes of this tutorial, we will simply load the pre-created CSV file into our DataFrame.
import random
import os
# extract all data files in "datasets" folder (contains downloaded .csvs for all months)
files = os.listdir("datasets")
df = pd.DataFrame()
# we will sample 50% of entries from each month
p = 0.50
# loop through all files, random sample entries, and append to the master dataframe
for file in files:
    df = df.append(pd.read_csv(f"datasets/{file}", skiprows=lambda i: i > 0 and random.random() > p), ignore_index=True)
# save dataframe as csv
df.to_csv("citibike_compiled_data.csv", index=False)
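Note that DataFrame.append was removed in pandas 2.0. On newer pandas, the same compilation step can be sketched with pd.concat; in the sketch below, two tiny in-memory CSVs stand in for the downloaded monthly files:

```python
import io
import pandas as pd

# DataFrame.append was removed in pandas 2.0; pd.concat is the modern
# equivalent. Two tiny in-memory "monthly" CSVs stand in for the real
# files in the datasets folder.
months = [io.StringIO("tripduration\n100\n200\n"),
          io.StringIO("tripduration\n300\n")]
# add skiprows=... inside read_csv here to reproduce the 50% sampling
frames = [pd.read_csv(m) for m in months]
df = pd.concat(frames, ignore_index=True)
print(len(df))  # 3
```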
# read in the saved csv created from the compilation step
rides = pd.read_csv('citibike_compiled_data.csv')
rides
tripduration | starttime | stoptime | start station id | start station name | start station latitude | start station longitude | end station id | end station name | end station latitude | end station longitude | bikeid | usertype | birth year | gender | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1575 | 2019-01-01 12:43:38.6430 | 2019-01-01 13:09:54.5280 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3638 | Washington St | 40.724294 | -74.035483 | 29672 | Customer | 1969 | 0 |
1 | 737 | 2019-01-01 12:56:53.2040 | 2019-01-01 13:09:11.0400 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3205 | JC Medical Center | 40.716540 | -74.049638 | 29447 | Subscriber | 1993 | 1 |
2 | 917 | 2019-01-01 13:03:44.7760 | 2019-01-01 13:19:02.7690 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3277 | Communipaw & Berry Lane | 40.714358 | -74.066611 | 29299 | Subscriber | 1986 | 1 |
3 | 3248 | 2019-01-01 13:12:03.1280 | 2019-01-01 14:06:12.0400 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3196 | Riverview Park | 40.744319 | -74.043991 | 29495 | Subscriber | 1992 | 1 |
4 | 3168 | 2019-01-01 13:13:12.0450 | 2019-01-01 14:06:00.4110 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3196 | Riverview Park | 40.744319 | -74.043991 | 26312 | Customer | 1969 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
354300 | 131 | 2020-10-31 23:24:41.8040 | 2020-10-31 23:26:53.2340 | 3681 | Grand St | 40.715178 | -74.037683 | 3185 | City Hall | 40.717732 | -74.043845 | 46532 | Subscriber | 1981 | 1 |
354301 | 338 | 2020-10-31 23:29:49.3050 | 2020-10-31 23:35:27.8680 | 3195 | Sip Ave | 40.730897 | -74.063913 | 3679 | Bergen Ave | 40.722104 | -74.071455 | 26139 | Subscriber | 1999 | 1 |
354302 | 1054 | 2020-10-31 23:32:22.8650 | 2020-10-31 23:49:57.5320 | 3198 | Heights Elevator | 40.748716 | -74.040443 | 3678 | Fairmount Ave | 40.725726 | -74.071959 | 17231 | Customer | 1969 | 0 |
354303 | 156 | 2020-10-31 23:33:22.7310 | 2020-10-31 23:35:59.4930 | 3185 | City Hall | 40.717732 | -74.043845 | 3681 | Grand St | 40.715178 | -74.037683 | 46532 | Subscriber | 1981 | 1 |
354304 | 395 | 2020-10-31 23:40:06.3530 | 2020-10-31 23:46:42.0640 | 3481 | York St | 40.716490 | -74.041050 | 3273 | Manila & 1st | 40.721651 | -74.042884 | 45647 | Subscriber | 1967 | 1 |
354305 rows × 15 columns
The DataFrame "rides" now holds all of our ride data that we will continue to use throughout the rest of the tutorial. We can see that this dataset contains 354,305 rides ranging from 1/1/2019 - 10/31/2020.
Before we go on to analyze the data, we need to “fix” the organization and structure of the dataset. This process is known as “data tidying” or “data wrangling.”
Since we are dealing with data over time, we want to make sure that the time and date columns are easy to work with. We will make use of Python datetime objects to accomplish this.
# use built-in pandas to_datetime() to convert column types
rides["starttime"] = pd.to_datetime(rides.starttime)
rides["stoptime"] = pd.to_datetime(rides.stoptime)
# print types for all columns in dataframe
rides.dtypes
tripduration                       int64
starttime                 datetime64[ns]
stoptime                  datetime64[ns]
start station id                   int64
start station name                object
start station latitude           float64
start station longitude          float64
end station id                     int64
end station name                  object
end station latitude             float64
end station longitude            float64
bikeid                             int64
usertype                          object
birth year                         int64
gender                             int64
dtype: object
We can now see that the starttime and stoptime columns are datetime objects.
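As an aside, parsing large files with to_datetime can be sped up considerably by passing an explicit format string. A small sketch using one timestamp in the same format as the Citi Bike files:

```python
import pandas as pd

# Passing format= lets pandas skip format inference on every row.
# The Citi Bike files use timestamps like "2019-01-01 12:43:38.6430".
s = pd.Series(["2019-01-01 12:43:38.6430"])
parsed = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f")
print(parsed.dt.year.iloc[0])  # 2019
```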
To obtain clearer analysis and visualizations, we will convert the trip duration, which is currently given in seconds, into minutes. Since many trips last well beyond 2 minutes, durations expressed in seconds quickly become hard to read and compare.
# divide seconds by 60 to get number of minutes
rides['tripduration_minutes'] = rides['tripduration'].astype(int)/60
# round to the nearest minute
rides['tripduration_minutes'] = round(rides['tripduration_minutes'])
# convert the column to integer type
rides['tripduration_minutes'] = rides['tripduration_minutes'].astype(int)
rides
tripduration | starttime | stoptime | start station id | start station name | start station latitude | start station longitude | end station id | end station name | end station latitude | end station longitude | bikeid | usertype | birth year | gender | tripduration_minutes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1575 | 2019-01-01 12:43:38.643 | 2019-01-01 13:09:54.528 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3638 | Washington St | 40.724294 | -74.035483 | 29672 | Customer | 1969 | 0 | 26 |
1 | 737 | 2019-01-01 12:56:53.204 | 2019-01-01 13:09:11.040 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3205 | JC Medical Center | 40.716540 | -74.049638 | 29447 | Subscriber | 1993 | 1 | 12 |
2 | 917 | 2019-01-01 13:03:44.776 | 2019-01-01 13:19:02.769 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3277 | Communipaw & Berry Lane | 40.714358 | -74.066611 | 29299 | Subscriber | 1986 | 1 | 15 |
3 | 3248 | 2019-01-01 13:12:03.128 | 2019-01-01 14:06:12.040 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3196 | Riverview Park | 40.744319 | -74.043991 | 29495 | Subscriber | 1992 | 1 | 54 |
4 | 3168 | 2019-01-01 13:13:12.045 | 2019-01-01 14:06:00.411 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3196 | Riverview Park | 40.744319 | -74.043991 | 26312 | Customer | 1969 | 0 | 53 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
354300 | 131 | 2020-10-31 23:24:41.804 | 2020-10-31 23:26:53.234 | 3681 | Grand St | 40.715178 | -74.037683 | 3185 | City Hall | 40.717732 | -74.043845 | 46532 | Subscriber | 1981 | 1 | 2 |
354301 | 338 | 2020-10-31 23:29:49.305 | 2020-10-31 23:35:27.868 | 3195 | Sip Ave | 40.730897 | -74.063913 | 3679 | Bergen Ave | 40.722104 | -74.071455 | 26139 | Subscriber | 1999 | 1 | 6 |
354302 | 1054 | 2020-10-31 23:32:22.865 | 2020-10-31 23:49:57.532 | 3198 | Heights Elevator | 40.748716 | -74.040443 | 3678 | Fairmount Ave | 40.725726 | -74.071959 | 17231 | Customer | 1969 | 0 | 18 |
354303 | 156 | 2020-10-31 23:33:22.731 | 2020-10-31 23:35:59.493 | 3185 | City Hall | 40.717732 | -74.043845 | 3681 | Grand St | 40.715178 | -74.037683 | 46532 | Subscriber | 1981 | 1 | 3 |
354304 | 395 | 2020-10-31 23:40:06.353 | 2020-10-31 23:46:42.064 | 3481 | York St | 40.716490 | -74.041050 | 3273 | Manila & 1st | 40.721651 | -74.042884 | 45647 | Subscriber | 1967 | 1 | 7 |
354305 rows × 16 columns
Another factor we want to consider is trip duration. Since we are trying to find trends in the trips, we want to remove any trips that are likely outliers, using reasonable assumptions. For example, we will remove any trips under 2 minutes, as we assume these represent mistakenly rented bikes, user error in the checkout process, or a user simply testing the bike. We will also remove any trips over 5 hours (300 minutes) in length, as we assume the bike was improperly checked in, stolen, lost, etc.
As we've seen in the previous section, the data given already contains a column for trip duration, so we do not need to calculate this value ourselves.
# only select trips matching the specified duration range
rides = rides[(rides.tripduration_minutes >= 2) & (rides.tripduration_minutes <= 300)]
rides = rides.reset_index(drop=True)
rides
tripduration | starttime | stoptime | start station id | start station name | start station latitude | start station longitude | end station id | end station name | end station latitude | end station longitude | bikeid | usertype | birth year | gender | tripduration_minutes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1575 | 2019-01-01 12:43:38.643 | 2019-01-01 13:09:54.528 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3638 | Washington St | 40.724294 | -74.035483 | 29672 | Customer | 1969 | 0 | 26 |
1 | 737 | 2019-01-01 12:56:53.204 | 2019-01-01 13:09:11.040 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3205 | JC Medical Center | 40.716540 | -74.049638 | 29447 | Subscriber | 1993 | 1 | 12 |
2 | 917 | 2019-01-01 13:03:44.776 | 2019-01-01 13:19:02.769 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3277 | Communipaw & Berry Lane | 40.714358 | -74.066611 | 29299 | Subscriber | 1986 | 1 | 15 |
3 | 3248 | 2019-01-01 13:12:03.128 | 2019-01-01 14:06:12.040 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3196 | Riverview Park | 40.744319 | -74.043991 | 29495 | Subscriber | 1992 | 1 | 54 |
4 | 3168 | 2019-01-01 13:13:12.045 | 2019-01-01 14:06:00.411 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3196 | Riverview Park | 40.744319 | -74.043991 | 26312 | Customer | 1969 | 0 | 53 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
350349 | 131 | 2020-10-31 23:24:41.804 | 2020-10-31 23:26:53.234 | 3681 | Grand St | 40.715178 | -74.037683 | 3185 | City Hall | 40.717732 | -74.043845 | 46532 | Subscriber | 1981 | 1 | 2 |
350350 | 338 | 2020-10-31 23:29:49.305 | 2020-10-31 23:35:27.868 | 3195 | Sip Ave | 40.730897 | -74.063913 | 3679 | Bergen Ave | 40.722104 | -74.071455 | 26139 | Subscriber | 1999 | 1 | 6 |
350351 | 1054 | 2020-10-31 23:32:22.865 | 2020-10-31 23:49:57.532 | 3198 | Heights Elevator | 40.748716 | -74.040443 | 3678 | Fairmount Ave | 40.725726 | -74.071959 | 17231 | Customer | 1969 | 0 | 18 |
350352 | 156 | 2020-10-31 23:33:22.731 | 2020-10-31 23:35:59.493 | 3185 | City Hall | 40.717732 | -74.043845 | 3681 | Grand St | 40.715178 | -74.037683 | 46532 | Subscriber | 1981 | 1 | 3 |
350353 | 395 | 2020-10-31 23:40:06.353 | 2020-10-31 23:46:42.064 | 3481 | York St | 40.716490 | -74.041050 | 3273 | Manila & 1st | 40.721651 | -74.042884 | 45647 | Subscriber | 1967 | 1 | 7 |
350354 rows × 16 columns
By filtering out these extraneous durations, we have eliminated 3,951 trips for a new dataset size of 350,354 rides.
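As an aside, the same range filter can be written with the Series.between method, which is inclusive on both ends by default. A small sketch on toy values:

```python
import pandas as pd

# between(2, 300) keeps values in [2, 300], matching the boolean-mask
# filter used above on tripduration_minutes.
s = pd.Series([1, 2, 150, 301])
print(s.between(2, 300).tolist())  # [False, True, True, False]
```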
Since we will be analyzing many of these rides grouped by their date and time, it is beneficial to add these columns as part of the data cleaning phase.
First, we will create a column for the date of the trip. We can reasonably assume that both the start time and stop time have the same base date attached to them as it is unlikely that there will be a trip spanning two days. Therefore, in order to make this date column we can use the start time column and extract the date from the datetime object we created earlier.
rides["trip_date"] = rides['starttime'].dt.date
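The assumption above can be checked directly by comparing the start and stop dates. A small sketch on toy data with one ride that crosses midnight:

```python
import pandas as pd

# Count rides whose start and stop fall on different calendar dates,
# i.e. trips that cross midnight (toy data, not the real dataset).
df = pd.DataFrame({
    "starttime": pd.to_datetime(["2019-01-01 23:50", "2019-01-02 10:00"]),
    "stoptime":  pd.to_datetime(["2019-01-02 00:10", "2019-01-02 10:20"]),
})
crosses = (df["starttime"].dt.date != df["stoptime"].dt.date).sum()
print(crosses)  # 1
```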
We will now use a very similar procedure to extract the month, year, and time of day, once again using properties of the datetime object.
rides["trip_month"] = rides['starttime'].dt.month_name()
rides["trip_year"] = rides['starttime'].dt.year
rides["trip_tod"] = rides['starttime'].dt.hour
As we can see in the output below, we have successfully updated our dataframe to include these columns.
rides
tripduration | starttime | stoptime | start station id | start station name | start station latitude | start station longitude | end station id | end station name | end station latitude | end station longitude | bikeid | usertype | birth year | gender | tripduration_minutes | trip_date | trip_month | trip_year | trip_tod | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1575 | 2019-01-01 12:43:38.643 | 2019-01-01 13:09:54.528 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3638 | Washington St | 40.724294 | -74.035483 | 29672 | Customer | 1969 | 0 | 26 | 2019-01-01 | January | 2019 | 12 |
1 | 737 | 2019-01-01 12:56:53.204 | 2019-01-01 13:09:11.040 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3205 | JC Medical Center | 40.716540 | -74.049638 | 29447 | Subscriber | 1993 | 1 | 12 | 2019-01-01 | January | 2019 | 12 |
2 | 917 | 2019-01-01 13:03:44.776 | 2019-01-01 13:19:02.769 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3277 | Communipaw & Berry Lane | 40.714358 | -74.066611 | 29299 | Subscriber | 1986 | 1 | 15 | 2019-01-01 | January | 2019 | 13 |
3 | 3248 | 2019-01-01 13:12:03.128 | 2019-01-01 14:06:12.040 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3196 | Riverview Park | 40.744319 | -74.043991 | 29495 | Subscriber | 1992 | 1 | 54 | 2019-01-01 | January | 2019 | 13 |
4 | 3168 | 2019-01-01 13:13:12.045 | 2019-01-01 14:06:00.411 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3196 | Riverview Park | 40.744319 | -74.043991 | 26312 | Customer | 1969 | 0 | 53 | 2019-01-01 | January | 2019 | 13 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
350349 | 131 | 2020-10-31 23:24:41.804 | 2020-10-31 23:26:53.234 | 3681 | Grand St | 40.715178 | -74.037683 | 3185 | City Hall | 40.717732 | -74.043845 | 46532 | Subscriber | 1981 | 1 | 2 | 2020-10-31 | October | 2020 | 23 |
350350 | 338 | 2020-10-31 23:29:49.305 | 2020-10-31 23:35:27.868 | 3195 | Sip Ave | 40.730897 | -74.063913 | 3679 | Bergen Ave | 40.722104 | -74.071455 | 26139 | Subscriber | 1999 | 1 | 6 | 2020-10-31 | October | 2020 | 23 |
350351 | 1054 | 2020-10-31 23:32:22.865 | 2020-10-31 23:49:57.532 | 3198 | Heights Elevator | 40.748716 | -74.040443 | 3678 | Fairmount Ave | 40.725726 | -74.071959 | 17231 | Customer | 1969 | 0 | 18 | 2020-10-31 | October | 2020 | 23 |
350352 | 156 | 2020-10-31 23:33:22.731 | 2020-10-31 23:35:59.493 | 3185 | City Hall | 40.717732 | -74.043845 | 3681 | Grand St | 40.715178 | -74.037683 | 46532 | Subscriber | 1981 | 1 | 3 | 2020-10-31 | October | 2020 | 23 |
350353 | 395 | 2020-10-31 23:40:06.353 | 2020-10-31 23:46:42.064 | 3481 | York St | 40.716490 | -74.041050 | 3273 | Manila & 1st | 40.721651 | -74.042884 | 45647 | Subscriber | 1967 | 1 | 7 | 2020-10-31 | October | 2020 | 23 |
350354 rows × 20 columns
We will now drop the user type (customer vs. subscriber) and bike id columns as they are not relevant to our analysis. In general, it is a good idea to make sure the working data is as concise as possible to minimize confusion and add clarity.
# drop columns from dataframe
rides = rides.drop(columns=['usertype', 'bikeid'])
rides
tripduration | starttime | stoptime | start station id | start station name | start station latitude | start station longitude | end station id | end station name | end station latitude | end station longitude | birth year | gender | tripduration_minutes | trip_date | trip_month | trip_year | trip_tod | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1575 | 2019-01-01 12:43:38.643 | 2019-01-01 13:09:54.528 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3638 | Washington St | 40.724294 | -74.035483 | 1969 | 0 | 26 | 2019-01-01 | January | 2019 | 12 |
1 | 737 | 2019-01-01 12:56:53.204 | 2019-01-01 13:09:11.040 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3205 | JC Medical Center | 40.716540 | -74.049638 | 1993 | 1 | 12 | 2019-01-01 | January | 2019 | 12 |
2 | 917 | 2019-01-01 13:03:44.776 | 2019-01-01 13:19:02.769 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3277 | Communipaw & Berry Lane | 40.714358 | -74.066611 | 1986 | 1 | 15 | 2019-01-01 | January | 2019 | 13 |
3 | 3248 | 2019-01-01 13:12:03.128 | 2019-01-01 14:06:12.040 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3196 | Riverview Park | 40.744319 | -74.043991 | 1992 | 1 | 54 | 2019-01-01 | January | 2019 | 13 |
4 | 3168 | 2019-01-01 13:13:12.045 | 2019-01-01 14:06:00.411 | 3183 | Exchange Place | 40.716247 | -74.033459 | 3196 | Riverview Park | 40.744319 | -74.043991 | 1969 | 0 | 53 | 2019-01-01 | January | 2019 | 13 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
350349 | 131 | 2020-10-31 23:24:41.804 | 2020-10-31 23:26:53.234 | 3681 | Grand St | 40.715178 | -74.037683 | 3185 | City Hall | 40.717732 | -74.043845 | 1981 | 1 | 2 | 2020-10-31 | October | 2020 | 23 |
350350 | 338 | 2020-10-31 23:29:49.305 | 2020-10-31 23:35:27.868 | 3195 | Sip Ave | 40.730897 | -74.063913 | 3679 | Bergen Ave | 40.722104 | -74.071455 | 1999 | 1 | 6 | 2020-10-31 | October | 2020 | 23 |
350351 | 1054 | 2020-10-31 23:32:22.865 | 2020-10-31 23:49:57.532 | 3198 | Heights Elevator | 40.748716 | -74.040443 | 3678 | Fairmount Ave | 40.725726 | -74.071959 | 1969 | 0 | 18 | 2020-10-31 | October | 2020 | 23 |
350352 | 156 | 2020-10-31 23:33:22.731 | 2020-10-31 23:35:59.493 | 3185 | City Hall | 40.717732 | -74.043845 | 3681 | Grand St | 40.715178 | -74.037683 | 1981 | 1 | 3 | 2020-10-31 | October | 2020 | 23 |
350353 | 395 | 2020-10-31 23:40:06.353 | 2020-10-31 23:46:42.064 | 3481 | York St | 40.716490 | -74.041050 | 3273 | Manila & 1st | 40.721651 | -74.042884 | 1967 | 1 | 7 | 2020-10-31 | October | 2020 | 23 |
350354 rows × 18 columns
Since our dataset contains both station ID and station name, it can be helpful to create a dataframe that will map a given station ID to its respective name. This will help us in the future when we need to use numeric data for our analysis.
# group by station id and name to get each unique pairing
station_mappings = rides.groupby(['start station id', 'start station name']).size().reset_index(name='Count')
# drop the count column, leaving only the id-name pairs
station_mappings = station_mappings.drop(["Count"], axis=1)
# make start station id the new index
station_mappings = station_mappings.set_index('start station id')
# display dataframe
station_mappings
start station name | |
---|---|
start station id | |
3183 | Exchange Place |
3184 | Paulus Hook |
3185 | City Hall |
3186 | Grove St PATH |
3187 | Warren St |
3191 | Union St |
3192 | Liberty Light Rail |
3193 | Lincoln Park |
3194 | McGinley Square |
3195 | Sip Ave |
3196 | Riverview Park |
3198 | Heights Elevator |
3199 | Newport Pkwy |
3201 | Dey St |
3202 | Newport PATH |
3203 | Hamilton Park |
3205 | JC Medical Center |
3206 | Hilltop |
3207 | Oakland Ave |
3209 | Brunswick St |
3210 | Pershing Field |
3211 | Newark Ave |
3212 | Christ Hospital |
3213 | Van Vorst Park |
3214 | Essex Light Rail |
3220 | 5 Corners Library |
3225 | Baldwin at Montgomery |
3267 | Morris Canal |
3268 | Lafayette Park |
3269 | Brunswick & 6th |
3270 | Jersey & 6th St |
3272 | Jersey & 3rd |
3273 | Manila & 1st |
3275 | Columbus Drive |
3276 | Marin Light Rail |
3277 | Communipaw & Berry Lane |
3278 | Monmouth and 6th |
3279 | Dixon Mills |
3280 | Astor Place |
3281 | Leonard Gordon Park |
3426 | JCBS Depot |
3481 | York St |
3483 | Montgomery St |
3638 | Washington St |
3639 | Harborside |
3640 | Journal Square |
3677 | Glenwood Ave |
3678 | Fairmount Ave |
3679 | Bergen Ave |
3681 | Grand St |
3694 | Jackson Square |
3791 | Hoboken Ave at Monmouth St |
3792 | Columbus Dr at Exchange Pl |
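An equivalent and slightly more direct way to build this id-to-name mapping is to drop duplicate pairs, skipping the intermediate count column. A sketch on a toy version of the rides table:

```python
import pandas as pd

# Toy rides table; the real one has the same two columns among others.
rides = pd.DataFrame({
    "start station id":   [3183, 3183, 3185],
    "start station name": ["Exchange Place", "Exchange Place", "City Hall"],
})
# Keep one row per unique (id, name) pair and index by id.
mapping = (rides[["start station id", "start station name"]]
           .drop_duplicates()
           .set_index("start station id"))
print(mapping.loc[3185, "start station name"])  # City Hall
```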
In order to simplify further analysis, we will create two additional dataframes - one for all rides in 2019 and one for all rides in 2020.
rides_2019 = rides.loc[rides['trip_year'] == 2019]
rides_2020 = rides.loc[rides['trip_year'] == 2020]
We can now progress to the "data exploration" phase of our analysis. This stage offers insights into underlying trends in our data, as well as a starting point for further exploration and analysis. We seek to better understand the data, reassess any assumptions we may have implicitly made about the dataset, and clarify any misunderstandings before the data analysis section.
Descriptive statistics will make the data visualization process simpler and more effective. Here we seek to understand the data in a meaningful way that will allow for a simplified interpretation of the Citi Bike dataset overall. We also want to make sure that we do not distort our original data or overshadow any important details. Luckily, pandas has a describe function built into its library that we can make use of to get these statistics.
rides_2019.describe()
tripduration | start station id | start station latitude | start station longitude | end station id | end station latitude | end station longitude | birth year | gender | tripduration_minutes | trip_year | trip_tod | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 199910.000000 | 199910.000000 | 199910.000000 | 199910.000000 | 199910.000000 | 199910.000000 | 199910.000000 | 199910.000000 | 199910.000000 | 199910.000000 | 199910.0 | 199910.000000 |
mean | 566.726822 | 3293.810170 | 40.722743 | -74.046387 | 3292.022870 | 40.722335 | -74.045840 | 1981.415072 | 1.163629 | 9.448752 | 2019.0 | 13.595038 |
std | 813.039952 | 170.892349 | 0.007063 | 0.010896 | 181.114428 | 0.006894 | 0.010872 | 10.221026 | 0.516599 | 13.551119 | 0.0 | 5.087196 |
min | 90.000000 | 3183.000000 | 40.709651 | -74.083639 | 116.000000 | 40.701403 | -74.083639 | 1888.000000 | 0.000000 | 2.000000 | 2019.0 | 0.000000 |
25% | 236.000000 | 3195.000000 | 40.718355 | -74.050444 | 3192.000000 | 40.718355 | -74.050389 | 1975.000000 | 1.000000 | 4.000000 | 2019.0 | 9.000000 |
50% | 347.000000 | 3210.000000 | 40.721525 | -74.043845 | 3207.000000 | 40.721124 | -74.043117 | 1984.000000 | 1.000000 | 6.000000 | 2019.0 | 14.000000 |
75% | 583.000000 | 3276.000000 | 40.727224 | -74.038051 | 3276.000000 | 40.727224 | -74.038051 | 1989.000000 | 1.000000 | 10.000000 | 2019.0 | 18.000000 |
max | 17852.000000 | 3792.000000 | 40.748716 | -74.032108 | 3792.000000 | 40.814326 | -73.932077 | 2003.000000 | 2.000000 | 298.000000 | 2019.0 | 23.000000 |
rides_2020.describe()
tripduration | start station id | start station latitude | start station longitude | end station id | end station latitude | end station longitude | birth year | gender | tripduration_minutes | trip_year | trip_tod | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 150444.000000 | 150444.000000 | 150444.000000 | 150444.000000 | 150444.000000 | 150444.000000 | 150444.000000 | 150444.000000 | 150444.000000 | 150444.000000 | 150444.0 | 150444.000000 |
mean | 1190.597132 | 3304.291736 | 40.722243 | -74.046737 | 3303.386689 | 40.721718 | -74.045877 | 1981.331093 | 1.095664 | 19.846288 | 2020.0 | 14.450380 |
std | 1562.682195 | 185.064015 | 0.007389 | 0.011933 | 191.005562 | 0.105247 | 0.191272 | 11.013114 | 0.633380 | 26.046190 | 0.0 | 4.794887 |
min | 90.000000 | 3184.000000 | 40.709651 | -74.083639 | 82.000000 | 0.000000 | -74.083639 | 1888.000000 | 0.000000 | 2.000000 | 2020.0 | 0.000000 |
25% | 333.000000 | 3195.000000 | 40.716870 | -74.051789 | 3195.000000 | 40.716870 | -74.050656 | 1969.000000 | 1.000000 | 6.000000 | 2020.0 | 11.000000 |
50% | 632.000000 | 3209.000000 | 40.721124 | -74.043991 | 3209.000000 | 40.721124 | -74.043845 | 1984.000000 | 1.000000 | 11.000000 | 2020.0 | 15.000000 |
75% | 1459.000000 | 3276.000000 | 40.727224 | -74.037683 | 3276.000000 | 40.727224 | -74.037683 | 1990.000000 | 2.000000 | 24.000000 | 2020.0 | 18.000000 |
max | 18018.000000 | 3792.000000 | 40.748716 | -74.032108 | 4069.000000 | 40.848282 | 0.000000 | 2004.000000 | 2.000000 | 300.000000 | 2020.0 | 23.000000 |
From the 2019 data output, we can see that the mean trip duration is 9.45 minutes. This is much lower than the 2020 mean of 19.85 minutes. However, it is important to note that the standard deviations for these groups are 13.55 and 26.05 minutes, respectively. This means the data is very spread out rather than clustered around the mean, so using the mean as a statistic to represent the entire dataset may be misleading. Visit this resource to find out more about the role standard deviation plays when analyzing data.
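A quick toy illustration (not the Citi Bike data) of why a high standard deviation makes the mean a shaky summary: a few very long trips pull the mean far above the median.

```python
import pandas as pd

# Mostly short trips plus a handful of very long ones (toy minutes):
durations = pd.Series([4, 5, 6, 7, 8, 9, 10, 120, 180, 240])
print(durations.mean())    # 58.9 - dragged up by the long trips
print(durations.median())  # 8.5  - stays near the typical trip length
```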
When taking into consideration the year the customers were born, we can see that the range is 1888 - 2004. If a customer were truly born in 1888, they would currently be 132 years old. This is interesting to note, as it shows that users may enter incorrect birth dates. From a data science perspective, it also shows that one cannot rely on user-entered data to be accurate, and must be careful about making assumptions based on data that may not be correct to begin with. Thus, for our purposes, we will not be further exploring birth year or using it in our analysis.
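Although we will not use birth year here, a simple sanity filter could flag implausible values. The sketch below assumes a plausible rider age range of 16 - 90, which is our own assumption rather than a Citi Bike rule:

```python
import pandas as pd

# Flag riders whose implied age (as of 2020) falls outside 16-90
# (an assumed plausibility range, not an official one).
df = pd.DataFrame({"birth year": [1888, 1969, 1993, 2004]})
age = 2020 - df["birth year"]
plausible = age.between(16, 90)
print(plausible.tolist())  # [False, True, True, True]
```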
Here we would like to gain some insights into the demographics that make up our population. As mentioned above, we will not be looking into the ages of our users due to inaccurate birth dates. However, we can take a look at the gender makeup of the user population.
We can use a grouped bar chart from Plotly to see the breakdown by gender for both 2019 and 2020. Coloring the groups and separating by year lets us see all of the data on a single graph rather than spreading the counts over multiple graphs.
# create bars of counts for each gender
fig = px.histogram(rides, x="trip_year", color="gender", barmode="group")
# sets axis labels and centers title
fig.update_layout(xaxis_title="Trip Year", yaxis_title="Count", title="Number of Riders Per Year Divided by Gender")
fig.update(layout=dict(title=dict(x=0.5)))
# display graph
fig.show()
With 1 representing male riders, 2 representing female riders, and 0 representing unknown genders, it appears that in both years there is a higher proportion of male riders than female or unknown-gender riders.
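For more readable plot legends, the numeric codes could optionally be mapped to labels first. A sketch using the same 0/1/2 encoding on toy data:

```python
import pandas as pd

# Map numeric gender codes to human-readable labels before plotting.
labels = {0: "Unknown", 1: "Male", 2: "Female"}
df = pd.DataFrame({"gender": [1, 2, 0, 1]})
df["gender_label"] = df["gender"].map(labels)
print(df["gender_label"].tolist())  # ['Male', 'Female', 'Unknown', 'Male']
```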
Since we will be conducting some analysis into the popularity of different stations, and since this data revolves around stations and Citi Bike trips, let’s map out where these stations are.
We'll start by creating two dataframes: one of all the unique start stations, and one of all the unique end stations.
# only keep one occurrence of each start station
start_stations = rides.drop_duplicates('start station id')
start_stations = start_stations.reset_index(drop=True)
# only keep one occurrence of each end station
end_stations = rides.drop_duplicates('end station id')
end_stations = end_stations.reset_index(drop=True)
From here, we can extract the respective latitudes and longitudes to position our markers, and then use Folium to display it on a map. Each marker is clickable to reveal the name of the station it represents.
# extract latitude and longitude of all end stations
end_locations = end_stations[['end station latitude', 'end station longitude']]
locationlist = end_locations.values.tolist()
# create base folium map
marker_map = folium.Map(location=[40.7440, -74.0060], zoom_start=12)
# loop through all station points and add a marker with popup
for point in range(len(locationlist)):
    folium.Marker(locationlist[point], popup=end_stations['end station name'][point]).add_to(marker_map)
# add title
title_html = '''
<h3 align="center" style="font-size:20px"><b>Station Locations</b></h3>
'''
marker_map.get_root().html.add_child(folium.Element(title_html))
# display map
marker_map
From this map, we can see that there is a large cluster of stations just south of Hoboken, and several stations spread out over New York City. Based on this, we can expect to see a lot of activity in those areas, since stations are presumably placed where ridership is high.
From our descriptive statistics, we know that there is a high standard deviation for the trip durations in both years. To see why this is the case, let's look at the distribution of trip duration in a histogram. Histograms are useful for summarizing numerical data by showing how many points fall within a specific range of values (learn more here).
Once again, we will use Plotly to show the histograms for each year overlaid on each other so we can compare both years with a mutual axis.
# create overlaid histogram with 75 bins
fig = px.histogram(rides, x="tripduration_minutes", nbins=75, color="trip_year", barmode="overlay")
# sets axis labels and centers title
fig.update_layout(xaxis_title="Trip Duration", yaxis_title="Count", title="Histogram of Trip Duration")
fig.update(layout=dict(title=dict(x=0.5)))
# displays graph
fig.show()
We can see that trip durations spread from 0 to 300 minutes, with most trips falling under 1 hour in length. The most common duration range was 5 - 9 minutes, which makes sense since getting from one station to another by bike should not take too long. The reason our standard deviation was so high is that, although most trips are concentrated under an hour, there are still many trips falling between 1 and 5 hours.
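To put a number on that long tail, we can compute the share of rides longer than an hour and compare standard deviations with and without it. Below is a minimal sketch on a toy Series; on the real data you would use the rides['tripduration_minutes'] column instead:

```python
import pandas as pd

# toy stand-in for rides['tripduration_minutes']
durations = pd.Series([6, 8, 9, 12, 25, 45, 70, 130, 290])

# share of trips longer than an hour
long_share = (durations > 60).mean()

# standard deviation with and without the long tail
sd_all = durations.std()
sd_short = durations[durations <= 60].std()
print(round(long_share, 3), sd_all > sd_short)
```

Even a small share of multi-hour trips is enough to inflate the overall standard deviation, which matches what we saw in the descriptive statistics.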
The output above also yields some interesting insight into the difference in duration between the two years. Because there were fewer rides overall in 2020, we would expect the counts for the red histogram (2020) to always lie under the purple histogram (2019). However, for trip durations ranging from roughly 15 to 120 minutes, the count for 2020 is actually higher than the count for 2019. This shows us that in 2020, people tended to take longer rides than in 2019.
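Because the two years have different totals, comparing raw bin counts can be misleading; normalizing within each year puts the distributions on the same footing. Here is a sketch on toy data (with Plotly, passing histnorm="percent" to px.histogram would achieve a similar effect):

```python
import pandas as pd

# toy rides: durations in minutes, tagged by year
df = pd.DataFrame({
    "trip_year": [2019] * 6 + [2020] * 3,
    "tripduration_minutes": [5, 7, 9, 20, 30, 200, 40, 60, 90],
})

# bin the durations, then compute each bin's share within its year
bins = pd.cut(df["tripduration_minutes"], bins=[0, 15, 60, 300])
shares = pd.crosstab(df["trip_year"], bins, normalize="index")
print(shares)
```

Each row of `shares` now sums to 1, so the two years can be compared bin by bin regardless of how many total rides each year had.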
The last part of our exploratory analysis will consist of figuring out what the top 10 trips for each year are. Since each ride has a start and stop station, we define a trip as the ride from Station A to Station B. Some of these trips can be loops, indicating that the user started and returned to the same station in their trip.
We can find the top trips using pandas' groupby, which allows you to group entries in a DataFrame by certain values and perform operations on each group.
# function that creates a dataframe of most popular trips with counts of rides per station
def most_popular_trips(rides):
    # group rides by start and end station and count rides per pair
    trips = rides.groupby(['start station name', 'end station name']).size().reset_index(name='Number of Trips')
    # sort trips by descending number of trips
    trips = trips.sort_values('Number of Trips', ascending=False)
    trips['start station name'] = trips['start station name'].astype(str)
    trips['end station name'] = trips['end station name'].astype(str)
    # rename trip to readable "A to B" format
    trips["Trip"] = trips['start station name'] + " to " + trips['end station name']
    # drop separate start and end station names, now captured in the Trip column
    trips = trips.drop(['start station name', 'end station name'], axis=1)
    trips = trips.reset_index()
    return trips
# top trips for 2019
trips_2019 = most_popular_trips(rides_2019)
trips_2019.head(10)
| | index | Number of Trips | Trip |
|---|---|---|---|
| 0 | 973 | 4429 | Hamilton Park to Grove St PATH |
| 1 | 919 | 2982 | Grove St PATH to Hamilton Park |
| 2 | 216 | 2673 | Brunswick & 6th to Grove St PATH |
| 3 | 1725 | 2215 | Marin Light Rail to Grove St PATH |
| 4 | 1377 | 2140 | Jersey & 6th St to Grove St PATH |
| 5 | 935 | 2079 | Grove St PATH to Marin Light Rail |
| 6 | 1829 | 2002 | Monmouth and 6th to Grove St PATH |
| 7 | 265 | 1985 | Brunswick St to Grove St PATH |
| 8 | 623 | 1935 | Dixon Mills to Grove St PATH |
| 9 | 1808 | 1865 | McGinley Square to Sip Ave |
# top trips for 2020
trips_2020 = most_popular_trips(rides_2020)
trips_2020.head(10)
| | index | Number of Trips | Trip |
|---|---|---|---|
| 0 | 1541 | 3537 | Liberty Light Rail to Liberty Light Rail |
| 1 | 2067 | 2284 | Newport Pkwy to Newport Pkwy |
| 2 | 1702 | 1440 | Marin Light Rail to Marin Light Rail |
| 3 | 1189 | 1193 | JC Medical Center to JC Medical Center |
| 4 | 1594 | 1127 | Lincoln Park to Lincoln Park |
| 5 | 975 | 1036 | Harborside to Harborside |
| 6 | 1765 | 1035 | McGinley Square to Sip Ave |
| 7 | 921 | 1000 | Hamilton Park to Hamilton Park |
| 8 | 920 | 907 | Hamilton Park to Grove St PATH |
| 9 | 2315 | 889 | Sip Ave to McGinley Square |
It looks like the station "Grove St PATH" was extremely popular in 2019, but not so much in 2020. It's possible that this station is in a very popular area that was much less populated during the pandemic. It is also interesting to note that almost all of the top trips in 2020 are loops. This could be because riders in 2019 used Citi Bike to get from one place to another, but riders in 2020 were more along the lines of "joyriders" who went for a quick ride and returned to where they started. This could be due to the fact that office buildings were closed with the lockdown, so more users were visitors to NYC who fell under a "tourist" label.
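We can put a number on the "joyrider" claim by computing the share of loop trips per year. Here is a minimal sketch on toy data; against the real data, the same comparison of the start station name and end station name columns would apply:

```python
import pandas as pd

# toy rides tagged by year
df = pd.DataFrame({
    "trip_year": [2019, 2019, 2019, 2020, 2020, 2020],
    "start station name": ["Grove St PATH", "Hamilton Park", "Sip Ave",
                           "Newport Pkwy", "Lincoln Park", "Sip Ave"],
    "end station name": ["Hamilton Park", "Grove St PATH", "Sip Ave",
                         "Newport Pkwy", "Lincoln Park", "McGinley Square"],
})

# a loop trip starts and ends at the same station
df["is_loop"] = df["start station name"] == df["end station name"]
loop_share = df.groupby("trip_year")["is_loop"].mean()
print(loop_share)
```

A noticeably higher loop share in 2020 would support the idea that riders shifted from point-to-point commuting toward round trips.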
Now that we have a much better understanding of what our data contains, we can go ahead and begin the "data analysis" stage. This is the core stage in the data science pipeline that involves analyzing and modeling the data so that we can address our original hypothesis.
Timeseries visualizations are a good way to identify trends in your data as time goes on. They allow us to see the direction or movement of certain variables in our data, and offer insights into areas to focus our analysis on. It is important to note that a normal time series with many dates can have a lot of fluctuation, but it is still useful to note trends.
Let's make a timeseries for our trip data. We will once again use groupby to get our counts per date, and we'll use Plotly to visualize the graph.
# group all rides by date of ride and get counts per date
date_groups = rides.groupby("trip_date").starttime.count()
# create timeseries Count vs Date
fig = px.line(x=date_groups.index, y=date_groups)
# sets axis labels and centers title
fig.update_layout(xaxis_title="Trip Date", yaxis_title="Count", title="Timeseries of Number of Rides Over Time")
fig.update(layout=dict(title=dict(x=0.5)))
# display timeseries
fig.show()
From our timeseries plot, we can make a few key observations. First, in 2019, we can see a spike in ridership in the summer months, possibly due to good weather. Then, there is a dip on September 2nd, or Labor Day; we might hypothesize that this is due to people spending the holiday at home with family. In terms of seasonality, we see a downward trend in ridership as the weather gets colder toward December.
In 2020, we can see some interesting deviations from our 2019 trends, most of which can be mapped to different phases of NYC's plan to halt the spread of the COVID-19 virus (learn more here). March 23rd was NYC’s first official day of lockdown, and we can see in our plot that this was the lowest point in the time series with only 24 rides. The fact that there are any rides at all may be due to essential workers. The months of March and April were uncharacteristically low, followed by a spike in June. This spike can be tied to phase one of NYC's reopening plan, which began on June 8th.
One of the most interesting parts of this visualization is the very large spike at the beginning of October 2020. These few days had the highest trip counts, even compared to 2019. Looking at NYC's reopening plan, we can see that phase four of the plan commenced on October 8th, marking the opening of museums, botanical gardens, and gyms. With thousands of people stuck at home for months beforehand, this reopening could have driven a very large spike in visits to NYC, thus increasing ridership significantly.
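Since a daily series fluctuates heavily, a rolling average can make these trends and spikes easier to read. Below is a sketch on a toy series; on the real data, date_groups.rolling(7).mean() would smooth our daily counts the same way:

```python
import pandas as pd

# toy daily ride counts standing in for date_groups
counts = pd.Series([100, 120, 90, 24, 30, 150, 200, 210, 190],
                   index=pd.date_range("2020-03-20", periods=9))

# 7-day rolling mean smooths day-to-day noise
smoothed = counts.rolling(window=7, min_periods=1).mean()

# the raw series still reveals the lowest single day
print(counts.idxmin().date(), counts.min())
```

Plotting the smoothed series alongside the raw one keeps the broad trend visible without hiding one-day anomalies like the lockdown minimum.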
Now that we have established the raw count differences over time, let's see if we can break them down by location and analyze the popularity of different trips and stations in more depth. In the exploration phase, we did some elementary counts of the most popular stations and trips, but in order to see where ridership is physically concentrated and how it changes over time, let's create some maps using Folium.
Even though we listed the 10 most popular trips in the exploratory phase, it can be extremely helpful to be able to visualize their locations. We'll be using Folium's PolyLine feature to connect two station markers.
# function that searches the rides dataframe to extract the starting and ending latitude and longitude based on station name
# sets the columns of dataframe with correct values respectively
def get_lat_long(row):
    # extract station names
    start_station = row.start_station
    end_station = row.end_station
    # look up respective coordinates and set row values
    row.start_lat = rides.loc[rides['start station name'] == start_station, 'start station latitude'].iloc[0]
    row.start_long = rides.loc[rides['start station name'] == start_station, 'start station longitude'].iloc[0]
    row.end_lat = rides.loc[rides['end station name'] == end_station, 'end station latitude'].iloc[0]
    row.end_long = rides.loc[rides['end station name'] == end_station, 'end station longitude'].iloc[0]
    # return updated row
    return row
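Note that the rides.loc lookup inside get_lat_long scans the full DataFrame for every row it is applied to. For larger datasets, an alternative sketch is to build a station-to-coordinates lookup table once (toy data below, assuming the same column names as our rides DataFrame):

```python
import pandas as pd

# toy stand-in for the rides DataFrame (coordinates are illustrative)
rides = pd.DataFrame({
    "start station name": ["Grove St PATH", "Hamilton Park", "Grove St PATH"],
    "start station latitude": [40.7196, 40.7274, 40.7196],
    "start station longitude": [-74.0433, -74.0441, -74.0433],
})

# one row per station, indexed by name, so each lookup is a single .loc
station_coords = (rides.drop_duplicates("start station name")
                       .set_index("start station name")
                       [["start station latitude", "start station longitude"]])
lat, lon = station_coords.loc["Hamilton Park"]
print(lat, lon)
```

With only five trips to map here the difference is negligible, but the lookup-table pattern scales much better than a per-row search.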
# function to create the map of popular trips
# takes in trips dataframe created in exploratory analysis
def map_popular_trips(trips):
    # extract top 5 trips
    trips = trips[:5]
    # split trip string to get start and stop station for each trip
    stations = trips["Trip"].str.split(" to ", n=1, expand=True)
    # initialize values
    trips["start_station"] = stations[0]
    trips["end_station"] = stations[1]
    trips["start_lat"] = 0
    trips["start_long"] = 0
    trips["end_lat"] = 0
    trips["end_long"] = 0
    # use get_lat_long function to populate coordinate columns
    trips = trips.apply(get_lat_long, axis=1)
    # set colors for lines
    colors = ["blue", "red", "green", "purple", "black"]
    # create base map
    tripmap = folium.Map(location=[40.7197, -74.0431], zoom_start=14)
    # loop through every trip; add a marker for each station and a line connecting the two
    for index, row in trips.iterrows():
        points = []
        points.append((row['start_lat'], row['start_long']))
        folium.Marker((row['start_lat'], row['start_long']), popup=row['start_station']).add_to(tripmap)
        points.append((row['end_lat'], row['end_long']))
        folium.Marker((row['end_lat'], row['end_long']), popup=row['end_station']).add_to(tripmap)
        folium.PolyLine(points, color=colors[index], weight=2.5, opacity=1).add_to(tripmap)
    # return populated map
    return tripmap
# map trips for 2019
trip2019 = map_popular_trips(trips_2019)
# add title
title_html = '''
<h3 align="center" style="font-size:20px"><b>Popular Trips in 2019</b></h3>
'''
trip2019.get_root().html.add_child(folium.Element(title_html))
trip2019
As we had seen in the chart of trips from our exploratory analysis, the top 5 paths all pass through Grove St PATH. From the map we can see that this station is in a very popular and central location. It is likely that this station is located near a lot of popular spots in the city, which leads to a higher ridership.
# map trips for 2020
trip2020 = map_popular_trips(trips_2020)
# add title
title_html = '''
<h3 align="center" style="font-size:20px"><b>Popular Trips in 2020</b></h3>
'''
trip2020.get_root().html.add_child(folium.Element(title_html))
trip2020
Unlike 2019, 2020 does not have any trips going through Grove St PATH. All of the popular trips are loops, and we can further justify our claim that the riders in 2020 are more likely to be "joyriders." If we look at where some of the stations are located, we can see that the stations are right next to Lincoln Park, Liberty State Park, Alexander F Santora Park, St Peters Field, and Elephant Park. These proximities lead us to believe that the users using these bikes are most likely taking a ride around the park to get some fresh air during lockdown.
Overall, this visualization helps us see the difference in the types of users based on their potential motives for using Citi Bike's services.
Now that we've seen the popular trips, let's break the data down even further by station. First, we will create a heatmap for each year showing the concentration of ridership in the NYC area. Heatmaps are a great way to see the clustering and magnitude of data using color. You can learn more about heatmaps here.
# import required folium plugins
from folium.plugins import HeatMap, HeatMapWithTime
# create base map
heatmap_2019 = folium.Map(location=[40.7440, -74.0060], zoom_start=12)
# add 2019 start and end station from coordinates to map
HeatMap(data=rides_2019[['start station latitude', 'start station longitude']], radius=10).add_to(heatmap_2019)
HeatMap(data=rides_2019[['end station latitude', 'end station longitude']], radius=10).add_to(heatmap_2019)
# add title
title_html = '''
<h3 align="center" style="font-size:20px"><b>Heatmap of Rides in 2019</b></h3>
'''
heatmap_2019.get_root().html.add_child(folium.Element(title_html))
# display map
heatmap_2019