My Google Data Analytics Capstone: Cyclistic Bike-Share Case Study
How does a bike-share navigate speedy success?
Background to the Study
The goals for which businesses are set up other than providing services, is profit making. And to identify factors that profits depend on so as to optimize its various dimensions, has been at the forefront of several business analyses in the past. In the same vein Cyclistic, a bike-share company launched in 2016, with a fleet of 5,824 bicycles that are geo-tracked and locked into a network of 692 stations across Chicago. Whose bikes can be unlocked from one station and returned to any other station in the system anytime has opted to undergo in this painstaking endeavor to convert their casual subscribers to annual members, thereby increasing their profit margin in the near future.
Cyclistic as would other businesses has a flexible pricing package that attracts more subscribers however, the annual members are more profitable compared to the casual riders and will be key to its future growth. So therefore, converting casual riders who differ from members in several ways but are already aware of the type of service Cyclistic render would be ideal than engaging in entirely different marketing campaign endeavors that attracts new customers.
Business Task
Cyclistic offers different packages of services to its bike riders. There are packages for single-ride subscribers, full-day subscribers, and annual subscribers. Subscribers who opted for single-ride or full-day subscription are referred to as casual riders, while subscribers who opted for annual memberships are Cyclistic members.
It is important to note that Cyclistic are open to several kinds of service they offer too, in that they offer reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who cannot use a standard two-wheeled bike. The majority of riders opt for traditional bikes, about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
It is in the effort to understand how annual members and casual riders use Cyclistic bikes differently that this work is conducted.
Stakeholders
- Lily Moreno — The director of marketing and my manager.
- Cyclistic marketing analytics team — A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide the company’s marketing strategy. Which I joined six month ago.
- Cyclistic executive team — The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
Ask
Below is the guiding question to help in the analysis:
- How do annual members and casual riders use Cyclistic bikes differently?
Prepare
Data for the study is Cyclistic’s historical trip data, which was made available by my manager in zip files. It was sourced from Motivate International Inc. under this license. Data needed for this particular project were previous 12 months of Cyclistic trip data ranging from August 2021 to July 2022, hence, the data is said to be reliable because they were the most recent customer data collected. Although it was sourced by a third-party, which may not be the same as in-house data.
ROCC Check
The reliability of the data is very high because it has a large sample size (over 5million observations in a 13 columns data frame.) However, the originality of the data is low because it was sourced from Motivate International Inc., a third-party firm. But the same cannot be said of its comprehensiveness which is high. The data has several variables than needed for this particular project. Also, the data is current. It is the previous 12months customer data collected from August 2021 to July 2022, hence the data is recent considering this project was conducted in August 2022. And finally, the data has been severally cited by Google Data Analytics program students in their capstone projects, thus the data has a medium citation.
I proceeded to extract the files into a new folder, using appropriate file-naming conventions. Due to the volume of the data, RStudio was utilized for the cleaning and final analysis of the data. To do that, tidyverse, for data import and wrangling, lubridate for date functions and ggplot for visualization packages were installed and loaded.
Process
Having confirmed the credibility and how relevant the data is to solving my business task, I proceeded to clean it so that my analysis will be error-free by importing the dataset into RStudio using read.csv() function.
library(tidyverse)
library(lubridate)
install.packages(“dplyr”)
library(dplyr)
library(lubridate)
getwd()
setwd(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder”)
aug_21_divvy_tripdata <- read.csv(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder/202108-divvy-tripdata.csv”)
View(aug_21_divvy_tripdata)
sep_21_divvy_tripdata <- read.csv(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder/202109-divvy-tripdata.csv”)
View(sep_21_divvy_tripdata)
oct_21_divvy_tripdata <- read.csv(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder/202110-divvy-tripdata.csv”)
View(oct_21_divvy_tripdata)
nov_21_divvy_tripdata <- read.csv(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder/202111-divvy-tripdata.csv”)
View(nov_21_divvy_tripdata)
dec_21_divvy_tripdata <- read.csv(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder/202112-divvy-tripdata.csv”)
View(dec_21_divvy_tripdata)
jan_22_divvy_tripdata <- read.csv(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder/202201-divvy-tripdata.csv”)
View(jan_22_divvy_tripdata)
feb_22_divvy_tripdata <- read.csv(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder/202202-divvy-tripdata.csv”)
View(feb_22_divvy_tripdata)
mar_22_divvy_tripdata <- read.csv(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder/202203-divvy-tripdata.csv”)
View(mar_22_divvy_tripdata)
apr_22_divvy_tripdata <- read.csv(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder/202204-divvy-tripdata.csv”)
View(apr_22_divvy_tripdata)
may_22_divvy_tripdata <- read.csv(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder/202205-divvy-tripdata.csv”)
View(may_22_divvy_tripdata)
jun_22_divvy_tripdata <- read.csv(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder/202206-divvy-tripdata.csv”)
View(jun_22_divvy_tripdata)
jul_22_divvy_tripdata <- read.csv(“C:/Users/ogbon/OneDrive/Desktop/Cyclistic_folder/202207-divvy-tripdata.csv”)
View(jul_22_divvy_tripdata)
I proceeded to view each of the dataset uploaded to get a glimpse of the data. I equally checked for column names to confirm that they were consistent with each other, this will go further to help in the analysis.
colnames(aug_21_divvy_tripdata)
colnames(sep_21_divvy_tripdata)
colnames(oct_21_divvy_tripdata)
colnames(nov_21_divvy_tripdata)
colnames(dec_21_divvy_tripdata)
colnames(jan_22_divvy_tripdata)
colnames(feb_22_divvy_tripdata)
colnames(mar_22_divvy_tripdata)
colnames(apr_22_divvy_tripdata)
colnames(may_22_divvy_tripdata)
colnames(jun_22_divvy_tripdata)
colnames(jul_22_divvy_tripdata)
Next, inspect the dataset to look for incongruencies
str(aug_21_divvy_tripdata)
str(sep_21_divvy_tripdata)
str(oct_21_divvy_tripdata)
str(nov_21_divvy_tripdata)
str(dec_21_divvy_tripdata)
str(jan_22_divvy_tripdata)
str(feb_22_divvy_tripdata)
str(mar_22_divvy_tripdata)
str(apr_22_divvy_tripdata)
str(may_22_divvy_tripdata)
str(jun_22_divvy_tripdata)
str(jul_22_divvy_tripdata)
The columns were also renamed in line with Divvy table design.
aug_21_divvy_tripdata <- aug_21_divvy_tripdata %>%
rename(“trip_id” = “ride_id”,
“bike_id” = “rideable_type”,
“start_time” = “started_at”,
“end_at” = “ended_at”,
“from_station_name” = “start_station_name”,
“from_station_id” = “start_station_id”,
“to_station_name” = “end_station_name”,
“to_station_id” = “end_station_id”,
“usertype” = “member_casual”)
sep_21_divvy_tripdata <- sep_21_divvy_tripdata %>%
rename(“trip_id” = “ride_id”,
“bike_id” = “rideable_type”,
“start_time” = “started_at”,
“end_at” = “ended_at”,
“from_station_name” = “start_station_name”,
“from_station_id” = “start_station_id”,
“to_station_name” = “end_station_name”,
“to_station_id” = “end_station_id”,
“usertype” = “member_casual”)
oct_21_divvy_tripdata <- oct_21_divvy_tripdata %>%
rename(“trip_id” = “ride_id”,
“bike_id” = “rideable_type”,
“start_time” = “started_at”,
“end_at” = “ended_at”,
“from_station_name” = “start_station_name”,
“from_station_id” = “start_station_id”,
“to_station_name” = “end_station_name”,
“to_station_id” = “end_station_id”,
“usertype” = “member_casual”)
nov_21_divvy_tripdata <- nov_21_divvy_tripdata %>%
rename(“trip_id” = “ride_id”,
“bike_id” = “rideable_type”,
“start_time” = “started_at”,
“end_at” = “ended_at”,
“from_station_name” = “start_station_name”,
“from_station_id” = “start_station_id”,
“to_station_name” = “end_station_name”,
“to_station_id” = “end_station_id”,
“usertype” = “member_casual”)
dec_21_divvy_tripdata <- dec_21_divvy_tripdata %>%
rename(“trip_id” = “ride_id”,
“bike_id” = “rideable_type”,
“start_time” = “started_at”,
“end_at” = “ended_at”,
“from_station_name” = “start_station_name”,
“from_station_id” = “start_station_id”,
“to_station_name” = “end_station_name”,
“to_station_id” = “end_station_id”,
“usertype” = “member_casual”)
jan_22_divvy_tripdata <- jan_22_divvy_tripdata %>%
rename(“trip_id” = “ride_id”,
“bike_id” = “rideable_type”,
“start_time” = “started_at”,
“end_at” = “ended_at”,
“from_station_name” = “start_station_name”,
“from_station_id” = “start_station_id”,
“to_station_name” = “end_station_name”,
“to_station_id” = “end_station_id”,
“usertype” = “member_casual”)
feb_22_divvy_tripdata <- feb_22_divvy_tripdata %>%
rename(“trip_id” = “ride_id”,
“bike_id” = “rideable_type”,
“start_time” = “started_at”,
“end_at” = “ended_at”,
“from_station_name” = “start_station_name”,
“from_station_id” = “start_station_id”,
“to_station_name” = “end_station_name”,
“to_station_id” = “end_station_id”,
“usertype” = “member_casual”)
mar_22_divvy_tripdata <- mar_22_divvy_tripdata %>%
rename(“trip_id” = “ride_id”,
“bike_id” = “rideable_type”,
“start_time” = “started_at”,
“end_at” = “ended_at”,
“from_station_name” = “start_station_name”,
“from_station_id” = “start_station_id”,
“to_station_name” = “end_station_name”,
“to_station_id” = “end_station_id”,
“usertype” = “member_casual”)
apr_22_divvy_tripdata <- apr_22_divvy_tripdata %>%
rename(“trip_id” = “ride_id”,
“bike_id” = “rideable_type”,
“start_time” = “started_at”,
“end_at” = “ended_at”,
“from_station_name” = “start_station_name”,
“from_station_id” = “start_station_id”,
“to_station_name” = “end_station_name”,
“to_station_id” = “end_station_id”,
“usertype” = “member_casual”)
may_22_divvy_tripdata <- may_22_divvy_tripdata %>%
rename(“trip_id” = “ride_id”,
“bike_id” = “rideable_type”,
“start_time” = “started_at”,
“end_at” = “ended_at”,
“from_station_name” = “start_station_name”,
“from_station_id” = “start_station_id”,
“to_station_name” = “end_station_name”,
“to_station_id” = “end_station_id”,
“usertype” = “member_casual”)
jun_22_divvy_tripdata <- jun_22_divvy_tripdata %>%
rename(“trip_id” = “ride_id”,
“bike_id” = “rideable_type”,
“start_time” = “started_at”,
“end_at” = “ended_at”,
“from_station_name” = “start_station_name”,
“from_station_id” = “start_station_id”,
“to_station_name” = “end_station_name”,
“to_station_id” = “end_station_id”,
“usertype” = “member_casual”)
jul_22_divvy_tripdata <- jul_22_divvy_tripdata %>%
rename(“trip_id” = “ride_id”,
“bike_id” = “rideable_type”,
“start_time” = “started_at”,
“end_at” = “ended_at”,
“from_station_name” = “start_station_name”,
“from_station_id” = “start_station_id”,
“to_station_name” = “end_station_name”,
“to_station_id” = “end_station_id”,
“usertype” = “member_casual”)
Next, I merged individual datasets into a big data frame. However, before doing that, I needed to check that the individual columns contain the same data type by running the glimpse() function. The outcome was clear to bind into one big data frame by running the below code.
previous_12months_trips <- bind_rows(aug_21_divvy_tripdata, sep_21_divvy_tripdata, oct_21_divvy_tripdata,
nov_21_divvy_tripdata, dec_21_divvy_tripdata, jan_22_divvy_tripdata,
feb_22_divvy_tripdata, apr_22_divvy_tripdata, may_22_divvy_tripdata,
jun_22_divvy_tripdata, jul_22_divvy_tripdata)
Next, I proceeded to remove columns that were not needed in this analysis by using the appropriate command. The process will help to reduce columns in the dataframe from 13 to 9.
previous_12months_trips <- previous_12months_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng))
Equally, I inspected the new data frame by running the following commands
colnames(previous_12months_trips) #List of column names
nrow(previous_12months_trips) #How many rows are in the data frame?
dim(previous_12months_trips) #Dimensions of the data frame?
head(previous_12months_trips) #See the first 6 rows of the data frame.
tail(previous_12months_trips) #See the last 6 rows of the data frame
str(previous_12months_trips) #See list of columns and data types (numeric, character, datetime)
summary(previous_12months_trips) #Statistical summary of data, for numeric data.
At this stage, having removed all columns that were not needed, I started adding columns that will aid in solving the business task. Columns that list the date, month, day, and year of each ride. This will allow us to aggregate ride data for each month and day. Before completing these operations we could only aggregate at the ride level, and below were the commands that helped in achieving that:
previous_12months_trips$date <- as.Date(previous_12months_trips$start_time) # The default format is yyyy-mm-dd
previous_12months_trips$month <- format(as.Date(previous_12months_trips$date), “%m”)
previous_12months_trips$day <- format(as.Date(previous_12months_trips$date), “%d”)
previous_12months_trips$year <- format(as.Date(previous_12months_trips$date), “%Y”)
previous_12months_trips$day_of_week <- format(as.Date(previous_12months_trips$date), “%A”)
previous_12months_trips$ride_length <- difftime(previous_12months_trips$end_at,previous_12months_trips$start_time)
I also inspected the structure of our new data frame to check that the above codes run properly, and further converted the ride_length column from Factor to Numeric to enable us to run calculations in the data.
str(previous_12months_trips)
is.factor(previous_12months_trips$ride_length)
previous_12months_trips$ride_length <- as.numeric(as.character(previous_12months_trips$ride_length))
is.numeric(previous_12months_trips$ride_length)
Now, there are some bad data in our frame which needs to be removed. This will be corrected by creating a cleaner version of our data frame (v2).
previous_12months_trips_v2 <- previous_12months_trips[!(previous_12months_trips$from_station_name == “HQ QR” | previous_12months_trips$ride_length<0),]
Analysis
Descriptive Analysis
At this stage of the project, the need to solve for mean, median, maximum and minimum values of the ride_length in seconds arises. Of course, solving for the measures of central tendency helps in answering our business task. To do that, below are the codes to execute
mean(previous_12months_trips_v2$ride_length) #Straight average (total ride length divide rides)
median(previous_12months_trips_v2$ride_length) #Midpoint number in ascending order of ride length
max(previous_12months_trips_v2$ride_length) #Longest ride
min(previous_12months_trips_v2$ride_length) #Least ride
Next, is data aggregation where I compared the two users, members and casual users. I equally solved for the average ride time by each day for members and casual users.
aggregate(previous_12months_trips_v2$ride_length ~ previous_12months_trips_v2$usertype, FUN = mean)
aggregate(previous_12months_trips_v2$ride_length ~ previous_12months_trips_v2$usertype, FUN = median)
aggregate(previous_12months_trips_v2$ride_length ~ previous_12months_trips_v2$usertype, FUN = max)
aggregate(previous_12months_trips_v2$ride_length ~ previous_12months_trips_v2$usertype, FUN = min)
aggregate(previous_12months_trips_v2$ride_length ~ previous_12months_trips_v2$usertype + previous_12months_trips_v2$day_of_week, FUN = mean)
However, the day of the week are not ordered which needs to be fixed by executing the below code
previous_12months_trips_v2$day_of_week <- ordered(previous_12months_trips_v2$day_of_week, levels=c(“Sunday”, “Monday”, “Tuesday”,”Wednesday”, “Thursday”, “Friday”, “Saturday”))
Next, I solved for the average ride time by each day for members and casual users.
aggregate(previous_12months_trips_v2$ride_length ~ previous_12months_trips_v2$usertype + previous_12months_trips_v2$day_of_week, FUN = mean)
I equally analyze ridership data by type and weekday
previous_12months_trips_v2 %>%
mutate(weekday = wday(start_time, label = TRUE)) %>% #Creates weekday field using wday
group_by(usertype,weekday) %>% #Groups by usertype and weekday
summarize(number_of_rides = n() #number of ride and average duration
,average_duration = mean(ride_length)) %>% #calculates the average duration
arrange(usertype,weekday) #sorts
Share
I am now in the share stage of my project, where I communicate my findings to my stakeholders in visualization. The ggplot2() will be utilized here, it has already been imported at the beginning of this work.
Visualize the number of rides by rider type.
From the column bar above, it can be seen that the annual members constitute the majority of rides in the Cyclistic bike-share system in the past one year under review. With about 3,379,154 rides (57.3%) as against the casual riders with about 2,522,160 rides (42.7%).
Visualize the number of rides by months.
From the chart above, it can be seen that there is an increase in bike rental for both users. With the summer period having the highest bike rental for both annual members and casual users. This trend continues for casual users until late August when there is a decline. However, the annual members continue this trend until late October and experience some form of decline in November, which is maintained until December.
I equally visualize the number of rides by week. This will further help in the analysis by showing the most popular day of the week where most rides occur.
Visualize the most popular day of the week.
Again, the above column chart shows that Saturday and Sunday is the most popular day of the week for both types of users. Although casual users tend to use Cyclistic bike-share more on weekends. Conversely, annual members happen to use Cyclistic rentals more on weekdays, while for casual users the weekday rental is low.
Act
At this stage, the need to summarize the key findings and make recommendations arises.
Findings
- Over 57% of Cyclistic users are annual members in the past 12months under review, from August 2021 to July 2022.
- While the remaining 42.7% users are made of the casual riders.
- The study also shows that the number of rides for both users are on the increase during the summer period.
- And that Saturday is the most popular day where the highest number of rides are recorded, followed by Sundays for both users.
- The study equally finds that annual members uses Cyclistic bikes more on weekdays whereas, the casual users prefer weekends.
With the above findings, one can conclude just as their name suggests, that the casual users use Cyclistic bikes for leisure most especially, during the months of summer. Still, looking at the third chart, we can also conclude that the annual members majorly commute to work with Cyclistic bikes or other non leisure activities.
Recommendations
This study is conducted at the behest of my manager to analyze how annual members and casual riders use Cyclistic bikes differently. So therefore, having gotten to this stage, the following are my recommendations:
- A marketing campaign should be targeted at the casual users during weekends and summer period, which this study has found to be highest for casual riders. Of course, some incentives should accompany such campaigns in the form of discounts for annual subscriptions. This approach will help the casual users make the needed switch.
- This study also found that casual users mostly rent for leisure purposes hence, I recommend that Cyclistic go into collaboration with hotels and gyms and other leisure outlets who in turn will recommend to casual users the benefit of becoming a Cyclistic member.
- Finally, it is important that Cyclistic maintain what it currently does to retaining the annual members, by sampling opinion on how best they can continue in serving them better.