Over the next 20 years, electric vehicles (EVs) are predicted to make up at least half of all passenger car sales. We found a dataset containing detailed information about EV charging sessions, and decided that its most beneficial use would be from a business perspective: informing the design and placement of additional EV charging stations.
We want to use this dataset to make recommendations to someone who is planning to build a new electric vehicle charging station. What insights can we give them to make it successful?
This dataset contains 3,395 rows that represent sessions from 85 Electric Vehicle drivers with repeat usage at 105 charging stations across 25 sites at a workplace charging program.
The dataset had many valuable variables, such as "Charging Duration" and "KwH Total Usage", which we used frequently as a basis for our analysis. On the other hand, there were columns such as reportedZip that only indicated whether or not a ZIP code was reported, rather than the ZIP code itself; the actual ZIP codes would have allowed us to map charging station usage by area. Other columns, "station id" and "location id", could have told us valuable information about the actual charging stations, but did not provide any reference for the ID values for us to use.
Finally, this dataset covers January 2014 to October 2015. More recent data would have made our findings more relevant to the current state of electric vehicles, given the newer models and large brands such as Tesla now in the market. However, the analysis we did complete gives a useful picture of the early state of electric vehicles.
The data was collected through the U.S. Department of Energy (DOE) Workplace Charging Challenge. All the drivers represented in this dataset are part of the DOE, and charging data was collected at workplace locations such as research and innovation centers, manufacturing and testing facilities, and office headquarters.
The creators of this dataset had completed a project titled "A Field Experiment on Workplace Norms and Electric Vehicle Charging Etiquette", which we did not consult. We found the dataset on the author's GitHub account, already in CSV format, and used the raw data for our own analysis. The data contained several NA values and columns stored as text, which we converted to numeric types to better suit our analysis. There was no statistical or exploratory data analysis that we used as a basis: after finding the dataset, we preprocessed and cleaned the raw data ourselves and came to our own conclusions without referencing the original authors' code or visualizations.
It would be beneficial if we had the specific locations of the charging stations rather than just an arbitrary number as a location ID, and the actual ZIP codes that users report rather than a 1 for reported and 0 for not reported. Both details would help us determine which part of the US has the most charging station usage, giving us an idea of where the best place to get started would be. Additionally, more specific information about electric vehicle sales would also help us decide where stations should be built.
According to the graph, most EV owners use iOS as their primary method of checking their charging status, while the web is clearly not preferred.
Since most of the charging stations in this dataset are part of the DOE's Workplace Charging Challenge, most of the charging is done during the week while EV owners are at work. Later, we will look at the relationship between the types of facilities where charging stations are located and the day of the week EV owners choose to charge their vehicles.
From the bar chart above, it's apparent that a majority of charging times last between 2-4 hours.
This time series shows EV owners' charging habits from November 2014 to July 2015. The frequency of charging seems to increase from May 2015 to the end of the series. This might reflect an influx of people ordering new electric vehicles (the Tesla Model X launched later that year), or people having more time to charge their vehicles during the summer.
The distribution of charge durations throughout the day can be grouped into clusters. The first, between 12am and 9am, is likely people leaving their EVs to charge overnight; the second covers business hours (9am-5pm), when EV owners are at work; and the third is after work, when people charge their cars at home.
By looking at a single user's charging habits, we get a good idea of realistic charging durations. In this case, the user (who had the highest number of charges) follows the general population of this dataset in that their charge time is frequently around 2 hours long.
There are several possible conclusions we can draw from this histogram of facility types. Charging stations are heavily concentrated at Research and Development facilities; therefore, more people charge their EVs there. It is also possible that a majority of Workplace Charging Challenge participants who own an electric vehicle work at R&D facilities.
We wouldn’t really say that our question changed after EDA, but we were definitely able to find out some key points that would help us in answering our overall question. For example, we could tell what platform was most popular for use in viewing charging status. We were able to rule out creating a web platform right away since there were so few users compared to Android and iOS. We were also able to find out some information about how long users typically charge their cars for and what days of the week are the least busy - making them the best days to schedule repairs and maintenance. This all gave us an initial expectation for charging station usage trends, but we still had a lot of information we wanted to find out through other tests and analysis.
We first selected the Z-test because we wanted to see what percentage of users fell within a certain range. We decided to use the Z-test to look at: 1) the distance people drove to reach a station; 2) how long it takes to charge a car; and 3) how many kWh were used in a typical charging session.
We also expected a linear relationship between charge time and cost, i.e. a longer charge time would cost more. So we decided to use both a linear and a logistic regression model to examine this relationship.
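The logistic regression itself is not coded up in this notebook, so here is a minimal sketch of one way it could be framed, assuming we binarize the outcome (did the session cost anything?) and regress it on charge duration. The data below is synthetic stand-in data, not the actual dataset; the real version would use `df['chargeTimeHrs']` and `df['dollars'] > 0`. The fit is done by maximizing the log-likelihood with scipy, which we already use elsewhere.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-in data: charge durations and a paid/free indicator
# that depends (noisily) on duration, roughly mimicking the dataset.
rng = np.random.default_rng(0)
hours = rng.uniform(0.5, 8.0, size=300)                        # stand-in for chargeTimeHrs
paid = (hours + rng.normal(0, 1.0, 300) > 3.0).astype(float)   # stand-in for dollars > 0

def neg_log_lik(beta):
    # Negative log-likelihood of the logistic model P(paid) = sigmoid(b0 + b1*hours)
    b0, b1 = beta
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * hours)))
    eps = 1e-9  # guard against log(0)
    return -np.sum(paid * np.log(p + eps) + (1 - paid) * np.log(1 - p + eps))

b0, b1 = minimize(neg_log_lik, x0=[0.0, 0.0]).x
p6 = 1.0 / (1.0 + np.exp(-(b0 + b1 * 6.0)))   # fitted P(paid) for a 6-hour session
print(round(p6, 3))
```

With a positive fitted slope, longer sessions map to a higher probability of being a paid session, which is the qualitative relationship we expected.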
In addition, we decided to use ANOVA to see: 1) if there was a variation in the number of charge sessions related to the day of the week; 2) if there was a significant difference in the means of the charge times at each of the four facilities. Since we had two or more groups in both of these examples, ANOVA seemed like the best way to figure out the statistical differences.
Finally, we used the Chi-Square test to see if the day of the week and facility type have any relationship to charging. We chose the Chi-Square test here because we had two nominal variables - facility type and day of the week - and we wanted to see whether the proportions of one variable differ across values of the other. Figuring out which day of the week drives the most traffic at each facility will help us determine which facilities, on which days, will be the most profitable for us.
We found that 90% of people drove <= 33.29 miles; 50% drove <= 18.65 miles. The distance a customer would drive to reach a charging station would be useful to keep in mind when constructing a new charging station. We also found that 22.4% of people drove 10 miles or less to get to the charging station. Just looking at our graph, we can see that these statistics appear correct.
We found that 90% of charge times were <= 4.77 hours; 50% were <= 2.84 hours. Again, looking at our chart, this makes intuitive sense. This seems to suggest that, on average, 2-3 people could charge their cars at one station per workday.
We found that 90% of sessions used <= 9.52 kWh; 50% of sessions used <= 5.81 kWh. It would be very useful when setting up a business model to know how much electricity each customer is likely to use.
Not surprisingly, we found a linear relationship between charge time and dollars spent.
We created a linear regression model that predicts the value of the continuous, dependent variable "dollars" based on the other independent variables in the dataset. Of all the columns, "chargeTimeHrs", which represents charging duration, had the greatest influence on the "dollars" variable, which makes sense: the longer one charges their car, the more they pay for electricity. The R-squared value for the influence of "chargeTimeHrs" on "dollars" is 0.970. R-squared is the percentage of the dependent variable's variation that the linear model explains, so in this case chargeTimeHrs explains 97% of the variation in the dollars column. The coefficient of 0.9192 indicates that as the independent variable (chargeTimeHrs) increases, the values in the dollars column also increase. So, we can conclude that there is a high correlation between charging duration and dollars spent; this is similar to paying for gas for an ICE (Internal Combustion Engine) vehicle.
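For the single-predictor version of this model, the slope and R-squared can be recovered with `scipy.stats.linregress`, the same function our plotting code uses. A minimal sketch on synthetic stand-in data (not the actual dataset) showing how R-squared is just the squared correlation coefficient:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data with a roughly linear dollars-vs-hours
# relationship, similar in shape to what the report describes.
rng = np.random.default_rng(1)
hours = rng.uniform(0.5, 8.0, size=500)            # stand-in for df['chargeTimeHrs']
dollars = 0.92 * hours + rng.normal(0, 0.3, 500)   # stand-in for df['dollars']

fit = stats.linregress(hours, dollars)
print("slope:", round(fit.slope, 3))
print("R^2:", round(fit.rvalue ** 2, 3))   # R-squared = (correlation coefficient)^2
```

On the real data, the slope plays the role of the 0.9192 coefficient and `fit.rvalue ** 2` the role of the 0.970 R-squared quoted above.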
Null hypothesis: the day of the week does NOT have an effect on the # of charging sessions per day; i.e. there is no variation in the means of the daily groups. Alternative hypothesis: the day of the week does have an effect on the # of charging sessions each day; i.e. there IS variation in the means of the daily groups.
Our F_CV of 1.77 was less than our F_stat of 238.48, so we REJECTED our null hypothesis.
This makes intuitive sense; as we see in our graph, there are clearly fewer vehicles charging on the weekends.
We also reran the ANOVA calculations using only weekdays.
Null hypothesis: the day of the week does NOT have an effect on the # of charging sessions per day; i.e. there is no variation in the means of the daily groups. Alternative hypothesis: the day of the week does have an effect on the # of charging sessions each day; i.e. there IS variation in the means of the daily groups.
Again, our F_CV of 1.94 was less than our F_stat of 6.33, so we REJECTED our null hypothesis. There is variation in the number of cars that come to charge even during weekdays, and yet again, we can see this from our graph.
We also performed an ANOVA test to find the relationship between facility type and charge time.
Our critical F-stat of 2.08 was less than our calculated F-stat of 42.12, so we rejected our null hypothesis, which stated that there is no relationship between the facility type where a charging station is located and the amount of time users spend charging their vehicles.
This means that there is a particular facility type where it would be beneficial for us to set up our charging stations in order to get the most usage out of it. We can get an idea of what facility that is by looking at the following graph:
It looks like facilities categorized as ‘Other’ seem to have the highest average charge times, while Manufacturing facilities have the lowest average charge times. Office and Research and Development facilities have very similar charge time averages.
We performed the Chi-Square Test to see if there is a relationship between the facility type the station is located at and the day of the week that people use the charging stations.
Our critical value of 25.99 was significantly less than our calculated chi-square stat of 297.0, so we rejected our null hypothesis and concluded that there is a relationship between the facility type the station is located at and the day of the week that people use the charging stations. To get a better idea of this relationship, we can look at the following graph:
It looks like research and development facilities have the most usage overall, especially on Thursdays, while Manufacturing and Office facilities have the most usage on Wednesdays. It’s also interesting to note that while manufacturing facilities still have a decent amount of usage on the weekends, the rest of the facilities have close to no usage on weekends. Unless we decide to set up our station at a Manufacturing facility, we should expect traffic to be very low on weekends.
To maximize our profits, we could also set up stations at both Manufacturing and Research and Development facilities so that we get the high volume of users from the research and development facilities on the weekdays, but also still have some traffic coming in on the weekends through the manufacturing facilities.
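The chi-square computation itself is not shown in this notebook; here is a sketch of how it could be done with `scipy.stats.chi2_contingency`, using made-up counts (the real test would build the table with `pd.crosstab(df['facilityType'], df['weekday'])`). Note that the degrees of freedom, (4-1)×(7-1) = 18, match the critical value of 25.99 we used at alpha = 0.1.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table: rows are facility types, columns are
# days Mon..Sun, cells are session counts. These counts are made up,
# but echo the pattern described above (Manufacturing retains some
# weekend traffic; the other facility types drop to near zero).
table = np.array([
    [40,   42,  45,  41,  38, 20, 18],  # Manufacturing
    [60,   65,  80,  62,  55,  2,  1],  # Office
    [150, 160, 155, 190, 140,  3,  2],  # Research and Development
    [30,   28,  25,  27,  26,  5,  4],  # Other
])
chi2, p, dof, expected = chi2_contingency(table)
print("chi-square:", round(chi2, 1), "dof:", dof, "p-value:", p)
```

If the calculated statistic exceeds the critical value for the table's degrees of freedom, we reject independence, exactly the comparison made above.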
The line for our linear regression was a very good fit; the correlation was 0.985. For the ANOVA tests we used critical values from F-tables at alpha = 0.1.
More data would be interesting -- we only had info on 3395 charging sessions. It would also be interesting to have more recent data -- this data is from 2014-2015. We would also like more information about weekend driving habits of EV drivers. Do they just not drive their car on the weekends? Do they use charging stations closer to home?
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("station_data_dataverse.csv")
pd.options.display.max_columns = None
df.head()
df.tail()
df.describe()
# Print the dimension of df
pd.DataFrame([[df.shape[0], df.shape[1]]], columns=['# rows', '# columns'])
#Get a list of all the columns
list(df.columns)
#station ID value counts
df['stationId'].value_counts()
df['userId'].value_counts()
#platform value counts
df['platform'].value_counts()
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
plt.bar(df['platform'].value_counts().index,df['platform'].value_counts())
plt.ylabel("Count")
plt.xlabel("Charge Status Platform")
plt.title("Distribution of Platforms Used to View Charging Status")
# change the color of the top and right spines to opaque gray
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
App for iOS and Android -- wouldn't bother with web
#days of week value counts
df['weekday'].value_counts()
fig, ax = plt.subplots()
plt.bar(df['weekday'].value_counts().index,df['weekday'].value_counts())
plt.ylabel("Count of Charging Stations Used")
plt.xlabel("Day of the Week")
plt.title("Charging Station Usage")
# change the color of the top and right spines to opaque gray
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
If we had to do repairs to a station, Saturday or Sunday would be the best days, since stations are used the least then.
#chargeTimeHrs
fig, ax = plt.subplots(figsize=(4, 4))
data = df['chargeTimeHrs']
plt.hist(data, range(0,11))
plt.ylabel("Frequency Count")
plt.xlabel("Number of Hours")
plt.title("Histogram of Hours of Charge Time")
# change the color of the top and right spines to opaque gray
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
import matplotlib.ticker as plticker
time = []
date = []
for index, value in df.created.items():
    time.append(value.split(' ')[1])
    if '00' in value.split(' ')[0]:
        date.append(value.split(' ')[0].replace("00", "20", 1))
dfc = pd.DataFrame(time,columns = ['time'])
dfc['time'] = pd.to_datetime(dfc['time'],format= '%H:%M:%S' ).dt.time
dfc['chargeTimeHrs'] = df['chargeTimeHrs']
dfc['date'] = date
dfc['time'] = time
fig, ax = plt.subplots(figsize=(8, 6))
dfc = dfc[dfc['chargeTimeHrs']!=55.23805556] #Outlier
# Add x-axis and y-axis
ax.bar(dfc['date'], dfc['chargeTimeHrs'], color='purple')
# Set title and labels for axes
ax.set(xlabel="Date", ylabel="chargeTimeHrs",
       title="Charge Time, 2014-2015")
loc = plticker.MultipleLocator(base=20.0) # this locator puts ticks at regular intervals
ax.xaxis.set_major_locator(loc)
plt.xticks(rotation = 45)
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8,6))
dfc = dfc[dfc['chargeTimeHrs']!=55.23805556] #Outlier
# Add x-axis and y-axis
dfc = dfc.sort_values(by='time')
#fuzzy_charge = dfc['chargeTimeHrs'] + np.random.normal(0,1, size=len(dfc['chargeTimeHrs']))
ax.scatter(dfc['time'], dfc['chargeTimeHrs'], color='purple', alpha=0.2)
# Set title and labels for axes
ax.set(xlabel="Time of Day", ylabel="chargeTimeHrs",
       title="Charging Durations Throughout the Day")
loc = plticker.MultipleLocator(base=200) # this locator puts ticks at regular intervals
ax.xaxis.set_major_locator(loc)
plt.xticks(rotation = 45)
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
data = df.loc[df['userId']== 98345808]
print(data)
plt.hist(data['chargeTimeHrs'], range(0,11))
plt.ylabel("Frequency Count")
plt.xlabel("Number of Hours")
plt.title("Histogram of Hours of Charge Time (Single User)")
# change the color of the top and right spines to opaque gray
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
plt.ylabel("Frequency Count")
plt.xlabel("Days of the Week")
plt.title("Histogram of Charging Days")
# change the color of the top and right spines to opaque gray
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
data['weekday'].value_counts().plot(ax=ax, kind='bar')
plt.show()
data2 = df.replace({'facilityType': {1:'Manufacturing',2:'Office', 3:'Research and Development',
4:'Other'}})
totals = data2['facilityType'].value_counts()
plt.figure(figsize=(7,4))
totals.plot(kind='bar', color = 'C0')
plt.title('Facility Type Totals', fontweight='bold')
plt.ylabel('Count \n')
plt.xticks(rotation=45)
plt.show()
fig, ax = plt.subplots()
plt.hist(df.distance)
plt.ylabel("Count")
plt.xlabel("Distance")
plt.title("Distance from Charging Station")
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
distance = df['distance']
print(distance.head())
print(len(distance))
#drop the nan values
distance = distance.dropna()
print(distance)
import pandas as pd
import scipy.stats as stats
import math
#Using built in Python functions
print("Using built in Python functions")
print("Info about distance:")
my_mean = np.mean(distance)
print("Mean distance:", my_mean)
my_min = np.min(distance)
print("Min distance:", my_min)
my_max = np.max(distance)
print("Max distance:", my_max)
my_variance = np.var(distance)
print("Variance:", my_variance)
my_std = np.std(distance)
print("STD:", my_std)
df['distance'].describe()
#Definition of Z-score
def z_score(x, mean, std):
    z_score = (x - mean) / std
    return z_score
#Z-score for a distance of 10 miles.
z = z_score(10, my_mean, my_std)
print(z)
#-0.7 indicates 0.7 standard deviations below the mean
#Therefore, 10 miles is considered a below average distance to a charging station
#have z score; would like p
import scipy.stats as st
p = st.norm.cdf(z)
print(p)
#given p = .9 , what is the x distance?
#90% of the customers drive <= what distance?
#First we find z; then we find x
a = .9
z = st.norm.ppf(a)
print("The z-score is", z)
x = (my_std * z) + my_mean
print("The value for x is:", x)
print(a*100, "% of customers drive <=", round(x, 2), "miles.")
#given p = .5 , what is the x distance?
#50% of the customers drive <= what distance?
#First we find z; then we find x
a = .5
z = st.norm.ppf(a)
print("The z-score is", z)
x = (my_std * z) + my_mean
print("The value for x is:", x)
print(a*100, "% of customers drive <=", round(x, 2), "miles.")
chargetime = df['chargeTimeHrs']
print(chargetime.head())
print(len(chargetime))
#drop the nan values
chargetime = chargetime.dropna()
print(chargetime)
#We can see that there were no nan values for chargetime because the original length matches the length after we dropped nan values.
import pandas as pd
import numpy as np
import scipy.stats as stats
import math
#Using built in Python functions
print("Using built in Python functions")
print("Info about Charge time:")
my_mean = np.mean(chargetime)
print("Mean charge time:", my_mean)
my_min = np.min(chargetime)
print("Min charge time:", my_min)
my_max = np.max(chargetime)
print("Max charge time:", my_max)
my_variance = np.var(chargetime)
print("Variance:", my_variance)
my_std = np.std(chargetime)
print("STD:", my_std)
#Definition of Z-score
def z_score(x, mean, std):
    z_score = (x - mean) / std
    return z_score
#Z-score for a charge time of 5 hours.
z = z_score(5, my_mean, my_std)
print(z)
#have z score; would like p
import scipy.stats as st
p = st.norm.cdf(z)
print(p)
#given p = .9 , what is the x chargetime?
#90% of the charge times are <= what x?
#First we find z; then we find x
a = .9
z = st.norm.ppf(a)
print("The z-score is", z)
x = (my_std * z) + my_mean
print("The value for x is:", x)
print(a*100, "% of the charge times <=", round(x, 2), "hours.")
#given p = .5 , what is the x chargetime?
#50% of the charge times are <= what x?
#First we find z; then we find x
a = .5
z = st.norm.ppf(a)
print("The z-score is", z)
x = (my_std * z) + my_mean
print("The value for x is:", x)
print(a*100, "% of the charge times <=", round(x, 2), "hours.")
fig, ax = plt.subplots()
plt.hist(df.kwhTotal)
plt.ylabel("Count")
plt.xlabel("Kilowatt-hours")
plt.title("kWh Used During Charging")
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
kwh = df['kwhTotal']
print(kwh.head())
print(len(kwh))
#drop the nan values
kwh = kwh.dropna()
print(kwh)
#We can see that there were no nan values for kwh because the original length matches the length after we dropped nan values.
import pandas as pd
import numpy as np
import scipy.stats as stats
import math
#Using built in Python functions
print("Using built in Python functions")
print("Info about kwh:")
my_mean = np.mean(kwh)
print("Mean kwh:", my_mean)
my_min = np.min(kwh)
print("Min kwh:", my_min)
my_max = np.max(kwh)
print("Max kwh:", my_max)
my_variance = np.var(kwh)
print("Variance:", my_variance)
my_std = np.std(kwh)
print("STD:", my_std)
#Definition of Z-score
def z_score(x, mean, std):
    z_score = (x - mean) / std
    return z_score
#Z-score for 8 kwh.
z = z_score(8, my_mean, my_std)
print(z)
#have z score; would like p
import scipy.stats as st
p = st.norm.cdf(z)
print(p)
#given p = .9 , what is the x kwh?
#90% of the kwhs are < what x?
#First we find z; then we find x
a = .9
z = st.norm.ppf(a)
print("The z-score is", z)
x = (my_std * z) + my_mean
print("The value for x is:", x)
print(a*100, "% of the kwh are <=", round(x, 2), "kwh.")
#given p = .5 , what is the x kwh?
#50% of the kwhs are <= what x?
#First we find z; then we find x
a = .5
z = st.norm.ppf(a)
print("The z-score is", z)
x = (my_std * z) + my_mean
print("The value for x is:", x)
print(a*100, "% of the kwh are <=", round(x, 2), "kwh.")
data = df[['kwhTotal','dollars', 'chargeTimeHrs', 'distance']]
data = data.dropna()
print(data)
data.corr()
#Plotting chargeTimeHrs and dollars
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
plt.scatter(data['chargeTimeHrs'], data['dollars'])
plt.show()
#drop zero values
#first replace zero values with nan
data = data.replace(0, np.nan)
#then drop nan values again
data = data.dropna()
print(data)
data.corr()
#Now we can see that chargeTimeHrs and dollars are very highly correlated
#Plotting chargeTimeHrs and dollars
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
plt.scatter(data['chargeTimeHrs'], data['dollars'])
plt.show()
# There is a linear relationship between charge time hours and dollars.
import matplotlib.pyplot as plt
from scipy import stats
slope,intercept, r, p, std_err =stats.linregress(data['chargeTimeHrs'],data['dollars'])
def myfunc(x):
    return slope * x + intercept
mymodel=list(map(myfunc,data['chargeTimeHrs']))
plt.scatter(data['chargeTimeHrs'], data['dollars'])
plt.plot(data['chargeTimeHrs'], mymodel)
plt.ylabel("Dollars Spent")
plt.xlabel("Hours of Charge Time")
plt.title("Charge Time Hours vs. Cost")
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
H0: The columns are not that different, i.e. no variation in the means of groups
H1: At least one group mean is different from the others.
days = df[['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']]
print(days)
days.describe()
#Find group mean for each column and grand mean
import numpy as np
mean_mon = 0
mean_tues = 0
mean_wed = 0
mean_thurs = 0
mean_fri = 0
mean_sat = 0
mean_sun = 0
for (columnName, columnData) in days.items():
    if columnName == 'Mon':
        mean_mon = np.mean(columnData)
    elif columnName == 'Tues':
        mean_tues = np.mean(columnData)
    elif columnName == 'Wed':
        mean_wed = np.mean(columnData)
    elif columnName == 'Thurs':
        mean_thurs = np.mean(columnData)
    elif columnName == 'Fri':
        mean_fri = np.mean(columnData)
    elif columnName == 'Sat':
        mean_sat = np.mean(columnData)
    elif columnName == 'Sun':
        mean_sun = np.mean(columnData)
print("The means of the seven columns are:")
print(f"Monday: {mean_mon}")
print(f"Tuesday: {mean_tues}")
print(f"Wednesday: {mean_wed}")
print(f"Thursday: {mean_thurs}")
print(f"Friday: {mean_fri}")
print(f"Saturday: {mean_sat}")
print(f"Sunday: {mean_sun}")
print('\n')
all_days_values = days.values.flatten()
grand_mean = np.mean(all_days_values)
print(f"The grand mean is: {grand_mean}")
sst = 0
for value in all_days_values:
    sst = sst + (value - grand_mean)**2
print(sst)
ssw = 0
for (columnName, columnData) in days.items():
    if columnName == 'Mon':
        ssw = ssw + sum((columnData - mean_mon)**2)
    elif columnName == 'Tues':
        ssw = ssw + sum((columnData - mean_tues)**2)
    elif columnName == 'Wed':
        ssw = ssw + sum((columnData - mean_wed)**2)
    elif columnName == 'Thurs':
        ssw = ssw + sum((columnData - mean_thurs)**2)
    elif columnName == 'Fri':
        ssw = ssw + sum((columnData - mean_fri)**2)
    elif columnName == 'Sat':
        ssw = ssw + sum((columnData - mean_sat)**2)
    elif columnName == 'Sun':
        ssw = ssw + sum((columnData - mean_sun)**2)
print(ssw)
ssb = 0
num_of_values = len(days['Mon'])
mean_list = [mean_mon,mean_tues,mean_wed,mean_thurs,mean_fri,mean_sat,mean_sun]
for mean in mean_list:
    ssb = ssb + num_of_values * (mean - grand_mean)**2
print(ssb)
#Checking work
if round(sst) == round(ssw) + round(ssb):
    print("Your numbers make sense!")
else:
    print("Try again")
#Degrees of Freedom
m = 7
n = 3395
sst_df = (m * n) - 1
ssw_df = m * (n-1)
ssb_df = m-1
print(sst_df, ssw_df, ssb_df)
#Checking work
if sst_df == ssw_df + ssb_df:
    print("Your numbers make sense!")
else:
    print("Try again")
#Calculating msb
msb = ssb/ssb_df
print(msb)
#Calculating mse
mse = ssw/ssw_df
print(mse)
F_stat = msb/mse
print("F-stat:", F_stat)
#Checking results with pre-built function
results = stats.f_oneway(days['Mon'], days['Tues'], days['Wed'], days['Thurs'], days['Fri'], days['Sat'], days['Sun'])
print(results)
#From F Table for alpha = 0.1
ssw_df = 23758
ssb_df = 6
F_CV = 1.7741
#reject hypothesis if:
#F_CV < F_stat
F_CV < F_stat
We reject the null hypothesis and conclude that the day of the week influences how many electric vehicles are charged.
datap = df[df['weekday']!='Sun']
datao = datap[datap['weekday']!='Sat']
dat1 = datao.groupby(["weekday"],as_index=False)["chargeTimeHrs"].sum()
dat1
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(7,5))
dat1.plot(kind='bar', color = 'C0')
plt.title('Total Charging Duration on Weekdays \n')
plt.xlabel('Day of the Week')
plt.ylabel('Total Charging Duration (Hours)')
locs, labels = plt.xticks()
plt.xticks(np.arange(5), ('Fri','Mon','Thurs','Tues','Wed'))
plt.xticks(rotation=360)
plt.show()
weekdays = df[['Mon', 'Tues', 'Wed', 'Thurs', 'Fri']]
print(weekdays)
weekdays.describe()
#Find group mean for each column and grand mean
import numpy as np
wk_mean_mon = 0
wk_mean_tues = 0
wk_mean_wed = 0
wk_mean_thurs = 0
wk_mean_fri = 0
for (columnName, columnData) in weekdays.items():
    if columnName == 'Mon':
        wk_mean_mon = np.mean(columnData)
    elif columnName == 'Tues':
        wk_mean_tues = np.mean(columnData)
    elif columnName == 'Wed':
        wk_mean_wed = np.mean(columnData)
    elif columnName == 'Thurs':
        wk_mean_thurs = np.mean(columnData)
    elif columnName == 'Fri':
        wk_mean_fri = np.mean(columnData)
print("The means of the five columns are:")
print(f"Monday: {wk_mean_mon}")
print(f"Tuesday: {wk_mean_tues}")
print(f"Wednesday: {wk_mean_wed}")
print(f"Thursday: {wk_mean_thurs}")
print(f"Friday: {wk_mean_fri}")
print('\n')
wk_days_values = weekdays.values.flatten()
wk_grand_mean = np.mean(wk_days_values)
print(f"The grand mean is: {wk_grand_mean}")
wk_sst = 0
for value in wk_days_values:
    wk_sst = wk_sst + (value - wk_grand_mean)**2
print(wk_sst)
wk_ssw = 0
for (columnName, columnData) in weekdays.items():
    if columnName == 'Mon':
        wk_ssw = wk_ssw + sum((columnData - wk_mean_mon)**2)
    elif columnName == 'Tues':
        wk_ssw = wk_ssw + sum((columnData - wk_mean_tues)**2)
    elif columnName == 'Wed':
        wk_ssw = wk_ssw + sum((columnData - wk_mean_wed)**2)
    elif columnName == 'Thurs':
        wk_ssw = wk_ssw + sum((columnData - wk_mean_thurs)**2)
    elif columnName == 'Fri':
        wk_ssw = wk_ssw + sum((columnData - wk_mean_fri)**2)
print(wk_ssw)
wk_ssb = 0
wk_num_of_values = len(weekdays['Mon'])
wk_mean_list = [wk_mean_mon, wk_mean_tues, wk_mean_wed, wk_mean_thurs, wk_mean_fri]
for mean in wk_mean_list:
    wk_ssb = wk_ssb + wk_num_of_values * (mean - wk_grand_mean)**2
print(wk_ssb)
#Checking work
if round(wk_sst) == round(wk_ssw) + round(wk_ssb):
    print("Your numbers make sense!")
else:
    print("Try again")
#Degrees of Freedom
wk_m = 5
wk_n = 3395
wk_sst_df = (wk_m * wk_n) - 1
wk_ssw_df = wk_m * (wk_n-1)
wk_ssb_df = wk_m-1
print(wk_sst_df, wk_ssw_df, wk_ssb_df)
#Checking work
if wk_sst_df == wk_ssw_df + wk_ssb_df:
    print("Your numbers make sense!")
else:
    print("Try again")
#Calculating msb
wk_msb = wk_ssb/wk_ssb_df
print(wk_msb)
#Calculating mse
wk_mse = wk_ssw/wk_ssw_df
print(wk_mse)
wk_F_stat = wk_msb/wk_mse
print("F-stat:", wk_F_stat)
#Checking work with pre-built function
results = stats.f_oneway(weekdays['Mon'], weekdays['Tues'], weekdays['Wed'], weekdays['Thurs'], weekdays['Fri'])
print(results)
#From F Table for alpha = 0.1
wk_ssw_df = 16970
wk_ssb_df = 4
wk_F_CV = 1.9449
#reject the null hypothesis if F_CV < F_stat
print(wk_F_CV < wk_F_stat)
We reject the null hypothesis and conclude that at least one group mean is statistically significantly different from the others. In context, there is at least one day from Monday to Friday on which significantly more or significantly fewer people charge their electric vehicles.
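The manual sum-of-squares decomposition used above (SST = SSW + SSB) can be sanity-checked against `scipy.stats.f_oneway`. A minimal sketch on small synthetic "per-day" samples (the data here are assumed, not from the dataset), using the conventional within-group degrees of freedom N − m:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for five weekday columns (assumed data)
rng = np.random.default_rng(0)
groups = [rng.normal(loc, 1.0, size=30) for loc in (5.0, 5.2, 4.8, 5.1, 6.0)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, within-group, and between-group sums of squares
sst = ((all_values - grand_mean) ** 2).sum()
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
assert np.isclose(sst, ssw + ssb)  # the decomposition identity

m = len(groups)            # number of groups
n_total = len(all_values)  # total observations
msb = ssb / (m - 1)        # between-group mean square
mse = ssw / (n_total - m)  # within-group mean square
f_manual = msb / mse

# Cross-check against scipy's one-way ANOVA
f_scipy, p_value = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_value)
```

The manual F-statistic matches scipy's exactly, which is the same cross-check the notebook performs with `stats.f_oneway` on the real weekday columns.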
Null Hypothesis = The facility type where a charging station is located will NOT have an impact on charge time of users.
Alternate Hypothesis = The facility type where a charging station is located will have an impact on charge time of users.
#creating df for charge time and facility type
df_fac = data2.loc[:,['chargeTimeHrs','facilityType']]
df_fac.head()
#pulling out information based on facility type
a = df_fac.loc[df_fac['facilityType'] == 'Office']
b = df_fac.loc[df_fac['facilityType'] == 'Manufacturing']
c = df_fac.loc[df_fac['facilityType'] == 'Research and Development']
d = df_fac.loc[df_fac['facilityType'] == 'Other']
#using built in stats function to calculate the f-statistic using charge time hrs by facility type
#each 'column' would contain the charge time hours by facility type
results = stats.f_oneway(a['chargeTimeHrs'], b['chargeTimeHrs'], c['chargeTimeHrs'], d['chargeTimeHrs'])
print(results) #gives us F-Stat Calculated
#finding degrees of freedom
m = 4
n = 3395
df_ssb = m-1
df_ssw = m*(n-1)
df_ssb, df_ssw
#Using table for alpha = .1 get F-CV
F_CV = 2.0838
F_Stat = 42.12424294508296
if F_CV < F_Stat:
    print('Reject Null Hypothesis')
else:
    print('Fail to Reject Null Hypothesis')
Since the F-critical value is less than our calculated F-statistic, we REJECT our null hypothesis, meaning that facility type does have an impact on how long people spend charging their cars. This can help us decide which type of facility to set up our charging station at to get the highest usage.
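Instead of reading F-critical values from a printed table, scipy can compute them directly with `stats.f.ppf`. A small sketch using the group count from the facility-type test (and the conventional N − m within-group degrees of freedom, which differs slightly from the `m*(n-1)` convention used above but gives essentially the same critical value for large samples):

```python
from scipy import stats

alpha = 0.1
m = 4              # number of facility types
n_total = 3395     # total charging sessions
df_between = m - 1
df_within = n_total - m

# Inverse CDF of the F distribution at 1 - alpha gives the critical value
f_cv = stats.f.ppf(1 - alpha, df_between, df_within)
print(round(f_cv, 4))
```

For these degrees of freedom the result is about 2.08, matching the table value of 2.0838 used above.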
facility_charge_avg = df_fac.groupby(["facilityType"])["chargeTimeHrs"].mean()
plt.figure(figsize=(7,4))
facility_charge_avg.plot(kind='bar', color = 'C0')
plt.title('Average Charging Duration at Different Facilities', fontweight='bold')
plt.xlabel('Facility Type')
plt.ylabel('Avg Charge Times Hours')
plt.xticks(rotation=45)
plt.show()
Null hypothesis = The facility type the charging station is located at and the day of the week people use the charging station are NOT related.
Alternate Hypothesis = The facility type the charging station is located at and the day of the week people use the charging station are related.
#creating dataframe containing weekday and facility type
df_chi = data2.loc[:, ['facilityType', 'weekday']]
df_chi.head()
#gives me a series of the amount of charging station uses by facility type and weekday
observed = df_chi.groupby(['facilityType', 'weekday']).size()
observed
#making a list out of the values in the observed series
observed_list = observed.values.tolist()
#adding 0's for facility/day combinations with no usage: Office on Sunday, Other on Saturday and Sunday (3 zeros total)
observed_list.insert(10, 0)
observed_list.insert(16, 0)
observed_list.insert(17, 0)
print(observed_list)
#expected = row total * column total)/grand total (formula from in class assignment 5)
column_totals = observed.groupby(['facilityType']).sum()
row_totals = observed.groupby(['weekday']).sum()
grand_total = 3395
column_totals
row_totals
#finding expected values by facility type
manufacturing_val = list(((row_totals)*593)/grand_total)
office_val = list(((row_totals)*862)/grand_total)
other_val = list(((row_totals)*108)/grand_total)
r_val = list(((row_totals)*1832)/grand_total)
#joining all 4 lists together to get all the expected values
expected_values = manufacturing_val + office_val + other_val + r_val
print(expected_values)
#creating dataframe with observed and expected lists
values = {'observed': observed_list, 'expected': expected_values}
df_final = pd.DataFrame(data=values)
#chi square stat = sum over all cells of (observed-expected)^2/expected
chi_squared_stat = (((df_final['observed'] - df_final['expected'])**2)/df_final['expected']).sum()
print(round(chi_squared_stat))
#df = (#rows-1)*(#columns-1); alpha = .1
#using built in function to get critical value
crit = stats.chi2.ppf(q = 0.90, df = 18)
print(round(crit, 2))
Since our critical value (25.99) is significantly less than our calculated chi-square statistic (297.0), we can REJECT our null hypothesis and conclude that there is a relationship between the facility type the station is located at and the day of the week that people use the charging stations.
This helps us understand what user patterns we should expect depending on where we set up our station.
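The manual expected-count construction above, including the hand-inserted zeros, can be cross-checked with `pd.crosstab` and `scipy.stats.chi2_contingency`, which fill in absent facility/day combinations automatically. A sketch on a small synthetic sample (the rows here are assumed stand-ins; the column names mirror the dataset's `facilityType` and `weekday`):

```python
import pandas as pd
from scipy import stats

# Assumed toy data standing in for df_chi
df_chi = pd.DataFrame({
    'facilityType': ['Office', 'Office', 'Manufacturing', 'Manufacturing',
                     'Research and Development', 'Other', 'Office'],
    'weekday': ['Mon', 'Tue', 'Mon', 'Wed', 'Mon', 'Fri', 'Mon'],
})

# crosstab fills absent combinations with 0, so no manual zero-inserts are needed
contingency = pd.crosstab(df_chi['facilityType'], df_chi['weekday'])

# chi2_contingency computes expected counts, the statistic, and df in one call
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(chi2, p, dof)
```

With 4 facility types and 4 observed days, `dof` comes out to (4−1)·(4−1) = 9, the same (rows−1)·(columns−1) formula applied manually above.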
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
import plotly.graph_objs as go
#turning observed series into a dataframe
observed_df = pd.DataFrame(data=observed).reset_index()
#adding rows for 0 values - if I don't do this, the graph will mess up the order of the values
observed_df.loc[9.5] = 'Office', 'Sun', 0
observed_df_new = observed_df.sort_index().reset_index(drop=True)
observed_df_new.loc[15.5] = 'Other', 'Sat', 0
observed_df_new = observed_df_new.sort_index().reset_index(drop=True)
observed_df_new.loc[16.5] = 'Other', 'Sun', 0
observed_df_new = observed_df_new.sort_index().reset_index(drop=True)
#making lists for graph
f_types = ['Manufacturing', 'Office','Other', 'Research and Development']
weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
#looping through weekdays so each bar represents the number of uses on each day
fig = go.Figure()
for weekday in weekdays:
    fig.add_trace(go.Bar(
        x=f_types,
        y=observed_df_new[0].loc[observed_df_new['weekday'] == weekday],
        name=weekday
    ))
fig.update_layout(barmode='group', xaxis_tickangle=-45, title_text='Usage by Facility Type and Day of the Week',
                  height=500, width=700)
fig.show()
#Importing Libraries
import seaborn as sns
from statsmodels.formula.api import glm
from statsmodels.formula.api import ols
import statsmodels.api as sm
data1 = df[['kwhTotal','dollars', 'chargeTimeHrs', 'distance','weekday','platform','facilityType']]
data1 = data1.dropna()
data1 = data1.replace(0, np.nan)  #treat zero readings as missing values
#then drop nan values again
data1 = data1.dropna()
print(data1)
modelDollarsDistance = ols(formula='dollars ~ distance', data=data1)
modelDollarsDistanceFit = modelDollarsDistance.fit()
modelpredictions = pd.DataFrame({'actual_dollars': data1['dollars']})
modelpredictions['actual_distance'] = data1['distance']
print( modelDollarsDistanceFit.summary() )
modelDollarsWeekday = ols(formula='dollars ~ C(weekday)', data=data1)
modelDollarsWeekdayFit = modelDollarsWeekday.fit()
modelpredictions['dollars_weekday'] = modelDollarsWeekdayFit.predict(data1)
modelpredictions['actual_weekday'] = data1['weekday']
print( modelDollarsWeekdayFit.summary() )
modelDollarsFacility = ols(formula='dollars ~ C(facilityType)', data=data1)
modelDollarsFacilityFit = modelDollarsFacility.fit()
#modelpredictions['dollars_facility'] = modelDollarsFacilityFit.predict(data1)
#modelpredictions['actual_facilityType'] = data1['facilityType']
modelpredictions['actual_dollars'] = data1['dollars']
print( modelDollarsFacilityFit.summary() )
modelDollarsChargeTime = ols(formula='dollars ~ chargeTimeHrs', data=data1)
modelDollarsChargeTimeFit = modelDollarsChargeTime.fit()
modelpredictions['dollars_chargeTimeHrs'] = modelDollarsChargeTimeFit.predict(data1)
modelpredictions['actual_chargeTimeHrs'] = data1['chargeTimeHrs']
print( modelDollarsChargeTimeFit.summary() )
modelDollarsKWH = ols(formula='dollars ~ kwhTotal', data=data1)
modelDollarsKWHFit = modelDollarsKWH.fit()
#modelpredictions['dollars_kwhTotal'] = modelDollarsKWHFit.predict(data1)
#modelpredictions['actual_kwhTotal'] = data1['kwhTotal']
print( modelDollarsKWHFit.summary() )
modelDollarsPlatform = ols(formula='dollars ~ C(platform)', data=data1)
modelDollarsPlatformFit = modelDollarsPlatform.fit()
#modelpredictions['dollars_platform'] = modelDollarsPlatformFit.predict(data1)
#modelpredictions['actual_platform'] = data1['platform']
print( modelDollarsPlatformFit.summary() )
print( modelpredictions)
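Since this section fits several single-predictor models of `dollars` and reads each summary separately, a short loop can rank the predictors by R² in one pass. A sketch, using a small synthetic frame in place of `data1` (the column names match the dataset; the values are assumed):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumed toy frame standing in for data1; swap in the real data1 to reproduce
rng = np.random.default_rng(1)
data1 = pd.DataFrame({
    'dollars': rng.uniform(0, 5, 100),
    'kwhTotal': rng.uniform(0, 20, 100),
    'chargeTimeHrs': rng.uniform(0, 8, 100),
    'distance': rng.uniform(0, 40, 100),
    'weekday': rng.choice(['Mon', 'Tue', 'Wed'], 100),
})

# Fit each single-predictor model and collect its R-squared
formulas = ['dollars ~ distance', 'dollars ~ chargeTimeHrs',
            'dollars ~ kwhTotal', 'dollars ~ C(weekday)']
r2 = {f: smf.ols(formula=f, data=data1).fit().rsquared for f in formulas}

# Print predictors from strongest to weakest
for f, r in sorted(r2.items(), key=lambda kv: -kv[1]):
    print(f"{f:30s} R^2 = {r:.3f}")
```

On the real `data1`, this makes it immediately clear which single variable explains the most variance in revenue per session.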