Over the next 20 years, electric vehicles (EVs) are predicted to make up at least half of all passenger car sales. We found a dataset containing detailed information about EV charging sessions, and decided that its most beneficial use would be from a business perspective: informing the design and placement of additional EV charging stations.
We want to use this dataset to make recommendations to someone who is planning to build a new electric vehicle charging station. What insights can we give them to make it successful?
This dataset contains 3,395 rows that represent sessions from 85 Electric Vehicle drivers with repeat usage at 105 charging stations across 25 sites at a workplace charging program.
The dataset had many valuable variables, such as "Charging Duration" and "KwH Total Usage", which we used frequently as a basis for our analysis. On the other hand, there were columns such as reportedZip that only indicated whether or not a ZIP code was reported, rather than the ZIP code itself; the actual ZIP codes would have allowed us to map charging station usage by area. Other columns, "station id" and "location id", could have told us valuable information about the actual charging stations, but did not provide any reference for the ID values for us to use.
Finally, this dataset covers January 2014 to October 2015. More recent data would have made our findings more relevant to the current state of electric vehicles, given the newer models and large brands such as Tesla now in the market. However, the analysis we did complete gives a useful picture of the early state of electric vehicles.
The data was collected through the U.S. Department of Energy (DOE) Workplace Charging Challenge. All the drivers represented in this dataset are part of the DOE, and charging data was collected at workplace locations such as research and innovation centers, manufacturing and testing facilities, and office headquarters.
The creators of this dataset had completed a project titled "A Field Experiment on Workplace Norms and Electric Vehicle Charging Etiquette", which we did not consult. We found the dataset on the author's GitHub account, already in CSV format, and used the raw data for our own analysis. The data contained several NA values and columns stored as text, which we converted to numeric types to better suit our analysis. There was no statistical or exploratory data analysis that we used as a basis: after finding the dataset, we preprocessed and cleaned the raw data ourselves and came to our own conclusions without referencing the original authors' code or visualizations.
It would be beneficial if we had the specific locations of the charging stations rather than just an arbitrary number as a location ID, and the actual ZIP codes that users report rather than a 1 for reported and 0 for not reported. Both details would help us determine which part of the US has the most charging station usage, giving us an idea of where the best place to get started would be. Additionally, more specific information about electric vehicle sales would also help us decide where stations should be built.
According to the graph, most EV owners use iOS as their primary method of checking their charging status, while the web is clearly not preferred.
Since most of the charging stations in this dataset are part of the DOE's Workplace Charging Challenge, most of the charging is done during the week while EV owners are at work. Later, we will look at the relationship between the types of facilities where charging stations are located and the day of the week EV owners choose to charge their vehicles.
From the bar chart above, it's apparent that a majority of charging times last between 2-4 hours.
This time series shows EV owners' charging habits from November 2014 to July 2015. The frequency of charging seems to increase from May 2015 to the end of the series. This might reflect an influx of people ordering new electric vehicles (the Tesla Model X launched later that year), or people having more time to charge their vehicles during the summer.
The distribution of charge durations throughout the day can be grouped into clusters. The first, between 12am and 9am, is likely people leaving their EVs to charge overnight; the second covers business hours (9am-5pm), when EV owners are at work; and the third is after work, when people charge their cars at home.
By looking at a single user's charging habits, we get a good idea of realistic charging durations. In this case, the user (who had the highest number of charges) follows the general population of this dataset in that their charge time is frequently around 2 hours long.
There are several possible conclusions we can draw from this histogram of facility types. Charging stations are heavily concentrated at Research and Development facilities; therefore, more people charge their EVs there. It is also possible that a majority of Workplace Charging Challenge participants who own an electric vehicle work at R&D facilities.
We wouldn’t really say that our question changed after EDA, but we were definitely able to find out some key points that would help us in answering our overall question. For example, we could tell what platform was most popular for use in viewing charging status. We were able to rule out creating a web platform right away since there were so few users compared to Android and iOS. We were also able to find out some information about how long users typically charge their cars for and what days of the week are the least busy - making them the best days to schedule repairs and maintenance. This all gave us an initial expectation for charging station usage trends, but we still had a lot of information we wanted to find out through other tests and analysis.
We first selected the Z-test because we wanted to see what percentage of users fell within a certain range. We decided to use the Z-test to look at: 1) the distance people drove to reach a station; 2) how long it takes to charge a car; and 3) how many kWh were used in a typical charging session.
We also expected a linear relationship between charge time and cost, i.e. a longer charge time would cost more. So we decided to use both a linear and a logistic regression model to examine this relationship.
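The logistic regression itself is not coded up in this notebook, so here is a minimal sketch of one way it could be framed, assuming we binarize the outcome (did the session cost anything?) and regress it on charge duration. The data below is synthetic stand-in data, not the actual dataset; the real version would use `df['chargeTimeHrs']` and `df['dollars'] > 0`. The fit is done by maximizing the log-likelihood with scipy, which we already use elsewhere.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-in data: charge durations and a paid/free indicator
# that depends (noisily) on duration, roughly mimicking the dataset.
rng = np.random.default_rng(0)
hours = rng.uniform(0.5, 8.0, size=300)                        # stand-in for chargeTimeHrs
paid = (hours + rng.normal(0, 1.0, 300) > 3.0).astype(float)   # stand-in for dollars > 0

def neg_log_lik(beta):
    # Negative log-likelihood of the logistic model P(paid) = sigmoid(b0 + b1*hours)
    b0, b1 = beta
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * hours)))
    eps = 1e-9  # guard against log(0)
    return -np.sum(paid * np.log(p + eps) + (1 - paid) * np.log(1 - p + eps))

b0, b1 = minimize(neg_log_lik, x0=[0.0, 0.0]).x
p6 = 1.0 / (1.0 + np.exp(-(b0 + b1 * 6.0)))   # fitted P(paid) for a 6-hour session
print(round(p6, 3))
```

With a positive fitted slope, longer sessions map to a higher probability of being a paid session, which is the qualitative relationship we expected.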
In addition, we decided to use ANOVA to see: 1) if there was a variation in the number of charge sessions related to the day of the week; 2) if there was a significant difference in the means of the charge times at each of the four facilities. Since we had two or more groups in both of these examples, ANOVA seemed like the best way to figure out the statistical differences.
Finally, we used the Chi-Square test to see if the day of the week and facility type have any relationship to charging. We chose the Chi-Square test here because we had two nominal variables - facility type and day of the week - and we wanted to see whether the proportions of one variable differ across values of the other. Figuring out which day of the week drives the most traffic at each facility will help us determine which facilities, on which days, will be the most profitable for us.
We found that 90% of people drove <= 33.29 miles; 50% drove <= 18.65 miles. The distance a customer would drive to reach a charging station would be useful to keep in mind when constructing a new charging station. We also found that 22.4% of people drove 10 miles or less to get to the charging station. Just looking at our graph, we can see that these statistics appear correct.
We found that 90% of charge times were <= 4.77 hours; 50% were <= 2.84 hours. Again, looking at our chart, this makes intuitive sense. This seems to suggest that, on average, 2-3 people could charge their cars at one station per workday.
We found that 90% of sessions used <= 9.52 kWh; 50% of sessions used <= 5.81 kWh. It would be very useful when setting up a business model to know how much electricity each customer is likely to use.
Not surprisingly, we found a linear relationship between charge time and dollars spent.
We created a linear regression model that predicts the value of the continuous, dependent variable "dollars" based on the other independent variables in the dataset. Of all the columns, "chargeTimeHrs", which represents charging duration, had the greatest influence on the "dollars" variable, which makes sense: the longer one charges their car, the more they pay for electricity. The R-squared value for the influence of "chargeTimeHrs" on "dollars" is 0.970. R-squared is the percentage of the dependent variable's variation that the linear model explains, so in this case chargeTimeHrs explains 97% of the variation in the dollars column. The coefficient of 0.9192 indicates that as the independent variable (chargeTimeHrs) increases, the values in the dollars column also increase. So, we can conclude that there is a high correlation between charging duration and dollars spent; this is similar to paying for gas for an ICE (Internal Combustion Engine) vehicle.
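For the single-predictor version of this model, the slope and R-squared can be recovered with `scipy.stats.linregress`, the same function our plotting code uses. A minimal sketch on synthetic stand-in data (not the actual dataset) showing how R-squared is just the squared correlation coefficient:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data with a roughly linear dollars-vs-hours
# relationship, similar in shape to what the report describes.
rng = np.random.default_rng(1)
hours = rng.uniform(0.5, 8.0, size=500)            # stand-in for df['chargeTimeHrs']
dollars = 0.92 * hours + rng.normal(0, 0.3, 500)   # stand-in for df['dollars']

fit = stats.linregress(hours, dollars)
print("slope:", round(fit.slope, 3))
print("R^2:", round(fit.rvalue ** 2, 3))   # R-squared = (correlation coefficient)^2
```

On the real data, the slope plays the role of the 0.9192 coefficient and `fit.rvalue ** 2` the role of the 0.970 R-squared quoted above.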
Null hypothesis: the day of the week does NOT have an effect on the # of charging sessions per day; i.e. there is no variation in the means of the daily groups. Alternative hypothesis: the day of the week does have an effect on the # of charging sessions each day; i.e. there IS variation in the means of the daily groups.
Our F_CV of 1.77 was less than our F_stat of 238.48, so we REJECTED our null hypothesis.
This makes intuitive sense; as we see in our graph, there are clearly fewer vehicles charging on the weekends.
We also reran the ANOVA calculations using only weekdays.
Null hypothesis: the day of the week does NOT have an effect on the # of charging sessions per day; i.e. there is no variation in the means of the daily groups. Alternative hypothesis: the day of the week does have an effect on the # of charging sessions each day; i.e. there IS variation in the means of the daily groups.
Again, our F_CV of 1.94 was less than our F_stat of 6.33, so we REJECTED our null hypothesis. There is variation in the number of cars that come to charge even during weekdays, and yet again, we can see this from our graph.
We also performed an ANOVA test to find the relationship between facility type and charge time.
Our critical F-stat of 2.08 was less than our calculated F-stat of 42.12, so we rejected our null hypothesis, which stated that there is no relationship between the facility type where a charging station is located and the amount of time users spend charging their vehicles.
This means that there is a particular facility type where it would be beneficial for us to set up our charging stations in order to get the most usage out of it. We can get an idea of what facility that is by looking at the following graph:
It looks like facilities categorized as ‘Other’ seem to have the highest average charge times, while Manufacturing facilities have the lowest average charge times. Office and Research and Development facilities have very similar charge time averages.
We performed the Chi-Square Test to see if there is a relationship between the facility type the station is located at and the day of the week that people use the charging stations.
Our critical value of 25.99 was significantly less than our calculated chi-square stat of 297.0, so we rejected our null hypothesis and concluded that there is a relationship between the facility type the station is located at and the day of the week that people use the charging stations. To get a better idea of this relationship, we can look at the following graph:
It looks like research and development facilities have the most usage overall, especially on Thursdays, while Manufacturing and Office facilities have the most usage on Wednesdays. It’s also interesting to note that while manufacturing facilities still have a decent amount of usage on the weekends, the rest of the facilities have close to no usage on weekends. Unless we decide to set up our station at a Manufacturing facility, we should expect traffic to be very low on weekends.
To maximize our profits, we could also set up stations at both Manufacturing and Research and Development facilities so that we get the high volume of users from the research and development facilities on the weekdays, but also still have some traffic coming in on the weekends through the manufacturing facilities.
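The chi-square computation itself is not shown in this notebook; here is a sketch of how it could be done with `scipy.stats.chi2_contingency`, using made-up counts (the real test would build the table with `pd.crosstab(df['facilityType'], df['weekday'])`). Note that the degrees of freedom, (4-1)×(7-1) = 18, match the critical value of 25.99 we used at alpha = 0.1.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table: rows are facility types, columns are
# days Mon..Sun, cells are session counts. These counts are made up,
# but echo the pattern described above (Manufacturing retains some
# weekend traffic; the other facility types drop to near zero).
table = np.array([
    [40,   42,  45,  41,  38, 20, 18],  # Manufacturing
    [60,   65,  80,  62,  55,  2,  1],  # Office
    [150, 160, 155, 190, 140,  3,  2],  # Research and Development
    [30,   28,  25,  27,  26,  5,  4],  # Other
])
chi2, p, dof, expected = chi2_contingency(table)
print("chi-square:", round(chi2, 1), "dof:", dof, "p-value:", p)
```

If the calculated statistic exceeds the critical value for the table's degrees of freedom, we reject independence, exactly the comparison made above.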
The line for our linear regression was a very good fit; the correlation was 0.985. For the ANOVA tests we used critical values from F-tables at alpha = 0.1.
More data would be interesting -- we only had info on 3395 charging sessions. It would also be interesting to have more recent data -- this data is from 2014-2015. We would also like more information about weekend driving habits of EV drivers. Do they just not drive their car on the weekends? Do they use charging stations closer to home?
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("station_data_dataverse.csv")
pd.options.display.max_columns = None
df.head()
df.tail()
df.describe()
# Print the dimension of df
pd.DataFrame([[df.shape[0], df.shape[1]]], columns=['# rows', '# columns'])
#Get a list of all the columns
list(df.columns)
#station ID value counts
df['stationId'].value_counts()
df['userId'].value_counts()
#platform value counts
df['platform'].value_counts()
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
plt.bar(df['platform'].value_counts().index,df['platform'].value_counts())
plt.ylabel("Count")
plt.xlabel("Charge Status Platform")
plt.title("Distribution of Platforms Used to View Charging Status")
# change the color of the top and right spines to opaque gray
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
App for iOS and Android -- wouldn't bother with web
#days of week value counts
df['weekday'].value_counts()
fig, ax = plt.subplots()
plt.bar(df['weekday'].value_counts().index,df['weekday'].value_counts())
plt.ylabel("Count of Charging Stations Used")
plt.xlabel("Day of the Week")
plt.title("Charging Station Usage")
# change the color of the top and right spines to opaque gray
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
If we had to do repairs to a station, Saturday or Sunday would be the best days, since stations are used the least then.
#chargeTimeHrs
fig, ax = plt.subplots(figsize=(4, 4))
data = df['chargeTimeHrs']
plt.hist(data, range(0,11))
plt.ylabel("Frequency Count")
plt.xlabel("Number of Hours")
plt.title("Histogram of Hours of Charge Time")
# change the color of the top and right spines to opaque gray
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
import matplotlib.ticker as plticker
time = []
date = []
for index, value in df.created.items():
    time.append(value.split(' ')[1])
    if '00' in value.split(' ')[0]:
        date.append(value.split(' ')[0].replace("00", "20", 1))
dfc = pd.DataFrame(time,columns = ['time'])
dfc['time'] = pd.to_datetime(dfc['time'],format= '%H:%M:%S' ).dt.time
dfc['chargeTimeHrs'] = df['chargeTimeHrs']
dfc['date'] = date
dfc['time'] = time
fig, ax = plt.subplots(figsize=(8, 6))
dfc = dfc[dfc['chargeTimeHrs']!=55.23805556] #Outlier
# Add x-axis and y-axis
ax.bar(dfc['date'], dfc['chargeTimeHrs'], color='purple')
# Set title and labels for axes
ax.set(xlabel="Date", ylabel="chargeTimeHrs",
       title="Charge Time, 2014-2015")
loc = plticker.MultipleLocator(base=20.0) # this locator puts ticks at regular intervals
ax.xaxis.set_major_locator(loc)
plt.xticks(rotation = 45)
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8,6))
dfc = dfc[dfc['chargeTimeHrs']!=55.23805556] #Outlier
# Add x-axis and y-axis
dfc = dfc.sort_values(by='time')
#fuzzy_charge = dfc['chargeTimeHrs'] + np.random.normal(0,1, size=len(dfc['chargeTimeHrs']))
ax.scatter(dfc['time'], dfc['chargeTimeHrs'], color='purple', alpha=0.2)
# Set title and labels for axes
ax.set(xlabel="Time of Day", ylabel="chargeTimeHrs",
       title="Charging Durations Throughout the Day")
loc = plticker.MultipleLocator(base=200) # this locator puts ticks at regular intervals
ax.xaxis.set_major_locator(loc)
plt.xticks(rotation = 45)
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
data = df.loc[df['userId']== 98345808]
print(data)
plt.hist(data['chargeTimeHrs'], range(0,11))
plt.ylabel("Frequency Count")
plt.xlabel("Number of Hours")
plt.title("Histogram of Hours of Charge Time (Single User)")
# change the color of the top and right spines to opaque gray
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
plt.ylabel("Frequency Count")
plt.xlabel("Days of the Week")
plt.title("Histogram of Charging Days")
# change the color of the top and right spines to opaque gray
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
data['weekday'].value_counts().plot(ax=ax, kind='bar')
plt.show()
data2 = df.replace({'facilityType': {1:'Manufacturing',2:'Office', 3:'Research and Development',
4:'Other'}})
totals = data2['facilityType'].value_counts()
plt.figure(figsize=(7,4))
totals.plot(kind='bar', color = 'C0')
plt.title('Facility Type Totals', fontweight='bold')
plt.ylabel('Count \n')
plt.xticks(rotation=45)
plt.show()
fig, ax = plt.subplots()
plt.hist(df.distance)
plt.ylabel("Count")
plt.xlabel("Distance")
plt.title("Distance from Charging Station")
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
distance = df['distance']
print(distance.head())
print(len(distance))
#drop the nan values
distance = distance.dropna()
print(distance)
import pandas as pd
import scipy.stats as stats
import math
#Using built in Python functions
print("Using built in Python functions")
print("Info about distance:")
my_mean = np.mean(distance)
print("Mean distance:", my_mean)
my_min = np.min(distance)
print("Min distance:", my_min)
my_max = np.max(distance)
print("Max distance:", my_max)
my_variance = np.var(distance)
print("Variance:", my_variance)
my_std = np.std(distance)
print("STD:", my_std)
df['distance'].describe()
#Definition of Z-score
def z_score(x, mean, std):
    z_score = (x - mean) / std
    return z_score
#Z-score for a distance of 10 miles.
z = z_score(10, my_mean, my_std)
print(z)
#-0.7 indicates 0.7 standard deviations below the mean
#Therefore, 10 miles is considered a below average distance to a charging station
#have z score; would like p
import scipy.stats as st
p = st.norm.cdf(z)
print(p)
#given p = .9 , what is the x distance?
#90% of the customers drive <= what distance?
#First we find z; then we find x
a = .9
z = st.norm.ppf(a)
print("The z-score is", z)
x = (my_std * z) + my_mean
print("The value for x is:", x)
print(a*100, "% of customers drive <=", round(x, 2), "miles.")
#given p = .5 , what is the x distance?
#50% of the customers drive <= what distance?
#First we find z; then we find x
a = .5
z = st.norm.ppf(a)
print("The z-score is", z)
x = (my_std * z) + my_mean
print("The value for x is:", x)
print(a*100, "% of customers drive <=", round(x, 2), "miles.")
chargetime = df['chargeTimeHrs']
print(chargetime.head())
print(len(chargetime))
#drop the nan values
chargetime = chargetime.dropna()
print(chargetime)
#We can see that there were no nan values for chargetime because the original length matches the length after we dropped nan values.
import pandas as pd
import numpy as np
import scipy.stats as stats
import math
#Using built in Python functions
print("Using built in Python functions")
print("Info about Charge time:")
my_mean = np.mean(chargetime)
print("Mean charge time:", my_mean)
my_min = np.min(chargetime)
print("Min charge time:", my_min)
my_max = np.max(chargetime)
print("Max charge time:", my_max)
my_variance = np.var(chargetime)
print("Variance:", my_variance)
my_std = np.std(chargetime)
print("STD:", my_std)
#Definition of Z-score
def z_score(x, mean, std):
    z_score = (x - mean) / std
    return z_score
#Z-score for a charge time of 5 hours.
z = z_score(5, my_mean, my_std)
print(z)
#have z score; would like p
import scipy.stats as st
p = st.norm.cdf(z)
print(p)
#given p = .9 , what is the x chargetime?
#90% of the charge times are <= what x?
#First we find z; then we find x
a = .9
z = st.norm.ppf(a)
print("The z-score is", z)
x = (my_std * z) + my_mean
print("The value for x is:", x)
print(a*100, "% of the charge times <=", round(x, 2), "hours.")
#given p = .5 , what is the x chargetime?
#50% of the charge times are <= what x?
#First we find z; then we find x
a = .5
z = st.norm.ppf(a)
print("The z-score is", z)
x = (my_std * z) + my_mean
print("The value for x is:", x)
print(a*100, "% of the charge times <=", round(x, 2), "hours.")
fig, ax = plt.subplots()
plt.hist(df.kwhTotal)
plt.ylabel("Count")
plt.xlabel("Kilowatt-hours")
plt.title("kWh Used During Charging")
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
kwh = df['kwhTotal']
print(kwh.head())
print(len(kwh))
#drop the nan values
kwh = kwh.dropna()
print(kwh)
#We can see that there were no nan values for kwh because the original length matches the length after we dropped nan values.
import pandas as pd
import numpy as np
import scipy.stats as stats
import math
#Using built in Python functions
print("Using built in Python functions")
print("Info about kwh:")
my_mean = np.mean(kwh)
print("Mean kwh:", my_mean)
my_min = np.min(kwh)
print("Min kwh:", my_min)
my_max = np.max(kwh)
print("Max kwh:", my_max)
my_variance = np.var(kwh)
print("Variance:", my_variance)
my_std = np.std(kwh)
print("STD:", my_std)
#Definition of Z-score
def z_score(x, mean, std):
    z_score = (x - mean) / std
    return z_score
#Z-score for 8 kwh.
z = z_score(8, my_mean, my_std)
print(z)
#have z score; would like p
import scipy.stats as st
p = st.norm.cdf(z)
print(p)
#given p = .9 , what is the x kwh?
#90% of the kwhs are < what x?
#First we find z; then we find x
a = .9
z = st.norm.ppf(a)
print("The z-score is", z)
x = (my_std * z) + my_mean
print("The value for x is:", x)
print(a*100, "% of the kwh are <=", round(x, 2), "kwh.")
#given p = .5 , what is the x kwh?
#50% of the kwhs are <= what x?
#First we find z; then we find x
a = .5
z = st.norm.ppf(a)
print("The z-score is", z)
x = (my_std * z) + my_mean
print("The value for x is:", x)
print(a*100, "% of the kwh are <=", round(x, 2), "kwh.")
data = df[['kwhTotal','dollars', 'chargeTimeHrs', 'distance']]
data = data.dropna()
print(data)
data.corr()
#Plotting chargeTimeHrs and dollars
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
plt.scatter(data['chargeTimeHrs'], data['dollars'])
plt.show()
#drop zero values
#first replace zero values with nan
data = data.replace(0, np.nan)
#then drop nan values again
data = data.dropna()
print(data)
data.corr()
#Now we can see that chargeTimeHrs and dollars are very highly correlated
#Plotting chargeTimeHrs and dollars
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
plt.scatter(data['chargeTimeHrs'], data['dollars'])
plt.show()
# There is a linear relationship between charge time hours and dollars.
import matplotlib.pyplot as plt
from scipy import stats
slope,intercept, r, p, std_err =stats.linregress(data['chargeTimeHrs'],data['dollars'])
def myfunc(x):
    return slope * x + intercept
mymodel=list(map(myfunc,data['chargeTimeHrs']))
plt.scatter(data['chargeTimeHrs'], data['dollars'])
plt.plot(data['chargeTimeHrs'], mymodel)
plt.ylabel("Dollars Spent")
plt.xlabel("Hours of Charge Time")
plt.title("Charge Time Hours vs. Cost")
ax.spines['right'].set_color((.8,.8,.8))
ax.spines['top'].set_color((.8,.8,.8))
# tweak the axis labels
xlab = ax.xaxis.get_label()
ylab = ax.yaxis.get_label()
xlab.set_style('italic')
xlab.set_size(10)
ylab.set_style('italic')
ylab.set_size(10)
# tweak the title
ttl = ax.title
ttl.set_weight('bold')
plt.show()
H0: The columns are not that different, i.e. no variation in the means of groups
H1: At least one group mean is different from the others.
days = df[['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']]
print(days)
days.describe()
#Find group mean for each column and grand mean
import numpy as np
mean_mon = 0
mean_tues = 0
mean_wed = 0
mean_thurs = 0
mean_fri = 0
mean_sat = 0
mean_sun = 0
for (columnName, columnData) in days.items():
    if columnName == 'Mon':
        mean_mon = np.mean(columnData)
    elif columnName == 'Tues':
        mean_tues = np.mean(columnData)
    elif columnName == 'Wed':
        mean_wed = np.mean(columnData)
    elif columnName == 'Thurs':
        mean_thurs = np.mean(columnData)
    elif columnName == 'Fri':
        mean_fri = np.mean(columnData)
    elif columnName == 'Sat':
        mean_sat = np.mean(columnData)
    elif columnName == 'Sun':
        mean_sun = np.mean(columnData)
print("The means of the seven columns are:")
print(f"Monday: {mean_mon}")
print(f"Tuesday: {mean_tues}")
print(f"Wednesday: {mean_wed}")
print(f"Thursday: {mean_thurs}")
print(f"Friday: {mean_fri}")
print(f"Saturday: {mean_sat}")
print(f"Sunday: {mean_sun}")
print('\n')
all_days_values = days.values.flatten()
grand_mean = np.mean(all_days_values)
print(f"The grand mean is: {grand_mean}")
sst = 0
for value in all_days_values:
    sst = sst + (value - grand_mean)**2
print(sst)
ssw = 0
for (columnName, columnData) in days.items():
    if columnName == 'Mon':
        ssw = ssw + sum((columnData - mean_mon)**2)
    elif columnName == 'Tues':
        ssw = ssw + sum((columnData - mean_tues)**2)
    elif columnName == 'Wed':
        ssw = ssw + sum((columnData - mean_wed)**2)
    elif columnName == 'Thurs':
        ssw = ssw + sum((columnData - mean_thurs)**2)
    elif columnName == 'Fri':
        ssw = ssw + sum((columnData - mean_fri)**2)
    elif columnName == 'Sat':
        ssw = ssw + sum((columnData - mean_sat)**2)
    elif columnName == 'Sun':
        ssw = ssw + sum((columnData - mean_sun)**2)
print(ssw)
ssb = 0
num_of_values = len(days['Mon'])
mean_list = [mean_mon,mean_tues,mean_wed,mean_thurs,mean_fri,mean_sat,mean_sun]
for mean in mean_list:
    ssb = ssb + num_of_values * (mean - grand_mean)**2
print(ssb)
#Checking work
if round(sst) == round(ssw) + round(ssb):
    print("Your numbers make sense!")
else:
    print("Try again")
#Degrees of Freedom
m = 7
n = 3395
sst_df = (m * n) - 1
ssw_df = m * (n-1)
ssb_df = m-1
print(sst_df, ssw_df, ssb_df)
#Checking work
if sst_df == ssw_df + ssb_df:
    print("Your numbers make sense!")
else:
    print("Try again")
#Calculating msb
msb = ssb/ssb_df
print(msb)
#Calculating mse
mse = ssw/ssw_df
print(mse)
F_stat = msb/mse
print("F-stat:", F_stat)
#Checking results with pre-built function
results = stats.f_oneway(days['Mon'], days['Tues'], days['Wed'], days['Thurs'], days['Fri'], days['Sat'], days['Sun'])
print(results)
#From F Table for alpha = 0.1
ssw_df = 23758
ssb_df = 6
F_CV = 1.7741
#reject hypothesis if:
#F_CV < F_stat
F_CV < F_stat
We reject the null hypothesis and conclude that the day of the week influences how many electric vehicles are charged.
datap = df[df['weekday']!='Sun']
datao = datap[datap['weekday']!='Sat']
dat1 = datao.groupby(["weekday"],as_index=False)["chargeTimeHrs"].sum()
dat1
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(7,5))
dat1.plot(kind='bar', color = 'C0')
plt.title('Total Charging Duration on Weekdays \n')
plt.xlabel('Day of the Week')
plt.ylabel('Total Charging Duration (Hours)')
locs, labels = plt.xticks()
plt.xticks(np.arange(5), ('Fri','Mon','Thurs','Tues','Wed'))
plt.xticks(rotation=360)
plt.show()
weekdays = df[['Mon', 'Tues', 'Wed', 'Thurs', 'Fri']]
print(weekdays)
weekdays.describe()
#Find group mean for each column and grand mean
import numpy as np
wk_mean_mon = 0
wk_mean_tues = 0
wk_mean_wed = 0
wk_mean_thurs = 0
wk_mean_fri = 0
for (columnName, columnData) in weekdays.items():
    if columnName == 'Mon':
        wk_mean_mon = np.mean(columnData)
    elif columnName == 'Tues':
        wk_mean_tues = np.mean(columnData)
    elif columnName == 'Wed':
        wk_mean_wed = np.mean(columnData)
    elif columnName == 'Thurs':
        wk_mean_thurs = np.mean(columnData)
    elif columnName == 'Fri':
        wk_mean_fri = np.mean(columnData)
print("The means of the five columns are:")
print(f"Monday: {wk_mean_mon}")
print(f"Tuesday: {wk_mean_tues}")
print(f"Wednesday: {wk_mean_wed}")
print(f"Thursday: {wk_mean_thurs}")
print(f"Friday: {wk_mean_fri}")
print('\n')
wk_days_values = weekdays.values.flatten()
wk_grand_mean = np.mean(wk_days_values)
print(f"The grand mean is: {wk_grand_mean}")
wk_sst = 0
for value in wk_days_values:
    wk_sst = wk_sst + (value - wk_grand_mean)**2
print(wk_sst)
wk_ssw = 0
for (columnName, columnData) in weekdays.items():
    if columnName == 'Mon':
        wk_ssw = wk_ssw + sum((columnData - wk_mean_mon)**2)
    elif columnName == 'Tues':
        wk_ssw = wk_ssw + sum((columnData - wk_mean_tues)**2)
    elif columnName == 'Wed':
        wk_ssw = wk_ssw + sum((columnData - wk_mean_wed)**2)
    elif columnName == 'Thurs':
        wk_ssw = wk_ssw + sum((columnData - wk_mean_thurs)**2)
    elif columnName == 'Fri':
        wk_ssw = wk_ssw + sum((columnData - wk_mean_fri)**2)
print(wk_ssw)
wk_ssb = 0
wk_num_of_values = len(weekdays['Mon'])
wk_mean_list = [wk_mean_mon, wk_mean_tues, wk_mean_wed, wk_mean_thurs, wk_mean_fri]
for mean in wk_mean_list:
    wk_ssb = wk_ssb + wk_num_of_values * (mean - wk_grand_mean)**2
print(wk_ssb)
#Checking work
if round(wk_sst) == round(wk_ssw) + round(wk_ssb):
    print("Your numbers make sense!")
else:
    print("Try again")
#Degrees of Freedom
wk_m = 5
wk_n = 3395
wk_sst_df = (wk_m * wk_n) - 1
wk_ssw_df = wk_m * (wk_n-1)
wk_ssb_df = wk_m-1
print(wk_sst_df, wk_ssw_df, wk_ssb_df)
#Checking work
if wk_sst_df == wk_ssw_df + wk_ssb_df:
    print("Your numbers make sense!")
else:
    print("Try again")
#Calculating msb
wk_msb = wk_ssb/wk_ssb_df
print(wk_msb)
#Calculating mse
wk_mse = wk_ssw/wk_ssw_df
print(wk_mse)
wk_F_stat = wk_msb/wk_mse
print("F-stat:", wk_F_stat)
#Checking work with pre-built function
results = stats.f_oneway(weekdays['Mon'], weekdays['Tues'], weekdays['Wed'], weekdays['Thurs'], weekdays['Fri'])
print(results)
#From F Table for alpha = 0.1
wk_ssw_df = 16970
wk_ssb_df = 4
wk_F_CV = 1.9449
#reject the null hypothesis if F_CV < F_stat
print(wk_F_CV < wk_F_stat)
We reject the null hypothesis and conclude that at least one group mean is statistically significantly different from the others. In context, there is at least one day from Monday to Friday on which significantly more or significantly fewer people charge their electric vehicles.
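The manual sum-of-squares decomposition used above (SST = SSW + SSB) can be sanity-checked against `scipy.stats.f_oneway`. A minimal sketch on small synthetic "per-day" samples (the data here are assumed, not from the dataset), using the conventional within-group degrees of freedom N − m:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for five weekday columns (assumed data)
rng = np.random.default_rng(0)
groups = [rng.normal(loc, 1.0, size=30) for loc in (5.0, 5.2, 4.8, 5.1, 6.0)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, within-group, and between-group sums of squares
sst = ((all_values - grand_mean) ** 2).sum()
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
assert np.isclose(sst, ssw + ssb)  # the decomposition identity

m = len(groups)            # number of groups
n_total = len(all_values)  # total observations
msb = ssb / (m - 1)        # between-group mean square
mse = ssw / (n_total - m)  # within-group mean square
f_manual = msb / mse

# Cross-check against scipy's one-way ANOVA
f_scipy, p_value = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_value)
```

The manual F-statistic matches scipy's exactly, which is the same cross-check the notebook performs with `stats.f_oneway` on the real weekday columns.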
Null Hypothesis = The facility type where a charging station is located will NOT have an impact on charge time of users.
Alternate Hypothesis = The facility type where a charging station is located will have an impact on charge time of users.
#creating df for charge time and facility type
df_fac = data2.loc[:,['chargeTimeHrs','facilityType']]
df_fac.head()
#pulling out information based on facility type
a = df_fac.loc[df_fac['facilityType'] == 'Office']
b = df_fac.loc[df_fac['facilityType'] == 'Manufacturing']
c = df_fac.loc[df_fac['facilityType'] == 'Research and Development']
d = df_fac.loc[df_fac['facilityType'] == 'Other']
#using built in stats function to calculate the f-statistic using charge time hrs by facility type
#each 'column' would contain the charge time hours by facility type
results = stats.f_oneway(a['chargeTimeHrs'], b['chargeTimeHrs'], c['chargeTimeHrs'], d['chargeTimeHrs'])
print(results) #gives us F-Stat Calculated
#finding degrees of freedom
m = 4
n = 3395
df_ssb = m-1
df_ssw = m*(n-1)
df_ssb, df_ssw
#Using table for alpha = .1 get F-CV
F_CV = 2.0838
F_Stat = 42.12424294508296
if F_CV < F_Stat:
    print('Reject Null Hypothesis')
else:
    print('Fail to Reject Null Hypothesis')
Since the F-critical value is less than our calculated F-statistic, we REJECT our null hypothesis, meaning that facility type does have an impact on how long people spend charging their cars. This can help us decide which type of facility to set up our charging station at to get the highest usage.
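Instead of reading F-critical values from a printed table, scipy can compute them directly with `stats.f.ppf`. A small sketch using the group count from the facility-type test (and the conventional N − m within-group degrees of freedom, which differs slightly from the `m*(n-1)` convention used above but gives essentially the same critical value for large samples):

```python
from scipy import stats

alpha = 0.1
m = 4              # number of facility types
n_total = 3395     # total charging sessions
df_between = m - 1
df_within = n_total - m

# Inverse CDF of the F distribution at 1 - alpha gives the critical value
f_cv = stats.f.ppf(1 - alpha, df_between, df_within)
print(round(f_cv, 4))
```

For these degrees of freedom the result is about 2.08, matching the table value of 2.0838 used above.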
facility_charge_avg = df_fac.groupby(["facilityType"])["chargeTimeHrs"].mean()
plt.figure(figsize=(7,4))
facility_charge_avg.plot(kind='bar', color = 'C0')
plt.title('Average Charging Duration at Different Facilities', fontweight='bold')
plt.xlabel('Facility Type')
plt.ylabel('Avg Charge Times Hours')
plt.xticks(rotation=45)
plt.show()
Null hypothesis = The facility type the charging station is located at and the day of the week people use the charging station are NOT related.
Alternate Hypothesis = The facility type the charging station is located at and the day of the week people use the charging station are related.
#creating dataframe containing weekday and facility type
df_chi = data2.loc[:, ['facilityType', 'weekday']]
df_chi.head()
#gives me a series of the amount of charging station uses by facility type and weekday
observed = df_chi.groupby(['facilityType', 'weekday']).size()
observed
#making a list out of the values in the observed series
observed_list = observed.values.tolist()
#adding 0's for facility/day combinations with no usage: Office on Sunday, Other on Saturday and Sunday (3 zeros total)
observed_list.insert(10, 0)
observed_list.insert(16, 0)
observed_list.insert(17, 0)
print(observed_list)
#expected = row total * column total)/grand total (formula from in class assignment 5)
column_totals = observed.groupby(['facilityType']).sum()
row_totals = observed.groupby(['weekday']).sum()
grand_total = 3395
column_totals
row_totals
#finding expected values by facility type
manufacturing_val = list(((row_totals)*593)/grand_total)
office_val = list(((row_totals)*862)/grand_total)
other_val = list(((row_totals)*108)/grand_total)
r_val = list(((row_totals)*1832)/grand_total)
#joining all 4 lists together to get all the expected values
expected_values = manufacturing_val + office_val + other_val + r_val
print(expected_values)
#creating dataframe with observed and expected lists
values = {'observed': observed_list, 'expected': expected_values}
df_final = pd.DataFrame(data=values)
#chi square stat = sum over all cells of (observed-expected)^2/expected
chi_squared_stat = (((df_final['observed'] - df_final['expected'])**2)/df_final['expected']).sum()
print(round(chi_squared_stat))
#df = (#rows-1)*(#columns-1); alpha = .1
#using built in function to get critical value
crit = stats.chi2.ppf(q = 0.90, df = 18)
print(round(crit, 2))
Since our critical value (25.99) is significantly less than our calculated chi-square statistic (297.0), we can REJECT our null hypothesis and conclude that there is a relationship between the facility type the station is located at and the day of the week that people use the charging stations.
This helps us understand what user patterns we should expect depending on where we set up our station.
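The manual expected-count construction above, including the hand-inserted zeros, can be cross-checked with `pd.crosstab` and `scipy.stats.chi2_contingency`, which fill in absent facility/day combinations automatically. A sketch on a small synthetic sample (the rows here are assumed stand-ins; the column names mirror the dataset's `facilityType` and `weekday`):

```python
import pandas as pd
from scipy import stats

# Assumed toy data standing in for df_chi
df_chi = pd.DataFrame({
    'facilityType': ['Office', 'Office', 'Manufacturing', 'Manufacturing',
                     'Research and Development', 'Other', 'Office'],
    'weekday': ['Mon', 'Tue', 'Mon', 'Wed', 'Mon', 'Fri', 'Mon'],
})

# crosstab fills absent combinations with 0, so no manual zero-inserts are needed
contingency = pd.crosstab(df_chi['facilityType'], df_chi['weekday'])

# chi2_contingency computes expected counts, the statistic, and df in one call
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(chi2, p, dof)
```

With 4 facility types and 4 observed days, `dof` comes out to (4−1)·(4−1) = 9, the same (rows−1)·(columns−1) formula applied manually above.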
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
import plotly.graph_objs as go
#turning observed series into a dataframe
observed_df = pd.DataFrame(data=observed).reset_index()
#adding rows for 0 values - if I don't do this, the graph will mess up the order of the values
observed_df.loc[9.5] = 'Office', 'Sun', 0
observed_df_new = observed_df.sort_index().reset_index(drop=True)
observed_df_new.loc[15.5] = 'Other', 'Sat', 0
observed_df_new = observed_df_new.sort_index().reset_index(drop=True)
observed_df_new.loc[16.5] = 'Other', 'Sun', 0
observed_df_new = observed_df_new.sort_index().reset_index(drop=True)
#making lists for graph
f_types = ['Manufacturing', 'Office','Other', 'Research and Development']
weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
#looping through weekdays so each bar represents the number of uses on each day
fig = go.Figure()
for weekday in weekdays:
    fig.add_trace(go.Bar(
        x=f_types,
        y=observed_df_new[0].loc[observed_df_new['weekday'] == weekday],
        name=weekday
    ))
fig.update_layout(barmode='group', xaxis_tickangle=-45, title_text='Usage by Facility Type and Day of the Week',
                  height=500, width=700)
fig.show()
#Importing Libraries
import seaborn as sns
from statsmodels.formula.api import glm
from statsmodels.formula.api import ols
import statsmodels.api as sm
data1 = df[['kwhTotal','dollars', 'chargeTimeHrs', 'distance','weekday','platform','facilityType']]
data1 = data1.dropna()
data1 = data1.replace(0, np.nan)  #treat zero readings as missing values
#then drop nan values again
data1 = data1.dropna()
print(data1)
modelDollarsDistance = ols(formula='dollars ~ distance', data=data1)
modelDollarsDistanceFit = modelDollarsDistance.fit()
modelpredictions = pd.DataFrame({'actual_dollars': data1['dollars']})
modelpredictions['actual_distance'] = data1['distance']
print( modelDollarsDistanceFit.summary() )
modelDollarsWeekday = ols(formula='dollars ~ C(weekday)', data=data1)
modelDollarsWeekdayFit = modelDollarsWeekday.fit()
modelpredictions['dollars_weekday'] = modelDollarsWeekdayFit.predict(data1)
modelpredictions['actual_weekday'] = data1['weekday']
print( modelDollarsWeekdayFit.summary() )
modelDollarsFacility = ols(formula='dollars ~ C(facilityType)', data=data1)
modelDollarsFacilityFit = modelDollarsFacility.fit()
#modelpredictions['dollars_facility'] = modelDollarsFacilityFit.predict(data1)
#modelpredictions['actual_facilityType'] = data1['facilityType']
modelpredictions['actual_dollars'] = data1['dollars']
print( modelDollarsFacilityFit.summary() )
modelDollarsChargeTime = ols(formula='dollars ~ chargeTimeHrs', data=data1)
modelDollarsChargeTimeFit = modelDollarsChargeTime.fit()
modelpredictions['dollars_chargeTimeHrs'] = modelDollarsChargeTimeFit.predict(data1)
modelpredictions['actual_chargeTimeHrs'] = data1['chargeTimeHrs']
print( modelDollarsChargeTimeFit.summary() )
modelDollarsKWH = ols(formula='dollars ~ kwhTotal', data=data1)
modelDollarsKWHFit = modelDollarsKWH.fit()
#modelpredictions['dollars_kwhTotal'] = modelDollarsKWHFit.predict(data1)
#modelpredictions['actual_kwhTotal'] = data1['kwhTotal']
print( modelDollarsKWHFit.summary() )
modelDollarsPlatform = ols(formula='dollars ~ C(platform)', data=data1)
modelDollarsPlatformFit = modelDollarsPlatform.fit()
#modelpredictions['dollars_platform'] = modelDollarsPlatformFit.predict(data1)
#modelpredictions['actual_platform'] = data1['platform']
print( modelDollarsPlatformFit.summary() )
print( modelpredictions)
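Since this section fits several single-predictor models of `dollars` and reads each summary separately, a short loop can rank the predictors by R² in one pass. A sketch, using a small synthetic frame in place of `data1` (the column names match the dataset; the values are assumed):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumed toy frame standing in for data1; swap in the real data1 to reproduce
rng = np.random.default_rng(1)
data1 = pd.DataFrame({
    'dollars': rng.uniform(0, 5, 100),
    'kwhTotal': rng.uniform(0, 20, 100),
    'chargeTimeHrs': rng.uniform(0, 8, 100),
    'distance': rng.uniform(0, 40, 100),
    'weekday': rng.choice(['Mon', 'Tue', 'Wed'], 100),
})

# Fit each single-predictor model and collect its R-squared
formulas = ['dollars ~ distance', 'dollars ~ chargeTimeHrs',
            'dollars ~ kwhTotal', 'dollars ~ C(weekday)']
r2 = {f: smf.ols(formula=f, data=data1).fit().rsquared for f in formulas}

# Print predictors from strongest to weakest
for f, r in sorted(r2.items(), key=lambda kv: -kv[1]):
    print(f"{f:30s} R^2 = {r:.3f}")
```

On the real `data1`, this makes it immediately clear which single variable explains the most variance in revenue per session.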