Introduction


This is an Exploratory Data Analysis of crime in Vancouver.

This analysis was conducted using Python with Jupyter Notebook. The notebook contains all the code so that the analysis can be replicated. It also contains code comments, some of them basic, meant to help others who might be learning Python.

The Data Set

The data comes from the Vancouver Open Data Catalogue.

It was extracted on 2017-07-18 and it contains 530,652 records from 2003-01-01 to 2017-07-13.

The original data set contains coordinates in UTM Zone 10. To convert the coordinates to Latitude and Longitude, I used a spreadsheet that can be found here.
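If you prefer to stay in Python, the same conversion can be sketched with the pyproj package. This is a minimal sketch, assuming the coordinates are NAD83 / UTM Zone 10N (the projection commonly used for Vancouver data); the actual analysis used the spreadsheet above.

# Convert one UTM Zone 10N (NAD83) coordinate pair to latitude/longitude
from pyproj import Proj

utm10 = Proj(proj='utm', zone=10, datum='NAD83')

# inverse=True goes from easting/northing (X, Y) to longitude/latitude
lon, lat = utm10(493906.5, 5457452.47, inverse=True)
print(lat, lon)  # roughly 49.27, -123.08 (near Terminal Ave, as in the rows below)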

If you would like to use the same data, you can download it here.

Data Quality

Please note that the data set does NOT include all crimes. For instance, the Vancouver Police Department 2016 Annual Report lists a total of 55,145 incidents for 2016. This data set has a total of 37,798 incidents for 2016, which corresponds to 69% of the official number. Also, this data set might change over time. For more details, see the Vancouver Open Data Catalogue description.
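As a quick sanity check of that 69% figure, once the CSV is loaded (as done in the next section), two lines reproduce it:

# Count 2016 incidents in this extract and compare to the official 55,145
count_2016 = (df['YEAR'] == 2016).sum()        # 37,798 in this extract
print(round(count_2016 / 55145.0 * 100))       # ~69 (percent of the official total)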


Importing the Data Analysis and Visualization Packages


In [1]:
# Import data manipulation packages
import numpy as np
import pandas as pd

# Import data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Ignore warnings (optional)
import warnings
warnings.filterwarnings('ignore')

Processing and Transforming the Data


In [2]:
# Import CSV file as a pandas df (Data Frame)
df = pd.read_csv('crime_csv_all_years.csv')
In [3]:
# Take a look at the first entries
df.head()
Out[3]:
TYPE YEAR MONTH DAY HOUR MINUTE HUNDRED_BLOCK NEIGHBOURHOOD X Y
0 Other Theft 2003 5 12 16.0 15.0 9XX TERMINAL AVE Strathcona 493906.5 5457452.47
1 Other Theft 2003 5 7 15.0 20.0 9XX TERMINAL AVE Strathcona 493906.5 5457452.47
2 Other Theft 2003 4 23 16.0 40.0 9XX TERMINAL AVE Strathcona 493906.5 5457452.47
3 Other Theft 2003 4 20 11.0 15.0 9XX TERMINAL AVE Strathcona 493906.5 5457452.47
4 Other Theft 2003 4 12 17.0 45.0 9XX TERMINAL AVE Strathcona 493906.5 5457452.47

Remove unnecessary columns

The MINUTE column can be deleted, as we don't need minute-level detail. The location columns (X, Y, NEIGHBOURHOOD, HUNDRED_BLOCK) won't be used in this analysis, but they will be used in the Tableau dashboard, so let's keep them for now.

In [5]:
# Drop the column. axis=1 indicates columns; inplace=True applies the change to df directly
df.drop(['MINUTE'], axis = 1, inplace=True)

Missing Values

In [6]:
# Let's take a look into our data to check for missing values and data types
df.info()

RangeIndex: 530652 entries, 0 to 530651
Data columns (total 9 columns):
TYPE             530652 non-null object
YEAR             530652 non-null int64
MONTH            530652 non-null int64
DAY              530652 non-null int64
HOUR             476290 non-null float64
HUNDRED_BLOCK    530639 non-null object
NEIGHBOURHOOD    474028 non-null object
X                530652 non-null float64
Y                530652 non-null float64
dtypes: float64(3), int64(3), object(3)
memory usage: 36.4+ MB

We can see that we have 530,652 entries, but some columns (HOUR, HUNDRED_BLOCK and NEIGHBOURHOOD) have fewer, which means there are missing values. Let's fill them.

In [7]:
# As HOUR is a float data type, fill it with a dummy value of 99. Fill the others with 'N/A'
df['HOUR'].fillna(99, inplace = True)
df['NEIGHBOURHOOD'].fillna('N/A', inplace = True)
df['HUNDRED_BLOCK'].fillna('N/A', inplace = True)

Transforming the DATE column

The date is split across separate columns (YEAR, MONTH, DAY). Let's combine them into a single column called DATE.

In [8]:
# Use pandas function to_datetime to convert it to a datetime data type
df['DATE'] = pd.to_datetime({'year':df['YEAR'], 'month':df['MONTH'], 'day':df['DAY']})

It might be useful to have the day of the week...

In [9]:
# Let's use pandas dt.dayofweek (Monday=0 to Sunday=6) and add it as a column called 'DAY_OF_WEEK'
df['DAY_OF_WEEK'] = df['DATE'].dt.dayofweek

When working with time series data, using the date as the index is helpful.

In [10]:
# Change the index to the column 'DATE'
df.index = pd.DatetimeIndex(df['DATE'])

This data set was extracted on 2017-07-18, so it contains partial data for July 2017. I'm excluding those records to keep full months only.

In [11]:
# Filtering the data to exclude month of July 2017
df = df[df['DATE'] < '2017-07-01']
In [12]:
# Let's check the data again
df.info()

DatetimeIndex: 529403 entries, 2003-05-12 to 2017-06-06
Data columns (total 11 columns):
TYPE             529403 non-null object
YEAR             529403 non-null int64
MONTH            529403 non-null int64
DAY              529403 non-null int64
HOUR             529403 non-null float64
HUNDRED_BLOCK    529403 non-null object
NEIGHBOURHOOD    529403 non-null object
X                529403 non-null float64
Y                529403 non-null float64
DATE             529403 non-null datetime64[ns]
DAY_OF_WEEK      529403 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(4), object(3)
memory usage: 48.5+ MB

Creating Categories

Let's see the types of crime we have in our data and categorize them.

In [13]:
# Using pandas value_counts function to aggregate types
df['TYPE'].value_counts().sort_index()
Out[13]:
Break and Enter Commercial                                 33787
Break and Enter Residential/Other                          60768
Homicide                                                     219
Mischief                                                   70253
Offence Against a Person                                   54035
Other Theft                                                51995
Theft from Vehicle                                        172265
Theft of Bicycle                                           25597
Theft of Vehicle                                           38359
Vehicle Collision or Pedestrian Struck (with Fatality)       254
Vehicle Collision or Pedestrian Struck (with Injury)       21871
Name: TYPE, dtype: int64

From the types above, I'm creating the following categories: Break and Enter, Theft, Vehicle Collision and Others.

In [14]:
# Create a function to categorize types using if/elif statements
def category(crime_type):
    if 'Theft' in crime_type:
        return 'Theft'
    elif 'Break' in crime_type:
        return 'Break and Enter'
    elif 'Collision' in crime_type:
        return 'Vehicle Collision'
    else:
        return 'Others'
In [15]:
# Apply the function and add it as CATEGORY column
df['CATEGORY'] = df['TYPE'].apply(category)

Although 'vehicle collision' is included in this data, I'll separate it out because I believe it is a different kind of crime.

In [16]:
vehicle_collision = df[df['CATEGORY'] == 'Vehicle Collision']
crimes = df[df['CATEGORY'] != 'Vehicle Collision']

Now we have our data frame named crimes ready to analyze.

Exploratory Data Analysis


What's the distribution of crimes per day?

Let's start with a histogram of crimes per day.

In [17]:
# Using resample('D') to group it by day and size() to return the count
plt.figure(figsize=(15,6))
plt.title('Distribution of Crimes per day', fontsize=16)
plt.tick_params(labelsize=14)
sns.distplot(crimes.resample('D').size(), bins=60);
  • We can see that the distribution looks roughly normal, with a mean around 95 crimes per day (a quick numeric check follows below).

  • There is one outlier over 600. Let's find out which day it is.
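Before drilling into the outlier, a quick numeric summary of the daily counts backs up the visual read; this is a minimal check reusing the crimes DataFrame from above.

# Summary statistics of crimes per day (count, mean, std, min/max)
crimes.resample('D').size().describe()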

Outlier

In [18]:
# Using idxmax() to find out the index of the max value
crimes.resample('D').size().idxmax()
Out[18]:
Timestamp('2011-06-15 00:00:00', freq='D')

So the day was 2011-06-15.

Let's make a time series graph with crimes per day.

In [19]:
# Create an Upper Control Limit (UCL) and a Lower Control Limit (LCL) without the outlier
crimes_daily = pd.DataFrame(crimes[crimes['DATE'] != '2011-06-15'].resample('D').size())
crimes_daily['MEAN'] = crimes[crimes['DATE'] != '2011-06-15'].resample('D').size().mean()
crimes_daily['STD'] = crimes[crimes['DATE'] != '2011-06-15'].resample('D').size().std()
UCL = crimes_daily['MEAN'] + 3 * crimes_daily['STD']
LCL = crimes_daily['MEAN'] - 3 * crimes_daily['STD']

# Plot Total crimes per day, UCL, LCL, Moving-average
plt.figure(figsize=(15,6))
crimes.resample('D').size().plot(label='Crimes per day')
UCL.plot(color='red', ls='--', linewidth=1.5, label='UCL')
LCL.plot(color='red', ls='--', linewidth=1.5, label='LCL')
crimes_daily['MEAN'].plot(color='red', linewidth=2, label='Average')
plt.title('Total crimes per day', fontsize=16)
plt.xlabel('Day')
plt.ylabel('Number of crimes')
plt.tick_params(labelsize=14)
plt.legend(prop={'size':16});

We can see some days above the Control Limits, indicating signals. Also, the 2003 to 2008 period sits above the overall average, so a separate Control Limit could be computed for that period (a sketch follows below), but a single one is fine for now.
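Here is a minimal sketch of what period-specific control limits could look like; the 2009 split point is just an illustrative choice, not something derived from the data.

# Compute a separate mean/UCL/LCL for each period (outlier excluded)
daily = crimes[crimes['DATE'] != '2011-06-15'].resample('D').size()
for label, period in [('2003-2008', daily[:'2008']), ('2009-2017', daily['2009':])]:
    mean, std = period.mean(), period.std()
    print(label, 'mean=%.0f UCL=%.0f LCL=%.0f' % (mean, mean + 3*std, mean - 3*std))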

Let's focus on the day 2011-06-15, which is way above the UCL. Is that an error in the data?

Let's drill down and find out...

In [20]:
# Find out how many crimes by getting the length
len(crimes['2011-06-15'])
Out[20]:
647
In [21]:
# Check how many crimes per type
crimes['2011-06-15']['TYPE'].value_counts().head(5)
Out[21]:
Mischief                      367
Break and Enter Commercial    174
Offence Against a Person       35
Theft from Vehicle             31
Theft of Bicycle               13
Name: TYPE, dtype: int64
In [22]:
# Check how many crimes per neighbourhood
crimes['2011-06-15']['NEIGHBOURHOOD'].value_counts().head(5)
Out[22]:
Central Business District    534
N/A                           38
Mount Pleasant                13
West End                      13
Strathcona                     9
Name: NEIGHBOURHOOD, dtype: int64
In [23]:
# Check how many crimes per hour
crimes['2011-06-15']['HOUR'].value_counts().head(5)
Out[23]:
20.0    159
21.0    132
22.0    108
19.0     48
99.0     35
Name: HOUR, dtype: int64

There are 647 occurrences, mostly of the mischief type, in the Central Business District, around the same hour but not exactly. They don't really seem to be duplicate entries; a quick check below supports that.
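This rough check counts rows on that day that are exact copies of another row. Identical rows could still be distinct incidents, so it is only suggestive.

# Number of rows on 2011-06-15 that fully duplicate another row
crimes['2011-06-15'].duplicated().sum()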

The Stanley Cup Riot

I moved to Canada in 2016 and had no idea whether something had happened that day, so a Google search showed me it was the final game of the Stanley Cup, Boston Bruins vs. Vancouver Canucks. There were 155,000 people watching it in the downtown area. Before the game was over, as the Vancouver Canucks were about to lose, a big riot started. It seems that it was ugly... If you want to know more about it, there is a 21-page report by the BC Ministry of Justice.

So that day wasn't an error in the data, and it showed me something about Vancouver that I wasn't expecting at all. Interesting...

Which days have the highest and lowest average number of crimes?

In [24]:
# Create a pivot table of counts by day and month; another counting the number of years each day appears in; then divide to get the average.
crimes_pivot_table = crimes[(crimes['DATE'] != '2011-06-15')].pivot_table(values='YEAR', index='DAY', columns='MONTH', aggfunc=len)
crimes_pivot_table_year_count = crimes[(crimes['DATE'] != '2011-06-15')].pivot_table(values='YEAR', index='DAY', columns='MONTH', aggfunc=lambda x: len(x.unique()))
crimes_average = crimes_pivot_table/crimes_pivot_table_year_count
crimes_average.columns = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

# Using seaborn heatmap
plt.figure(figsize=(7,9))
plt.title('Average Number of Crime per Day and Month', fontsize=14)
sns.heatmap(crimes_average.round(), cmap='seismic', linecolor='grey',linewidths=0.1, cbar=False, annot=True, fmt=".0f");

Blue means calm days, red means busy days, white means average days.

  • The calmest day is Christmas Day, December 25 (about 30% below average). Criminals also celebrate with their families... (A quick check of these percentages follows this list.)
  • The worst day is New Year's Day, January 1 (about 30% above average). After the Christmas celebrations, do criminals take advantage of drunk people at parties?
  • The first day of the month is a busy day in every month.
  • The days around Halloween (October 30-31 and November 1) are also dangerous.
  • The second week of the summer months is usually the most dangerous.
  • The BC Day long weekend (August 7) also has high averages.
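And here is the quick check of those percentages, reusing crimes_average from the cell above; the overall figure is the mean over all valid day/month cells.

# Compare Dec 25 and Jan 1 against the overall daily average
overall = crimes_average.stack().mean()
print('Dec 25: %+.0f%%' % (100 * (crimes_average.loc[25, 'Dec'] / overall - 1)))
print('Jan  1: %+.0f%%' % (100 * (crimes_average.loc[1, 'Jan'] / overall - 1)))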

Is crime decreasing or increasing?

Now let's plot the number of crimes per month and a moving average:

In [25]:
# Using resample 'M' and rolling window 12
plt.figure(figsize=(15,6))
crimes.resample('M').size().plot(label='Total per month')
crimes.resample('M').size().rolling(window=12).mean().plot(color='red', linewidth=5, label='12-month Moving Average')

plt.title('Crimes per month', fontsize=16)
plt.xlabel('')
plt.legend(prop={'size':16})
plt.tick_params(labelsize=16);
  • The average number of crimes per month decreased from about 4,000 to around 2,400 between 2003 and 2011. That's really good! (A few numeric spot-checks follow this list.)
    • Vancouver hosted the 2010 Winter Olympics, and the host-city election happened in 2003. So maybe the decrease in crime is related to this event?
  • From 2011 to 2014, the moving average stayed about the same.
  • From 2014 on, the average has increased, and 2016 reached levels similar to 2008.
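These are the spot-checks mentioned above: a minimal sketch reading a few year-end points off the 12-month moving average.

# Value of the 12-month moving average at a few year-ends
ma = crimes.resample('M').size().rolling(window=12).mean()
for month in ['2003-12', '2011-12', '2014-12', '2016-12']:
    print(month, '%.0f' % ma[month].iloc[0])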

Is this trend the same for all categories?

Let's redo the plot but with categories.

In [26]:
# Using pivot_table to group by date and category, resample 'M' and rolling window 12
crimes.pivot_table(values='TYPE', index='DATE', columns='CATEGORY', aggfunc=len).resample('M').sum().rolling(window=12).mean().plot(figsize=(15,6), linewidth=4)
plt.title('Moving Average of Crimes per month by Category', fontsize=16)
plt.xlabel('')
plt.legend(prop={'size':16})
plt.tick_params(labelsize=16);
  • Theft is the major category. The decrease and increase that we saw in the average number of crimes per month was mainly because of the variation in this category.

Is there any trend within the year?

Let's make a heatmap with months and categories

In [27]:
# Create a pivot table with month and category. 
crimes_pivot_table = crimes.pivot_table(values='TYPE', index='CATEGORY', columns='MONTH', aggfunc=len)

# To compare categories, I'm scaling each one by dividing by its max value
crimes_scaled = pd.DataFrame(crimes_pivot_table.iloc[0] / crimes_pivot_table.iloc[0].max())

# Using a for loop to scale others
for i in [2,1]:
    crimes_scaled[crimes_pivot_table.index[i]] =  pd.DataFrame(crimes_pivot_table.iloc[i] / crimes_pivot_table.iloc[i].max())
                    
# Using seaborn heatmap
plt.figure(figsize=(4,4))
plt.title('Month and Category heatmap', fontsize=14)
plt.tick_params(labelsize=12)
sns.heatmap(crimes_scaled, cmap='seismic', linecolor='grey',linewidths=0.1, cbar=False);
  • Break and Enter has most incidents in January.
  • Theft and Others are more frequent during the summer months. Maybe because people go out more and there are more tourists?
  • December is not a "good month" for Theft and Others either. Or are thieves busy breaking into people's homes while they are travelling?

What about Day of the Week?

In [28]:
# Create a pivot table with day of the week and category. 
crimes_pivot_table = crimes.pivot_table(values='TYPE', index='CATEGORY', columns='DAY_OF_WEEK', aggfunc=len)

# To compare categories, I'm scaling each one by dividing by its max value
crimes_scaled = pd.DataFrame(crimes_pivot_table.iloc[0] / crimes_pivot_table.iloc[0].max())

# Using a for loop to scale the other rows
for i in [2,1]:
    crimes_scaled[crimes_pivot_table.index[i]] = crimes_pivot_table.iloc[i] / crimes_pivot_table.iloc[i].max()
                    
crimes_scaled.index = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

# Using seaborn heatmap
plt.figure(figsize=(4,4))
plt.title('Day of the Week and Category heatmap', fontsize=14)
plt.tick_params(labelsize=12)
sns.heatmap(crimes_scaled, cmap='seismic', linecolor='grey',linewidths=0.1, cbar=False);
  • Break and Enter is more frequent on week days.
  • Theft and Others on weekends.

What hours do crimes happen?

In [29]:
# Create a pivot table with hour and category. 
crimes_pivot_table = crimes.pivot_table(values='TYPE', index='CATEGORY', columns='HOUR', aggfunc=len)

# To compare categories, I'm scaling each one by dividing by its max value
crimes_scaled = pd.DataFrame(crimes_pivot_table.iloc[0] / crimes_pivot_table.iloc[0].max())

# Using a for loop to scale the other rows
for i in [2,1]:
    crimes_scaled[crimes_pivot_table.index[i]] =  pd.DataFrame(crimes_pivot_table.iloc[i] / crimes_pivot_table.iloc[i].max())
                    
# Using seaborn heatmap
plt.figure(figsize=(5,5))
plt.title('Hour and Category heatmap', fontsize=14)
plt.tick_params(labelsize=12)
sns.heatmap(crimes_scaled, cmap='seismic', linecolor='grey',linewidths=0.1, cbar=False);
  • Most crimes happen between 17:00 and 20:00.
  • Break and Enter has some activity between 3 and 5 a.m.
  • Theft doesn't occur much between 3 and 5 a.m., as expected, because most people are asleep.
  • The Others category is missing HOUR in most of its entries (they are 'offset to protect privacy'); the share per category is quantified below.
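The last point can be quantified with a one-liner: the share of entries per category whose HOUR was missing and therefore filled with the dummy value 99.

# Fraction of entries with the dummy HOUR value, per category
crimes.groupby('CATEGORY')['HOUR'].apply(lambda s: (s == 99).mean()).round(2)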

Do crimes happen in the same hour for each day of the week?

In [30]:
# Create a pivot table with hour and day of week. 
crimes_pivot_table = crimes[crimes['HOUR'] != 99].pivot_table(values='TYPE', index='DAY_OF_WEEK', columns='HOUR', aggfunc=len)

# To compare days of the week, I'm scaling each day by dividing by its max value
crimes_scaled = pd.DataFrame(crimes_pivot_table.loc[0] / crimes_pivot_table.loc[0].max())

# Using a for loop to scale each day
for i in [1,2,3,4,5,6]:
    crimes_scaled[i] = crimes_pivot_table.loc[i] / crimes_pivot_table.loc[i].max()

# Rename days of week
crimes_scaled.columns = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

# Using seaborn heatmap
plt.figure(figsize=(6,6))
plt.title('Hour and Day of the Week heatmap', fontsize=14)
plt.tick_params(labelsize=12)
sns.heatmap(crimes_scaled, cmap='seismic', linecolor='grey',linewidths=0.1, cbar=False);

Exploring Category "Theft"

We saw before that most crimes fall into the "Theft" category. Let's explore it further.

In [31]:
# Let's check what types of theft we have and how many
crimes[crimes['CATEGORY'] == 'Theft']['TYPE'].value_counts()
Out[31]:
Theft from Vehicle    172265
Other Theft            51995
Theft of Vehicle       38359
Theft of Bicycle       25597
Name: TYPE, dtype: int64

Theft from Vehicle is the major type. Let's view each type over time.

In [32]:
# Initiate the figure and define its size
plt.figure(figsize=(15,8))

# Using a for loop to plot each type of crime with a moving average
i = 221
for crime_type in crimes[crimes['CATEGORY'] == 'Theft']['TYPE'].unique():    
    plt.subplot(i);
    crimes[crimes['TYPE'] == crime_type].resample('M').size().plot(label='Total per month')
    crimes[crimes['TYPE'] == crime_type].resample('M').size().rolling(window=12).mean().plot(color='red', linewidth=5, label='12-month Moving Average')
    plt.title(crime_type, fontsize=14)
    plt.xlabel('')
    plt.legend(prop={'size':12})
    plt.tick_params(labelsize=12)
    i = i + 1
  • First, let me start with "Theft of Vehicle":
    • It had a major decrease, from an average of around 520 crimes per month in 2003 to around 100 in 2012. That's impressive! (A quick spot-check follows this list.)
    • Although the average has been increasing in the past few years, it's still way below 2003.
    • In 2002 the "Bait Car" program was launched, and in 2003 the IMPACT group was formed in response to this peak in theft. It looks like they've been doing a great job!
    • Side note: I wonder if this decrease of around 80% in the number of thefts had any impact on insurance policy prices...
  • Second, about "Other Theft":
    • Following the opposite trend, other theft has been increasing, from around 200 to almost 500 crimes per month.
    • Is it because stealing a car became too risky, but thieves still need to "make a living"?
  • About "Theft from Vehicle":
    • It is the most frequent type.
    • It decreased along with "Theft of Vehicle" until 2012, but then it increased significantly.
  • Finally, about "Theft of Bicycle":
    • We can see a clear trend within the year. It has peaks during summer months, which is expected.
    • The average has also been increasing.
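This is the spot-check mentioned in the first bullet: comparing the average number of "Theft of Vehicle" incidents per month in 2003 and 2012, reusing the crimes DataFrame.

# Monthly averages for Theft of Vehicle at the peak (2003) and the low (2012)
tov = crimes[crimes['TYPE'] == 'Theft of Vehicle'].resample('M').size()
print('2003: %.0f   2012: %.0f' % (tov['2003'].mean(), tov['2012'].mean()))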

That's it for this section!

In [33]:
# Export the processed data to a csv that will be used in the next sections
# df.to_csv('crimes.csv')
