Introduction

The goal of this analysis is to check whether there is any correlation between Google searches for crime and the total number of crimes in Vancouver. The assumption is that the number of searches reflects what is going on in the real world and people's sentiment.


Google Trends shows how often a search term is entered relative to the total search volume. About the data from Google Trends:

"Numbers represent search interest relative to the highest point on the chart for the given region and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. Likewise a score of 0 means the term was less than 1% as popular as the peak."

For this analysis, I'm using the search term crime, for British Columbia, over the period from 2004-01-01 to 2017-06-30. You can reproduce the query on the Google Trends website.
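The CSV used below was downloaded manually from the Google Trends site. As an alternative, the unofficial pytrends package could fetch the same series programmatically; a minimal sketch, assuming pytrends is installed (it is not part of the original workflow):

# Sketch using the unofficial pytrends package (pip install pytrends)
from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-US')
pytrends.build_payload(['crime'], timeframe='2004-01-01 2017-06-30', geo='CA-BC')
trend = pytrends.interest_over_time()   # monthly interest, indexed by date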


The Crime data set

The data set is the same one I've been using in my series Crime in Vancouver.

It comes from the Vancouver Open Data Catalogue.

It was extracted on 2017-07-18 and it contains 530,652 records from 2003-01-01 to 2017-07-13.

The data set was cleaned and transformed in my previous post, "An Exploratory Data Analysis of Crime in Vancouver from 2003 to 2017".


Importing the Data Analysis and Visualization packages


In [1]:
# Import data manipulation packages
import numpy as np
import pandas as pd

# Import data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Ignore warnings (optional)
import warnings
warnings.filterwarnings('ignore')

Processing and Transforming the data


Importing the Google Trend Data

In [2]:
# Note: before importing the csv, I cleaned the header and adjusted the date column
# to include the last day of the month
# parse_dates=True gives the data frame a DatetimeIndex, so it will align
# with the crime data when the two data frames are joined later
googletrend = pd.read_csv('googletrend.csv', index_col='Month', parse_dates=True)
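For reference, that manual cleanup could also be done in pandas; a minimal sketch, assuming the raw export has a title row to skip and a Month column formatted like 2004-01 ('googletrend_raw.csv' is a hypothetical file name):

# Sketch of the same cleanup in pandas ('googletrend_raw.csv' is hypothetical)
raw = pd.read_csv('googletrend_raw.csv', skiprows=1)   # skip the title row
raw.columns = ['Month', 'Search Index']                # clean the header
raw['Month'] = pd.PeriodIndex(raw['Month'], freq='M').to_timestamp(how='end')
googletrend = raw.set_index('Month')                   # month-end DatetimeIndex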
In [3]:
# Taking a look at the first entries
googletrend.head()
Out[3]:
            Search Index
Month
2004-01-31            99
2004-02-29            98
2004-03-31           100
2004-04-30            83
2004-05-31            82
In [4]:
# Checking index and data types
googletrend.info()

DatetimeIndex: 162 entries, 2004-01-31 to 2017-06-30
Data columns (total 1 columns):
Search Index    162 non-null int64
dtypes: int64(1)
memory usage: 2.5+ KB
  • We can see that the Google Trends data has 162 entries: monthly values from 2004-01 to 2017-06.
  • The Search Index column shows the popularity of the search. A value of 100 is the peak of popularity; other values are relative to the peak.


Importing the Crime data

In [5]:
# Importing CSV file
crimes = pd.read_csv('crimes.csv')

# The crime data starts from 2003, but our Google data starts from 2004. 
# Let's remove 2003 from our crime data.
crimes = crimes[crimes['DATE'] > '2003-12-31']

# Make the date column the index of the data frame.
crimes.index = pd.DatetimeIndex(crimes['DATE']) 

# The crime data lists all individual crimes. 
# We need to group it by month to compare it to the Google trend.
crimes_month = pd.DataFrame(crimes.resample('M').size()) 
In [6]:
crimes_month.info()

DatetimeIndex: 162 entries, 2004-01-31 to 2017-06-30
Freq: M
Data columns (total 1 columns):
0    162 non-null int64
dtypes: int64(1)
memory usage: 2.5 KB

Now the crimes_month data frame has the same shape as the Google Trends data: 162 entries over the same period.
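Before joining the two frames later on, it's worth a quick sanity check that the two indexes actually line up, assuming both were parsed as month-end DatetimeIndexes as above:

# Sanity check: both indexes should contain the same month-end dates
crimes_month.index.equals(googletrend.index)  # expected: True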

In [7]:
# Just renaming the column...
crimes_month.columns = ['Total']

# Taking a look at the data
crimes_month.head()
Out[7]:
            Total
DATE
2004-01-31   3767
2004-02-29   3697
2004-03-31   4254
2004-04-30   4116
2004-05-31   4042

The Total column is the total number of crimes per month. To make it comparable to the Google Trends data, let's build a "crime index", in which the month with the highest number of crimes gets a value of 100 and the other months are relative to it.

In [8]:
# Dividing the total number of crimes by the maximum value
# and truncating the result to an integer index
crimes_month['Crime Index'] = (crimes_month['Total']/crimes_month['Total']
                               .max()*100).astype(int)
In [9]:
crimes_month.head()
Out[9]:
            Total  Crime Index
DATE
2004-01-31   3767           82
2004-02-29   3697           81
2004-03-31   4254           93
2004-04-30   4116           90
2004-05-31   4042           88

Now let's join the two data frames.

In [10]:
crime_trend = pd.concat([crimes_month['Crime Index'], googletrend], axis=1)
In [11]:
crime_trend.head()
Out[11]:
            Crime Index  Search Index
2004-01-31           82            99
2004-02-29           81            98
2004-03-31           93           100
2004-04-30           90            83
2004-05-31           88            82

Now we have our data set called crime_trend.


Analyzing Correlation


Let's start with a plot of crime index and Google trends.

In [12]:
crime_trend.plot(figsize=(12,6), linewidth=3)
plt.title('Crime Index and Google Trend', fontsize=16)
plt.tick_params(labelsize=14)
plt.legend(prop={'size':14});

Now let's smooth the trends by using a 24-month moving average.

In [13]:
# Computing a rolling average with a 24-month window
crime_trend.rolling(window=24).mean().dropna().plot(figsize=(12,6), linewidth=3)
plt.title('Crime Index and Google Trend - 24 Months Moving Average', fontsize=16)
plt.tick_params(labelsize=14)
plt.legend(prop={'size':14});

Now we can see the trends!

  • The Crime Index and the Search Index decrease together until 2009.
  • From 2009, the Crime Index keeps decreasing but the Search Index stabilizes.
  • From 2013, both increase.

We can visually identify a correlation, but let's check it for each period.


Period before 2009

Let's check the correlation of the moving averages for the period before 2009.

In [14]:
# Creating the moving average and assigning it to a data frame
crime_before2009_rolling24 = (crime_trend[crime_trend.index < '2009-01-01']
                              .rolling(window=24).mean().dropna())
In [15]:
# Plot
crime_before2009_rolling24.plot(figsize=(8,4), linewidth=3)
plt.title('Crime Index and Google Trend - Moving Average - Before 2009', fontsize=16)
plt.tick_params(labelsize=14)
plt.legend(prop={'size':14});

To check the correlation, let's use a few different methods.

In [16]:
# Using pandas corr to find out the Pearson correlation
crime_before2009_rolling24.corr()
Out[16]:
              Crime Index  Search Index
Crime Index       1.00000       0.98755
Search Index      0.98755       1.00000
  • A 0.98 Pearson correlation, almost a perfect score.
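pandas' .corr() reports only the coefficient. If a p-value is wanted too, scipy's pearsonr works on the same two columns; a minimal sketch (scipy is an extra dependency here, not used in the original notebook):

# Sketch: Pearson correlation with a p-value, using scipy
from scipy.stats import pearsonr

r, p = pearsonr(crime_before2009_rolling24['Crime Index'],
                crime_before2009_rolling24['Search Index'])
print(r, p)

Keep in mind that moving averages are strongly autocorrelated, so the p-value will look better than it deserves; the coefficient is the more informative number here.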

Another way to visualize it is to make a scatter plot with a linear regression.

In [17]:
# Using seaborn joint plot
sns.jointplot(x='Crime Index',y='Search Index',data=crime_before2009_rolling24, kind='reg')\
.fig.suptitle('Crime Index vs Search Index', fontsize=16)
plt.subplots_adjust(top=0.9)


Period after 2013

Let's do the same for the period after 2013.

In [18]:
# Note: the period of interest starts in 2013, but because the moving
# average uses a 24-month window, we need data from 2011 onward.
crime_from2013_rolling24 = (crime_trend[crime_trend.index >= '2011-01-01']
                            .rolling(window=24).mean().dropna())
In [19]:
# Plot
crime_from2013_rolling24.plot(figsize=(8,4), linewidth=3)
plt.title('Crime Index and Google Trend - Moving Average - After 2013', fontsize=16)
plt.tick_params(labelsize=14)
plt.legend(prop={'size':14});
In [20]:
crime_from2013_rolling24.corr()
Out[20]:
              Crime Index  Search Index
Crime Index      1.000000      0.994265
Search Index     0.994265      1.000000
In [21]:
# Using seaborn joint plot
sns.jointplot(x='Crime Index',y='Search Index',data=crime_from2013_rolling24, kind='reg')\
.fig.suptitle('Crime Index vs Search Index', fontsize=16)
plt.subplots_adjust(top=0.9)
  • A 0.99 Pearson correlation!

Using a 6-month moving average

A 24-month moving average is great for seeing the general trend, but it smooths out too much. Let's redo it with a 6-month window.

In [22]:
# Now let's use a 6-month window
crime_from2013_rolling6 = (crime_trend[crime_trend.index >= '2011-01-01']
                            .rolling(window=6).mean().dropna())
In [23]:
# Plot
crime_from2013_rolling6.plot(figsize=(8,4), linewidth=3)
plt.title('Crime Index and Google Trend - Moving Average - After 2013', fontsize=16)
plt.tick_params(labelsize=14)
plt.legend(prop={'size':14});

This is interesting. We can see that there is a lag between the Crime Index and the Search Index: when crime increases, it takes a while until searches increase, and as the Crime Index approaches a local peak, searches start to pick up.
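Rather than eyeballing the lag, one option is to scan candidate shifts and see which one maximizes the correlation; a minimal sketch, checking lags of 0 to 12 months:

# Sketch: scan candidate lags of the Search Index against the Crime Index
for lag in range(13):
    shifted = crime_from2013_rolling6['Search Index'].shift(-lag)
    r = crime_from2013_rolling6['Crime Index'].corr(shifted)
    print('lag = {:2d} months, r = {:.3f}'.format(lag, r))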

Now let's redo this plot with a lag in the search index.

In [24]:
# Using .shift(-5) to move the Search Index 5 months earlier, removing the lag
crime_from2013_rolling6_lagged = (pd.concat([crime_from2013_rolling6['Crime Index'],
                                             crime_from2013_rolling6['Search Index']
                                             .shift(-5)], axis=1))

crime_from2013_rolling6_lagged.columns = ['Crime Index','Search Index (lagged)']
In [25]:
crime_from2013_rolling6_lagged.plot(figsize=(8,4), linewidth=3)
plt.title('Crime Index and Google Trend (Lagged) - Moving Average - After 2013', fontsize=16)
plt.tick_params(labelsize=14)
plt.legend(prop={'size':14});
In [26]:
crime_from2013_rolling6_lagged.corr()
Out[26]:
                       Crime Index  Search Index (lagged)
Crime Index               1.000000               0.927588
Search Index (lagged)     0.927588               1.000000

For the 6-month moving average with the lagged Search Index, the correlation is still very high at roughly 0.93!


Conclusion

There is an almost perfect correlation between the moving average of searches for crime and the total number of crimes in Vancouver (for the periods 2006 to 2009 and 2013 to 2017). This suggests that when crime increases or decreases, so does people's sentiment, which is reflected in Google searches.


Check more about crime in Vancouver.

