Introduction¶
This post provides a brief overview of data profiling tools and a custom Python function to run a mask analysis.
Data Profiling¶
The idea of Data Profiling is to use descriptive statistics to build an informative summary of the data. It's an interesting process and it's very useful in data quality and data warehousing projects.
There are several software options for Data Quality. In the course 'Data Quality Improvement' at BCIT, we used the Ataccama DQ Analyzer, a free desktop application.
In this post, I'm using the Business Licence data set (edited version) from the Vancouver Open Data Catalogue as an example.
Screenshot from Ataccama DQ Analyzer
In the screenshot, we can see the basic statistics, frequency and mask analysis.
The advantage of using Data Quality software is that once you import the data, it profiles all columns. It is easy to navigate, and if you are connected to a database, you can drill down by simply double-clicking a field.
But if you just want a quick profile, are already using Python, and don't want to import the data into another tool, there are some pandas functions that can do it. Additionally, there is a nice package called pandas-profiling. However, it does not have a mask analysis, so I'm providing a custom function for that.
Data Profiling with Pandas¶
# Import pandas
import pandas as pd
# Read the data set
df = pd.read_csv('business_licences.csv')
# Pandas .info() shows the columns and the number of non-null entries
df.info()
# Pandas .describe() shows basic statistics for columns that hold numeric values
df.describe()
# Pandas .value_counts() shows frequency
df['Country'].value_counts()
# Pandas .value_counts(normalize=True) shows frequency in percentage
df['Country'].value_counts(normalize=True)
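Beyond these, a couple of other pandas calls are also handy for a quick profile (just an aside, applied to the same df):

# Count missing values per column
df.isnull().sum()
# Count distinct values per column
df.nunique()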
# Import the pandas-profiling package and generate a full profile report
import pandas_profiling
pandas_profiling.ProfileReport(df)
This package gives some nice warnings, and for numerical fields it also provides a histogram.
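If you want to keep the report, pandas-profiling can also export it to a standalone HTML file. A minimal sketch (the exact to_file signature may vary between versions, and the file name here is just an example):

# Save the profile report as an HTML file
profile = pandas_profiling.ProfileReport(df)
profile.to_file('business_licences_profile.html')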
However, it doesn't include a mask analysis, so I wrote one myself.
Mask Analysis custom function¶
The mask analysis, or string pattern analysis, is useful for fields like city, postal code and phone number. It shows how the fields have been populated, and from the patterns we can infer some data quality issues.
Rules:
- Lower case letter returns 'l'
- Upper case letter returns 'L'
- Number returns 'D'
- Space returns 's'
- Special character returns itself
- Missing value returns '-null-'
Examples:
- 'Van' returns 'Lll'
- 'VAN' returns 'LLL'
- 'Van BC' returns 'LllsLL'
- '+1 123-1234-5555' returns '+DsDDD-DDDD-DDDD'
- A correctly formatted Canadian postal code returns 'LDLsDLD'
Here is the function; we'll run it on an example right after:
def mask_profile(series):
    '''
    Make a mask profile of a field by converting each character based on its ASCII value:
    a to z --> returns "l" for letter
    A to Z --> returns "L" for Letter
    0 to 9 --> returns "D" for Digit
    'space' --> returns "s" for space
    Special characters --> keep original
    Missing values --> return '-null-'
    Requirement: pandas
    Input: pandas Series
    '''
    def getMask(field):
        if str(field) == 'nan':  # missing values stringify to 'nan'
            return '-null-'
        mask = ''
        for character in str(field):
            if 65 <= ord(character) <= 90:     # ASCII 65 to 90 are upper case letters
                mask += 'L'
            elif 97 <= ord(character) <= 122:  # ASCII 97 to 122 are lower case letters
                mask += 'l'
            elif 48 <= ord(character) <= 57:   # ASCII 48 to 57 are digits
                mask += 'D'
            elif ord(character) == 32:         # ASCII 32 is a space
                mask += 's'
            else:                              # keep special characters as they are
                mask += character
        return mask

    masks = series.apply(getMask)  # compute the masks once and reuse them
    value = masks.value_counts()
    percentage = round(masks.value_counts(normalize=True) * 100, 2)
    result = pd.DataFrame(value)
    result['%'] = percentage
    result.columns = ['Count', '%']
    return result
To use this function:
# Pass a pandas Series to mask_profile
# I'm using head(10) to show only the top 10 results
mask_profile(df['PostalCode']).head(10)
Results:
- We can see that the first result follows the correct standard.
- There are 13.56% missing values.
- There are records with an extra space at the end ('LDLsDLDs'), some with an extra space in the middle and some at the start.
This shows us that the field allows free-form entry, and some data quality rules, such as trimming extra spaces, should be applied to clean it; a quick sketch of such a cleanup is below.
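As a minimal cleanup sketch (assuming the same PostalCode column; the regex below is my encoding of the 'LDLsDLD' standard, not part of the original analysis):

# Trim leading/trailing spaces and normalize case
df['PostalCode'] = df['PostalCode'].str.strip().str.upper()
# Flag values that still don't match the Canadian standard 'LDLsDLD'
valid = df['PostalCode'].str.match(r'^[A-Z]\d[A-Z] \d[A-Z]\d$', na=False)
df.loc[~valid, 'PostalCode'].head()
# Re-run the mask profile to check the result
mask_profile(df['PostalCode']).head(10)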