Introduction


This post provides a brief overview of data profiling tools and a custom Python function for running a mask analysis.



Data Profiling


The idea of Data Profiling is to use descriptive statistics to build an informative summary of the data. It's an interesting process and it's very useful in data quality and data warehousing projects.

There are several software options for Data Quality. In the course 'Data Quality Improvement' at BCIT, we used the Ataccama DQ Analyzer, a free desktop application.

In this post, I'm using the data set Business Licence (edited version) from the Vancouver Open Data Catalogue as an example.


Screenshot from Ataccama DQ Analyzer

In the screenshot, we can see the basic statistics, frequency and mask analysis.

The advantage of using Data Quality software is that once you import the data, it profiles every column. It is easy to navigate, and if you are connected to a database, you can drill down by simply double-clicking a field.

But if you just want a quick profile, are already working in Python, and don't want to import the data into another tool, a few pandas functions can do it. Additionally, there is a nice package called pandas-profiling. However, it does not have a mask analyzer, so I'm providing a custom function for that as well.



Data Profiling with Pandas


In [1]:
# Import pandas
import pandas as pd
In [2]:
# Read the data set
df = pd.read_csv('business_licences.csv')
In [3]:
# Using pandas .info() shows columns and number of entries
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61398 entries, 0 to 61397
Data columns (total 7 columns):
LicenceRSN           61398 non-null int64
City                 53339 non-null object
Province             53324 non-null object
Country              53363 non-null object
PostalCode           53075 non-null object
NumberOfEmployees    61398 non-null float64
FeePaid              51635 non-null float64
dtypes: float64(2), int64(1), object(4)
memory usage: 3.3+ MB
In [4]:
# Pandas .describe() shows basic statistics for columns that hold numeric values
df.describe()
Out[4]:
LicenceRSN NumberOfEmployees FeePaid
count 6.139800e+04 61398.000000 51635.000000
mean 2.605270e+06 11.502218 310.544360
std 5.447688e+04 1214.855678 803.488849
min 2.394142e+06 0.000000 -40.000000
25% 2.574301e+06 0.000000 136.000000
50% 2.589652e+06 1.000000 136.000000
75% 2.605020e+06 4.000000 178.000000
max 2.876442e+06 300813.000000 43459.000000
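By default, .describe() only covers the numeric columns. Passing include='object' summarizes the text columns too (count, distinct values, most frequent value and its frequency). A minimal sketch on a toy frame standing in for the licence data:

```python
import pandas as pd

# Toy frame standing in for the licence data (column names from the post)
df = pd.DataFrame({'City': ['Vancouver', 'Vancouver', 'Burnaby', None],
                   'FeePaid': [136.0, 65.0, 136.0, None]})

# include='object' summarizes the text columns that .describe() skips by default:
# count of non-null values, number of unique values, top value and its frequency
print(df.describe(include='object'))
```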
In [5]:
# Pandas .value_counts() shows frequency
df['Country'].value_counts()
Out[5]:
CAN               53273
USA                  83
BRASIL                1
UK                    1
Canada                1
United Kingdom        1
Finland               1
MEX                   1
Mexico                1
Name: Country, dtype: int64
In [6]:
# Pandas .value_counts(normalize=True) shows frequency in percentage
df['Country'].value_counts(normalize=True)
Out[6]:
CAN               0.998313
USA               0.001555
BRASIL            0.000019
UK                0.000019
Canada            0.000019
United Kingdom    0.000019
Finland           0.000019
MEX               0.000019
Mexico            0.000019
Name: Country, dtype: float64
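The two calls above can also be combined into a single table with counts and percentages side by side, similar to the frequency view in DQ Analyzer. A small sketch using a toy stand-in for df['Country'] (column names are my own):

```python
import pandas as pd

# Toy stand-in for df['Country'] (the real column has 53k+ values)
countries = pd.Series(['CAN'] * 5 + ['USA', 'USA', 'Mexico'])

# Counts and percentages side by side in one DataFrame
freq = pd.DataFrame({
    'Count': countries.value_counts(),
    '%': round(countries.value_counts(normalize=True) * 100, 2),
})
print(freq)
```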


Using pandas-profiling package


See the official pandas-profiling documentation for more information about the package.

Note: on my basic laptop, running the package on the full data set did not work. It froze. I recommend saving your work before running it...
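One way around the freezing is to profile a random sample of the rows instead of the whole frame. A sketch (the 5,000-row sample size is my own choice, not from the post):

```python
import pandas as pd

# Toy frame standing in for the full 61,398-row business_licences.csv
df = pd.DataFrame({'FeePaid': range(61398)})

# Profiling a random sample keeps memory use manageable on a modest machine;
# random_state makes the sample reproducible
sample = df.sample(n=5000, random_state=42)

# pandas_profiling.ProfileReport(sample)  # then profile the sample instead
print(len(sample))
```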

In [7]:
import pandas_profiling
In [8]:
pandas_profiling.ProfileReport(df)
Out[8]:

Overview

Dataset info

Number of variables 7
Number of observations 61398
Total Missing (%) 9.8%
Total size in memory 3.3 MiB
Average record size in memory 56.0 B

Variables types

Numeric 3
Categorical 4
Date 0
Text (Unique) 0
Rejected 0

Warnings

  • City has 8059 / 13.1% missing values Missing
  • City has a high cardinality: 260 distinct values Warning
  • Country has 8035 / 13.1% missing values Missing
  • FeePaid has 9763 / 15.9% missing values Missing
  • NumberOfEmployees has 24932 / 40.6% zeros
  • NumberOfEmployees is highly skewed (γ1 = 247.25)
  • PostalCode has 8323 / 13.6% missing values Missing
  • PostalCode has a high cardinality: 16043 distinct values Warning
  • Province has 8074 / 13.2% missing values Missing
  • Dataset has 11 duplicate rows Warning

Variables

City
Categorical

Distinct count 260
Unique (%) 0.5%
Missing (%) 13.1%
Missing (n) 8059
Vancouver
45792
Burnaby
 
1209
Surrey
 
1174
Other values (256)
 
5164
(Missing)
 
8059
Value Count Frequency (%)  
Vancouver 45792 74.6%
 
Burnaby 1209 2.0%
 
Surrey 1174 1.9%
 
North Vancouver 818 1.3%
 
Richmond 746 1.2%
 
Coquitlam 555 0.9%
 
Langley 486 0.8%
 
Port Coquitlam 305 0.5%
 
Delta 280 0.5%
 
Maple Ridge 240 0.4%
 
Other values (249) 1734 2.8%
 
(Missing) 8059 13.1%
 

Country
Categorical

Distinct count 10
Unique (%) 0.0%
Missing (%) 13.1%
Missing (n) 8035
CAN
53273
USA
 
83
BRASIL
 
1
Other values (6)
 
6
(Missing)
 
8035
Value Count Frequency (%)  
CAN 53273 86.8%
 
USA 83 0.1%
 
BRASIL 1 0.0%
 
UK 1 0.0%
 
Canada 1 0.0%
 
United Kingdom 1 0.0%
 
Finland 1 0.0%
 
MEX 1 0.0%
 
Mexico 1 0.0%
 
(Missing) 8035 13.1%
 

FeePaid
Numeric

Distinct count 1544
Unique (%) 3.0%
Missing (%) 15.9%
Missing (n) 9763
Infinite (%) 0.0%
Infinite (n) 0
Mean 310.54
Minimum -40
Maximum 43459
Zeros (%) 0.0%

Quantile statistics

Minimum -40
5-th percentile 65
Q1 136
Median 136
Q3 178
95-th percentile 1178
Maximum 43459
Range 43499
Interquartile range 42

Descriptive statistics

Standard deviation 803.49
Coef of variation 2.5874
Kurtosis 358.96
Mean 310.54
MAD 293.95
Skewness 13.84
Sum 16035000
Variance 645590
Memory size 479.8 KiB
Value Count Frequency (%)  
136.0 19294 31.4%
 
65.0 4465 7.3%
 
176.0 3061 5.0%
 
165.0 2967 4.8%
 
493.0 1003 1.6%
 
253.0 928 1.5%
 
22.0 777 1.3%
 
130.0 727 1.2%
 
205.0 656 1.1%
 
714.0 554 0.9%
 
Other values (1533) 17203 28.0%
 
(Missing) 9763 15.9%
 

Minimum 5 values

Value Count Frequency (%)  
-40.0 1 0.0%
 
1.0 1 0.0%
 
2.0 520 0.8%
 
2.1 1 0.0%
 
3.0 5 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
20783.0 1 0.0%
 
20827.0 1 0.0%
 
27879.0 1 0.0%
 
28810.0 1 0.0%
 
43459.0 1 0.0%
 

LicenceRSN
Numeric

Distinct count 61381
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2605300
Minimum 2394142
Maximum 2876442
Zeros (%) 0.0%

Quantile statistics

Minimum 2394142
5-th percentile 2562000
Q1 2574300
Median 2589700
Q3 2605000
95-th percentile 2736200
Maximum 2876442
Range 482300
Interquartile range 30718

Descriptive statistics

Standard deviation 54477
Coef of variation 0.02091
Kurtosis 5.899
Mean 2605300
MAD 34970
Skewness 2.4051
Sum 159958368718
Variance 2967700000
Memory size 479.8 KiB
Value Count Frequency (%)  
2605510 8 0.0%
 
2561610 2 0.0%
 
2584544 2 0.0%
 
2717711 2 0.0%
 
2561137 2 0.0%
 
2691436 2 0.0%
 
2578467 2 0.0%
 
2578518 2 0.0%
 
2578466 2 0.0%
 
2709599 2 0.0%
 
Other values (61371) 61372 100.0%
 

Minimum 5 values

Value Count Frequency (%)  
2394142 1 0.0%
 
2412175 1 0.0%
 
2422877 1 0.0%
 
2428084 1 0.0%
 
2436909 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
2855969 1 0.0%
 
2860761 1 0.0%
 
2867402 1 0.0%
 
2873411 1 0.0%
 
2876442 1 0.0%
 

NumberOfEmployees
Numeric

Distinct count 230
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 11.502
Minimum 0
Maximum 300810
Zeros (%) 40.6%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 1
Q3 4
95-th percentile 20
Maximum 300810
Range 300810
Interquartile range 4

Descriptive statistics

Standard deviation 1214.9
Coef of variation 105.62
Kurtosis 61220
Mean 11.502
MAD 17.778
Skewness 247.25
Sum 706210
Variance 1475900
Memory size 479.8 KiB
Value Count Frequency (%)  
0.0 24932 40.6%
 
1.0 11364 18.5%
 
2.0 6203 10.1%
 
3.0 3473 5.7%
 
4.0 2477 4.0%
 
5.0 2089 3.4%
 
6.0 1516 2.5%
 
10.0 1105 1.8%
 
8.0 1009 1.6%
 
7.0 898 1.5%
 
Other values (220) 6332 10.3%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 24932 40.6%
 
1.0 11364 18.5%
 
1.5 2 0.0%
 
1.7 1 0.0%
 
2.0 6203 10.1%
 

Maximum 5 values

Value Count Frequency (%)  
1110.0 1 0.0%
 
1500.0 1 0.0%
 
4000.0 3 0.0%
 
4500.0 1 0.0%
 
300813.0 1 0.0%
 

PostalCode
Categorical

Distinct count 16043
Unique (%) 30.2%
Missing (%) 13.6%
Missing (n) 8323
 
396
V5Z 2M9
 
208
V5Z 4C2
 
170
Other values (16039)
52301
(Missing)
 
8323
Value Count Frequency (%)  
396 0.6%
 
V5Z 2M9 208 0.3%
 
V5Z 4C2 170 0.3%
 
V6H 3R4 169 0.3%
 
V6B 6N9 119 0.2%
 
V6C 1T2 114 0.2%
 
V6Z 1N9 109 0.2%
 
V6Z 1G3 107 0.2%
 
V6C 3L2 100 0.2%
 
V6Z 2H2 98 0.2%
 
Other values (16032) 51485 83.9%
 
(Missing) 8323 13.6%
 

Province
Categorical

Distinct count 34
Unique (%) 0.1%
Missing (%) 13.2%
Missing (n) 8074
BC
53043
ON
 
107
AB
 
57
Other values (30)
 
117
(Missing)
 
8074
Value Count Frequency (%)  
BC 53043 86.4%
 
ON 107 0.2%
 
AB 57 0.1%
 
CA 28 0.0%
 
QC 17 0.0%
 
WA 11 0.0%
 
FL 7 0.0%
 
NY 7 0.0%
 
MA 5 0.0%
 
PQ 5 0.0%
 
Other values (23) 37 0.1%
 
(Missing) 8074 13.2%
 

Sample

LicenceRSN City Province Country PostalCode NumberOfEmployees FeePaid
0 2853550 Saint-Laurent QC CAN H4T 1T3 0.0 NaN
1 2598351 Langley BC CAN V2Y 0J7 2.0 136.0
2 2567170 New Wesminster BC CAN V3L 0E9 0.0 NaN
3 2570995 NaN NaN NaN NaN 0.0 65.0
4 2572894 Abbotsford BC CAN V2S 8A7 0.0 136.0

This package gives some nice warnings, and for numerical fields it also provides a histogram.

It doesn't have a mask analysis, though, so I wrote one myself.



Mask Analysis custom function


Mask analysis (string pattern analysis) is useful for fields like city, postal code, and phone number. It shows us how the fields have been populated, and from it we can infer some data quality issues.

Rules:

  • lower case letter returns 'l'
  • Capital letter returns 'L'
  • Number returns 'D'
  • Space returns 's'
  • Special character returns itself
  • Missing value returns '-null-'

Examples:

  • 'Van' returns 'Lll'
  • 'VAN' returns 'LLL'
  • 'Van BC' returns 'LllsLL'
  • '+1 123-1234-5555' returns '+DsDDD-DDDD-DDDD'
  • The standard for the Canadian Postal Code should be 'LDLsDLD'

Let's run the function and see an example:

In [9]:
def mask_profile(series):
    '''
    Make a mask profile of a field by converting each character as below.
    a to z --> returns "l" for lower-case letter
    A to Z --> returns "L" for capital letter
    0 to 9 --> returns "D" for digit
    'space' --> returns "s" for space
    Special characters --> kept as-is
    Missing values --> returns "-null-"
    Requirement: pandas
    Input: pandas Series
    Output: DataFrame with the count and percentage of each mask
    '''
    def getMask(field):
        if pd.isna(field):  # more robust than comparing str(field) to 'nan'
            return '-null-'
        mask = ''
        for character in str(field):
            if 65 <= ord(character) <= 90:    # ASCII 65 to 90: capital letters
                mask += 'L'
            elif 97 <= ord(character) <= 122: # ASCII 97 to 122: lower-case letters
                mask += 'l'
            elif 48 <= ord(character) <= 57:  # ASCII 48 to 57: digits
                mask += 'D'
            elif ord(character) == 32:        # ASCII 32: space
                mask += 's'
            else:
                mask += character
        return mask
    masks = series.apply(getMask)  # apply once and reuse for both counts
    result = pd.DataFrame({
        'Count': masks.value_counts(),
        '%': round(masks.value_counts(normalize=True) * 100, 2),
    })
    return result

To use this function:

In [10]:
# Use mask_profile(pass a pandas series)
# I'm using head(10) to show only the top 10 results
mask_profile(df['PostalCode']).head(10)
Out[10]:
Count %
LDLsDLD 51072 83.18
-null- 8323 13.56
LDLsDLDs 665 1.08
LDLDLD 497 0.81
s 396 0.64
LDLssDLD 193 0.31
DDDDD 75 0.12
sLDLsDLD 47 0.08
LDLsDLL 16 0.03
sLDLsDLDs 13 0.02

Results:

  • We can see that the first result follows the correct standard.
  • There are 13.56% missing values.
  • There are records with an extra space at the end (LDLsDLDs), some with an extra space in the middle, and some at the start.

This shows us that the field allows free-form entry, and some data quality rules, such as trimming extra spaces, should be applied to clean it.
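That trimming rule can be sketched with the pandas string methods. A minimal example on hypothetical postal-code values reproducing the space issues the masks revealed:

```python
import pandas as pd

# Hypothetical postal codes showing trailing, leading, and doubled spaces
codes = pd.Series(['V5Z 2M9', 'V5Z 2M9 ', ' V5Z 2M9', 'V5Z  2M9'])

# Trim leading/trailing spaces, then collapse internal runs of spaces to one
cleaned = (codes.str.strip()
                .str.replace(r'\s+', ' ', regex=True))

print(cleaned.nunique())  # all four variants collapse to a single value
```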

