Before the data are passed to the modeling process, they need to be inspected for input errors, mistyped values, and other data defects. The following corrections were made:
– In the Gender data, the values are coded inconsistently (0, 1, D, F, Female, H, M, Male, N) and need to be cleaned or corrected. 0 and Female were changed to F, 1 and Male to M, and the rows with D, H, or N were deleted;
– Rows in which Years at address is greater than Age were deleted, since such a combination is inconsistent;
– The Income data contain outliers: two values are much larger than all the others. The possible solutions are to remove them, to collect more data to represent such cases, or to convert the variable to a nominal one by grouping its values into ranges (bins). The third solution was tried first. However, the data turned out to be unevenly distributed across the bins, so the first solution was adopted and the two rows were deleted;
– The CCJs data show the same situation as the Income data, and the same logic was applied;
– The Postcode was split to extract the postcode area, which can then be associated with a region. There are 124 postcode areas and 12 regions in the UK. This derived attribute gives the model one more useful variable;
– In the Country data, other countries (Spain, Germany, France) appear in addition to the UK. However, the postcodes in these rows refer to UK areas, and no such postcodes exist in those other countries, so the Country was updated to UK in these rows.
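The gender, years-at-address, postcode, and country rules above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code used for the analysis: the field names (`gender`, `age`, `years_at_address`, `postcode`, `country`), the helper names, and the three-entry `AREA_TO_REGION` lookup are hypothetical, and the Income and CCJs outlier removal is left out because the text gives no threshold for it.

```python
import re

# Mapping described in the text: 0/Female -> F, 1/Male -> M.
# D, H and N are deliberately absent, so rows with those values are dropped.
GENDER_MAP = {"0": "F", "Female": "F", "F": "F", "1": "M", "Male": "M", "M": "M"}

# Hypothetical, heavily truncated lookup; a full table would cover every postcode area.
AREA_TO_REGION = {"SW": "London", "M": "North West", "EH": "Scotland"}


def postcode_area(postcode):
    """Return the leading one or two letters of a postcode (its area), or None."""
    match = re.match(r"[A-Za-z]{1,2}", postcode.strip())
    return match.group(0).upper() if match else None


def clean_rows(rows):
    """Apply the gender, years-at-address, postcode and country rules to a list of dicts."""
    cleaned = []
    for row in rows:
        gender = GENDER_MAP.get(str(row["gender"]).strip())
        if gender is None:
            continue  # unmapped gender code (D, H, N): delete the row
        if row["years_at_address"] > row["age"]:
            continue  # inconsistent: longer at the address than the person's age
        area = postcode_area(row["postcode"])
        region = AREA_TO_REGION.get(area)
        # A postcode in a known UK area implies the country is the UK.
        country = "UK" if region is not None else row["country"]
        cleaned.append({**row, "gender": gender, "postcode_area": area,
                        "region": region, "country": country})
    return cleaned


# Made-up sample rows, for illustration only.
rows = [
    {"gender": "Female", "age": 40, "years_at_address": 10, "postcode": "SW1A 1AA", "country": "Spain"},
    {"gender": "H", "age": 30, "years_at_address": 5, "postcode": "M1 1AE", "country": "UK"},
    {"gender": 1, "age": 25, "years_at_address": 30, "postcode": "EH1 1YZ", "country": "UK"},
    {"gender": "M", "age": 50, "years_at_address": 20, "postcode": "m1 1ae", "country": "UK"},
]
cleaned = clean_rows(rows)
print(len(cleaned))  # 2: the "H" row and the 30-years-at-age-25 row are removed
```

Keeping the rules in one function makes the order of the checks explicit: a row is first normalized or rejected on Gender, then rejected on Years at address, and only then enriched with the derived area and region.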
Of the initial 2000 rows, five were deleted due to incorrect Gender values, four due to inconsistent Years at address, two due to incorrect Income values, and two due to incorrect CCJs values. Hence 1987 rows remain in the final data set.
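As a quick consistency check on these figures, the deletions can be tallied in a few lines. The counts are the ones reported above; the dictionary keys are merely illustrative labels.

```python
# Row counts as reported in the text.
initial_rows = 2000
removed = {"gender": 5, "years_at_address": 4, "income": 2, "ccjs": 2}

total_removed = sum(removed.values())
final_rows = initial_rows - total_removed
print(total_removed, final_rows)  # 13 1987
```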