What is a better way to show data of banking clients?
The data set is made up of several variables. Excluding some of them introduces a deliberate bias into the model, which helps it generalize and predict reasonably well on new data.
Customer ID: discrete numeric variable that identifies each customer. It is excluded because it is a flat and wide variable: its values are spread over a wide range with a very low count for each (here, exactly one record per value), so records cannot be grouped by it and it contributes nothing to the learning process. It is too specific to build a generic model.
Fictional Surname: nominal variable. It is excluded for the same reason as the Customer ID.
Age: discrete numeric variable.
Gender: nominal variable.
Years at the address: discrete numeric variable.
Employment status: nominal variable.
Country: nominal variable. It is excluded because, after cleaning and manipulating the data, only one country remains, so the variable is constant and adds nothing to the model. This is the extreme case of a variable dominated by a single value (a large majority value).
Current debt: numeric continuous variable.
Postcode: nominal variable. It is excluded because it is a flat and wide variable, but it was first used to derive the region, and that derived variable is included in the model.
Income: numeric continuous variable.
Own home: nominal variable.
CCJs: discrete numeric variable.
Loan amount: numeric continuous variable.
Outcome: nominal variable. It is the output of the model.
The initial variables are therefore Age, Gender, Years at address, Employment status, Current debt, Region (derived from postcode), Income, Own home, CCJs and Loan amount.
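The exclusion rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used on the data: the records, field names and outcome labels are invented, the 0.9 threshold for calling a variable flat and wide is arbitrary, and taking the leading letters of the postcode as the region is a guess at one possible mapping.

```python
import re

# Hypothetical records: field names and values are invented for illustration.
records = [
    {"customer_id": 1001, "surname": "Ames", "age": 34, "country": "UK",
     "postcode": "LS1 4AP", "income": 28000.0, "outcome": "Paid"},
    {"customer_id": 1002, "surname": "Bell", "age": 51, "country": "UK",
     "postcode": "M1 1AE", "income": 41000.0, "outcome": "Defaulted"},
    {"customer_id": 1003, "surname": "Cole", "age": 27, "country": "UK",
     "postcode": "LS6 2QT", "income": 19500.0, "outcome": "Paid"},
    {"customer_id": 1004, "surname": "Dunn", "age": 45, "country": "UK",
     "postcode": "B33 8TH", "income": 36000.0, "outcome": "Paid"},
]

# Identifier-like and nominal fields to screen (continuous fields are skipped,
# since almost every value of a continuous variable is distinct anyway).
NOMINAL = ["customer_id", "surname", "country", "postcode", "outcome"]


def distinct_ratio(rows, field):
    """Share of rows that carry a distinct value for `field`."""
    return len({row[field] for row in rows}) / len(rows)


def constant_fields(rows, fields):
    """Fields with a single value across all rows (e.g. one country)."""
    return [f for f in fields if len({row[f] for row in rows}) == 1]


def flat_and_wide_fields(rows, fields, threshold=0.9):
    """Fields where nearly every row has its own value (assumed threshold)."""
    return [f for f in fields if distinct_ratio(rows, f) >= threshold]


def postcode_area(postcode):
    """Leading letters of a postcode, used here as a stand-in for region."""
    match = re.match(r"[A-Za-z]+", postcode.strip())
    return match.group(0).upper() if match else None


constant = constant_fields(records, NOMINAL)
flat = flat_and_wide_fields(records, NOMINAL)
excluded = set(constant) | set(flat)

# Derive the region from the postcode before the postcode itself is dropped.
model_rows = []
for row in records:
    kept = {k: v for k, v in row.items() if k not in excluded}
    kept["region"] = postcode_area(row["postcode"])
    model_rows.append(kept)

print("constant:", constant)
print("flat and wide:", flat)
print("first model row:", model_rows[0])
```

On this toy sample the country is flagged as constant and the customer ID, surname and postcode as flat and wide, while the region derived from the postcode is kept alongside the remaining variables.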