Which algorithm is better suited to processing the data?
The information gain of an input is the difference between the entropy of the output and the conditional entropy of the output given that input. At each split, the algorithm picks the variable that maximizes the information gain, which minimizes the uncertainty in the leaves, going from the general case to the specific case.
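As a rough illustration of this definition, the sketch below computes the information gain of a single nominal input. The function names and the toy data (`income_bin`, `risk`) are hypothetical and are not taken from the data set used here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the output minus the entropy of the output given the input."""
    total = len(labels)
    # Group the output labels by the value of the input.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy: weighted average of the entropy within each group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: one input that separates the classes perfectly,
# and one input that carries no information about the output.
risk = ["bad", "bad", "good", "good"]
income_bin = ["low", "low", "high", "high"]
unrelated = ["a", "b", "a", "b"]

gain_perfect = information_gain(income_bin, risk)
gain_none = information_gain(unrelated, risk)
```

A tree-building algorithm of this kind would prefer `income_bin` over `unrelated` for the split, because it removes more of the uncertainty about the output.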
To reduce overfitting, the number of nodes must be limited by ensuring that each leaf represents a reasonable quantity of data.
This algorithm usually works better on nominal attributes, so the first step is to split the numeric ones into bins. To do that, the “Discretize” filter was applied to the following variables: Age, Years at address, Current debt, Income, and Loan amount. Discretizing variables has some disadvantages: the resulting function is not smooth, the model cannot distinguish between values inside a bin, and the prediction jumps as soon as a value crosses a bin boundary. This means that if two cases fall into the same bin, the model assigns them the same probability, whereas a model built on the undiscretized values could show that the probabilities differ. It is also necessary to choose the bin sizes carefully: the bin widths and the starting and ending values of the bins (the smallest and largest values in the training data) affect the distribution and therefore how the results compare on the test data.
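The effect of binning can be sketched with a simple equal-width discretization. This is only an illustration under assumed values: the ages, the number of bins, and the helper names are hypothetical, and the code is not the filter's actual implementation.

```python
from bisect import bisect_left


def equal_width_edges(train_values, n_bins):
    """Interior bin boundaries spanning the min and max of the training data."""
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def to_bin(x, edges):
    """Index of the bin containing x; a value on a boundary goes to the lower bin."""
    return bisect_left(edges, x)


# Hypothetical training ages: the range 20..60 is split into 4 bins of width 10.
train_ages = [20, 30, 40, 50, 60]
edges = equal_width_edges(train_ages, 4)  # boundaries at 30, 40 and 50

same_bin = to_bin(31, edges) == to_bin(39, edges)  # 31 and 39 look identical to the model
jump = to_bin(39, edges) != to_bin(41, edges)      # crossing 40 switches bins
outside = to_bin(75, edges) == to_bin(60, edges)   # a test value above the training max
```

In this sketch, ages 31 and 39 fall into the same bin and would receive the same prediction, while 39 and 41 are treated as different categories. A test value of 75, beyond the largest training value, simply lands in the last bin, which is one way the training range influences results on the test data.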