Before implementing any model, the data has been split into two parts: 70% for training and validation and 30% for testing. To obtain a fair random division, Weka provides a filter called “RemovePercentage” that removes a given percentage of the data; with the “invertSelection” flag, it keeps the complementary part instead. The test data consists of 596 rows, and the training and validation data consist of 1391 rows.
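A minimal sketch of this split with Weka's Java API, assuming the data lives in a file named dataset.arff (a placeholder name) and using a fixed random seed for reproducibility:

    import java.util.Random;

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.RemovePercentage;

    public class SplitData {
        public static void main(String[] args) throws Exception {
            // "dataset.arff" is a placeholder, not the project's actual file
            Instances data = DataSource.read("dataset.arff");

            // Shuffle with a fixed seed so the split is random but reproducible
            data.randomize(new Random(42));

            // Remove 30% of the rows -> keeps 70% for training/validation
            RemovePercentage keep70 = new RemovePercentage();
            keep70.setPercentage(30);
            keep70.setInputFormat(data);
            Instances trainVal = Filter.useFilter(data, keep70);

            // Same filter with invertSelection -> keeps the other 30% as test data
            RemovePercentage keep30 = new RemovePercentage();
            keep30.setPercentage(30);
            keep30.setInvertSelection(true);
            keep30.setInputFormat(data);
            Instances test = Filter.useFilter(data, keep30);

            System.out.println("Train/validation rows: " + trainVal.numInstances());
            System.out.println("Test rows: " + test.numInstances());
        }
    }

Randomizing before filtering matters, because RemovePercentage by itself removes a contiguous block of rows in dataset order rather than a random sample.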
The first model is a tree-building algorithm, called J48 in Weka. The model looks like a tree: the algorithm decides which variable goes at the top and which ones further down based on which gives the most information gain. Each branch is a decision on the chosen variable (represented by a node), and each final leaf is a classification.
To choose which variable goes at the top, the algorithm follows the “divide and conquer” method: one variable at a time is evaluated, and the one that minimizes the error is placed at the top; a branch is then created for each of its values, and the data is split among the branches. The same procedure is repeated recursively on each subset, as the sketch below illustrates.
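A short sketch of training and inspecting such a tree with Weka's J48 class; the file name train.arff and the position of the class attribute are assumptions:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BuildTree {
        public static void main(String[] args) throws Exception {
            // "train.arff" is a placeholder for the training/validation file
            Instances train = DataSource.read("train.arff");
            // Assumption: the last attribute is the class to predict
            train.setClassIndex(train.numAttributes() - 1);

            J48 tree = new J48();        // Weka's implementation of C4.5
            tree.buildClassifier(train); // grows the tree by divide and conquer
            System.out.println(tree);    // inner nodes = tests, leaves = classes
        }
    }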
The top node is the input variable that, on average, leaves the least uncertainty about the output. This uncertainty is measured by entropy: entropy is higher when the output is more uncertain and lower when the output is more certain.
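To make the criterion concrete, the following self-contained sketch computes entropy in bits and the information gain of a hypothetical two-valued attribute; the class counts are invented toy numbers, not taken from the dataset above:

    public class InfoGain {
        // Entropy of a class distribution, given as raw counts
        static double entropy(int... counts) {
            int total = 0;
            for (int c : counts) total += c;
            double h = 0.0;
            for (int c : counts) {
                if (c == 0) continue;
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2); // log base 2 -> bits
            }
            return h;
        }

        public static void main(String[] args) {
            // Toy data: 9 "yes" and 5 "no" before any split
            double before = entropy(9, 5); // about 0.940 bits

            // Hypothetical two-valued attribute:
            // value A -> (6 yes, 1 no), value B -> (3 yes, 4 no)
            double after = (7.0 / 14) * entropy(6, 1)
                         + (7.0 / 14) * entropy(3, 4);

            // Information gain = reduction in uncertainty about the class
            System.out.printf("Entropy before: %.3f bits%n", before);
            System.out.printf("Information gain: %.3f bits%n", before - after);
        }
    }

The attribute with the largest gain, i.e. the largest drop in entropy, is the one placed at the top of the tree.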