Cross-validation tutorial

This code is available as examples/cross_validation_demo.py.

N-fold cross-validation involves splitting the data into N chunks, where N is the number of folds. For each fold, all but one of the chunks are used for training and the remaining chunk is used for testing. This is repeated so that each chunk is used exactly once for testing. The cross-validation error is then the mean error over all folds. The special case where the number of folds equals the number of samples is called leave-one-out cross-validation.
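To make the splitting concrete, here is a minimal sketch in plain Python that is independent of Oger. The helper name n_fold_indices and its seed parameter are assumptions made for illustration. It only shows how sample indices can be partitioned so that each one is used for testing exactly once.

```python
import random


def n_fold_indices(n_samples, n_folds, seed=0):
    """Return a list of (train, test) index lists, one pair per fold.

    Hypothetical helper for illustration only, not part of Oger.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Chunk i takes every n_folds-th shuffled index starting at position i,
    # so the chunk sizes differ by at most one.
    test_chunks = [indices[i::n_folds] for i in range(n_folds)]
    folds = []
    for test in test_chunks:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        folds.append((train, test))
    return folds


folds = n_fold_indices(n_samples=10, n_folds=5)
for train, test in folds:
    print(sorted(train), sorted(test))

# Leave-one-out is the special case where n_folds equals n_samples.
loo_folds = n_fold_indices(n_samples=10, n_folds=10)
```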

Oger supports cross-validation of datasets. As a tutorial example, we will redo the 30th-order NARMA experiment, but this time using different forms of cross-validation. We start off with the same dataset and flow definitions:

inputs, outputs = Oger.datasets.narma30(sample_len=1000)
data = [inputs, zip(inputs, outputs)]

reservoir = Oger.nodes.ReservoirNode(output_dim=100, input_scaling=0.1)
readout = Oger.nodes.RidgeRegressionNode(0)

flow = mdp.Flow([reservoir, readout]) 

Next, instead of training and executing the flow manually on training and testing data, we use the Oger function validate:

errors = Oger.evaluation.validate(data, flow, Oger.utils.nrmse, cross_validate_function=Oger.evaluation.train_test_only, training_fraction=0.5)

This function takes a dataset (in the MDP format, i.e. a list of iterables) and a flow. It also takes a loss function (in this case the nrmse), a cross-validation function, and any additional keyword arguments, which are passed on to the cross-validation function (in this case training_fraction=0.5).
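For readers unfamiliar with the loss used here, the following sketch shows one common definition of the normalized root mean square error: the square root of the mean squared error divided by the variance of the target. The function name nrmse_sketch is hypothetical. Oger.utils.nrmse may differ in details, for instance in how the variance is normalized, so treat this as an illustration rather than a reimplementation.

```python
import math


def nrmse_sketch(predicted, target):
    """Normalized root mean square error (illustrative definition only)."""
    n = len(target)
    mean_target = sum(target) / n
    # Mean squared difference between prediction and target.
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / n
    # Population variance of the target; assumes the target is not constant.
    variance = sum((t - mean_target) ** 2 for t in target) / n
    return math.sqrt(mse / variance)


target = [1.0, 2.0, 3.0, 4.0]
perfect = nrmse_sketch(target, target)
mean_predictor = nrmse_sketch([2.5] * 4, target)
```

Under this definition a perfect prediction scores 0, while always predicting the mean of the target scores 1.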

The cross-validation function determines the type of cross-validation: here we use the special case of simple training and testing, using half of the dataset for training (training_fraction=0.5). So, this is equivalent to the original NARMA tutorial, except that here the examples used for training are selected randomly. The validate function returns a list of errors, one for each fold (which in this case is only a single value).
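A random train/test split of this kind can be sketched in plain Python as follows. The helper train_test_split_indices is hypothetical, and the way the training size is rounded is an assumption. Oger.evaluation.train_test_only may handle these details differently.

```python
import random


def train_test_split_indices(n_samples, training_fraction, seed=0):
    """Randomly split sample indices into one training and one test set.

    Hypothetical helper for illustration only, not part of Oger.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Assumption: the training set size is rounded to the nearest integer.
    n_train = int(round(training_fraction * n_samples))
    return indices[:n_train], indices[n_train:]


# Same fraction as in the validate call above: half of the examples for training.
train_idx, test_idx = train_test_split_indices(10, training_fraction=0.5)
```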

We can repeat this same experiment but using 5-fold cross-validation as follows:

errors = Oger.evaluation.validate(data, flow, Oger.utils.nrmse, n_folds=5)

Notice that we do not explicitly pass the cross-validation function, because we want the default, Oger.evaluation.n_fold_random. We also use a different keyword argument for the cross-validation function, namely the number of folds. The return value errors is in this case a list of five values.

Finally, we can maximize the amount of data used for training by using leave-one-out cross-validation as follows:

errors = Oger.evaluation.validate(data, flow, Oger.utils.nrmse, cross_validate_function=Oger.evaluation.leave_one_out)

This cross-validation function does not use any additional keyword arguments, since the number of folds is determined by the size of the dataset (in this case 10 examples). The errors will be a list of 10 values in this case.
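The overall procedure, including the final averaging of the per-fold errors, can be sketched without Oger as follows. The names validate_sketch, fit_mean and mean_squared_error, as well as the toy data, are assumptions made for illustration. The toy model simply predicts the mean training target, standing in for the reservoir flow.

```python
import random


def validate_sketch(samples, fit, loss, n_folds, seed=0):
    """Return a list with one error per fold (not Oger's validate)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    errors = []
    for k in range(n_folds):
        # Fold k tests on every n_folds-th shuffled index, trains on the rest.
        test_idx = set(indices[k::n_folds])
        train = [samples[i] for i in indices if i not in test_idx]
        test = [samples[i] for i in indices if i in test_idx]
        model = fit(train)
        errors.append(loss(model, test))
    return errors


def fit_mean(train):
    # Toy "model": the mean of the training targets, ignoring the inputs.
    return sum(y for _, y in train) / len(train)


def mean_squared_error(model, test):
    return sum((y - model) ** 2 for _, y in test) / len(test)


# Toy dataset of 10 (input, target) pairs.
samples = [(x, 2.0 * x) for x in range(10)]

five_fold = validate_sketch(samples, fit_mean, mean_squared_error, n_folds=5)
# Leave-one-out: as many folds as there are samples.
leave_one_out = validate_sketch(samples, fit_mean, mean_squared_error,
                                n_folds=len(samples))

# The cross-validation error is the mean over the per-fold errors.
cv_error = sum(five_fold) / len(five_fold)
loo_error = sum(leave_one_out) / len(leave_one_out)
```

With the list returned by Oger.evaluation.validate, the same averaging step gives the cross-validation error.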