Reservoir computing (RC) provides a robust way of training recurrent neural networks (RNNs), using standard linear regression to learn the output weights. Despite the simple training procedure, there are pitfalls to avoid, and important parameters must be set properly to make the most of RC's modeling capabilities. The following are a few useful tips for successfully training Echo State Networks (ESNs) [bib]Jaeger07sholarpedia[/bib], the RC "flavor" that is probably used most often in applications.
The first step, before choosing any RC method for a particular problem, is to determine whether RC is appropriate in the first place. RC, like any other RNN method, is inherently suited only to processing temporal data. RC should therefore not be used for modeling static data, or temporal data in which the observations at different timesteps are independent of each other.
There are several important global control parameters that have a significant effect on ESN performance, and whose optimal settings highly depend on the given task (see [bib]Jaeger2001a[/bib] and [bib]Jaeger2002[/bib] for more):
- The spectral radius of the reservoir weight matrix (i.e., the largest absolute value of its eigenvalues) co-determines the timescale of the reservoir and the amount of nonlinear interaction of the input components through time. For tasks that evolve on a slower timescale and/or have long-range temporal interactions, the spectral radius should usually be close to 1 or beyond.
- The input scaling mostly determines the degree of nonlinearity in the model. With small input weights, the reservoir units will become only slightly excited around their resting states and behave almost like linear neurons; for tasks requiring a high level of nonlinearity, the input weight scaling must be increased.
- Besides simple tanh units, reservoirs with leaky integrator neurons are often used. This introduces another important parameter to optimize, the leaking rate, which gives direct control over the timescale of the reservoir. A detailed discussion of leaky integration (and related timing issues, such as subsampling) can be found in [bib]verstraeten_thesis[/bib]. Even more powerful bandpass neurons, which respond to particular frequencies, can be used for tasks with structure on many timescales [bib]verstraeten_thesis[/bib] [bib]Siewert2007[/bib].
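To make the three parameters above concrete, here is a minimal NumPy sketch of where each one enters an ESN. The function names `make_reservoir` and `run_reservoir`, the uniform weight initialization, and the particular leaky-integrator update are illustrative assumptions, not taken from the cited references; other leaky-integrator variants exist.

```python
import numpy as np

def make_reservoir(n_units, n_inputs, spectral_radius=0.9, input_scaling=0.5, seed=0):
    """Draw random weights (assumed uniform) and apply the two global scalings."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, size=(n_units, n_units))
    # Rescale W so that its largest absolute eigenvalue equals spectral_radius.
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    # Small input_scaling keeps units near the linear part of tanh.
    W_in = input_scaling * rng.uniform(-1.0, 1.0, size=(n_units, n_inputs))
    return W, W_in

def run_reservoir(W, W_in, inputs, leaking_rate=1.0):
    """Drive the reservoir with inputs of shape (T, n_inputs); return states (T, n_units)."""
    x = np.zeros(W.shape[0])
    states = np.empty((len(inputs), W.shape[0]))
    for t, u in enumerate(inputs):
        update = np.tanh(W @ x + W_in @ u)
        # leaking_rate = 1 gives plain tanh units; smaller values slow the dynamics.
        x = (1.0 - leaking_rate) * x + leaking_rate * update
        states[t] = x
    return states

# Usage: the same input driven through a fast and a slow reservoir.
W, W_in = make_reservoir(n_units=20, n_inputs=1)
u = np.cos(0.3 * np.arange(50))[:, None]
fast_states = run_reservoir(W, W_in, u, leaking_rate=1.0)
slow_states = run_reservoir(W, W_in, u, leaking_rate=0.1)
```

With a leaking rate of 0.1 each state moves only a tenth of the way toward its tanh update per step, which is the sense in which the leaking rate sets the reservoir's timescale.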
In practice, these parameters are optimized either by manual experimentation or by automated grid search. In either case, cross-validation runs are used to assess the quality of the current parameter settings. The most extensive RC toolbox currently available, the Oger toolbox, supports grid search and a number of other state-of-the-art optimization routines.
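A grid search guided by cross-validation can be sketched in plain NumPy as follows. This does not use the Oger toolbox or its API; all function names, the toy sine-prediction task, the grid values, and the choice of contiguous folds over the collected states are assumptions made for illustration.

```python
import itertools
import numpy as np

def reservoir_states(inputs, n_units, spectral_radius, input_scaling, leaking_rate, seed=0):
    """Run a randomly initialized leaky-tanh reservoir over inputs of shape (T, n_inputs)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, (n_units, n_units))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    W_in = input_scaling * rng.uniform(-1.0, 1.0, (n_units, inputs.shape[1]))
    x = np.zeros(n_units)
    states = np.empty((len(inputs), n_units))
    for t, u in enumerate(inputs):
        x = (1 - leaking_rate) * x + leaking_rate * np.tanh(W @ x + W_in @ u)
        states[t] = x
    return states

def ridge_fit(X, Y, ridge):
    """Linear readout weights via ridge regression."""
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)

def cv_error(states, targets, n_folds=4, ridge=1e-6):
    """Mean squared error averaged over contiguous held-out folds (an assumed scheme)."""
    all_idx = np.arange(len(states))
    errors = []
    for held_out in np.array_split(all_idx, n_folds):
        train = np.setdiff1d(all_idx, held_out)
        W_out = ridge_fit(states[train], targets[train], ridge)
        pred = states[held_out] @ W_out
        errors.append(np.mean((pred - targets[held_out]) ** 2))
    return float(np.mean(errors))

def grid_search(inputs, targets, grid, n_units=50, washout=50):
    """Try every parameter combination and keep the one with the lowest CV error."""
    best = None
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        # Discard the initial transient before fitting the readout.
        states = reservoir_states(inputs, n_units, **params)[washout:]
        err = cv_error(states, targets[washout:])
        if best is None or err < best[1]:
            best = (params, err)
    return best

# Toy task (assumed): predict the next sample of a sine wave.
signal = np.sin(0.2 * np.arange(600))
inputs, targets = signal[:-1, None], signal[1:, None]
grid = {
    "spectral_radius": [0.5, 0.9, 1.1],
    "input_scaling": [0.1, 1.0],
    "leaking_rate": [0.3, 1.0],
}
best_params, best_err = grid_search(inputs, targets, grid)
```

The cost grows with the product of the grid sizes, since the reservoir must be re-run for every combination of these three parameters.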
Finally, as in all machine learning methods, it is important to optimize the model capacity (again, guided by cross-validation). In reservoir computing this can be done in two ways: by changing the size of the reservoir or by regularization. Regularization can be done by inserting noise during training (as documented, e.g., in [bib]Jaeger2004[/bib]) or by ridge regression [bib]verstraeten_thesis[/bib] [bib]wyffels2008[/bib]. Note that ridge regression is computationally cheap to tune, because only the readout weights have to be recomputed for each value of the regularization parameter, without re-running the ESN on the training data. In pattern generation tasks, noise insertion has been reported to have advantages over ridge regression with respect to the dynamical stability of the trained system [bib]Jaeger2007[/bib]. A general rule of thumb appears to be that it never (or very rarely) hurts to use a large reservoir without optimizing its size, and to rely solely on regularization to control the model capacity.
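The reason ridge regression is cheap to tune can be seen in a short NumPy sketch: the reservoir is run once, and only a linear solve is repeated for each regularization value. The noisy-sine task, the reservoir size, the 70/30 time-ordered split, and the candidate ridge values are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_steps, washout = 100, 1000, 100

# Fixed random reservoir (assumed uniform weights, spectral radius 0.9).
W = rng.uniform(-0.5, 0.5, (n_units, n_units))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-1.0, 1.0, (n_units, 1))

# Toy task (assumed): one-step-ahead prediction of a noisy sine wave.
signal = np.sin(0.2 * np.arange(n_steps + 1)) + 0.1 * rng.standard_normal(n_steps + 1)
inputs, targets = signal[:-1, None], signal[1:, None]

# Run the reservoir ONCE and store the states.
x = np.zeros(n_units)
states = np.empty((n_steps, n_units))
for t in range(n_steps):
    x = np.tanh(W @ x + W_in @ inputs[t])
    states[t] = x

# Drop the initial transient, then split in time order.
X, Y = states[washout:], targets[washout:]
split = int(0.7 * len(X))
X_tr, Y_tr = X[:split], Y[:split]
X_va, Y_va = X[split:], Y[split:]

# These products are shared by every regularization value.
XtX = X_tr.T @ X_tr
XtY = X_tr.T @ Y_tr

def readout(ridge):
    """Readout weights for one ridge value; no reservoir re-run is needed."""
    return np.linalg.solve(XtX + ridge * np.eye(n_units), XtY)

ridge_values = [1e-8, 1e-6, 1e-4, 1e-2, 1.0]
val_errors = {r: float(np.mean((X_va @ readout(r) - Y_va) ** 2)) for r in ridge_values}
best_ridge = min(val_errors, key=val_errors.get)
```

Noise insertion does not share this property: the noise alters the reservoir states themselves, so each noise level requires a fresh run over the training data.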