2010: Breakthrough of supervised deep learning. No unsupervised pre-training. The rest is history.
Jürgen Schmidhuber (9/2/2020)
In 2020, we are celebrating the 10-year anniversary of our publication [MLP1] in Neural Computation (2010) on deep multilayer perceptrons trained by plain gradient descent on GPU. Surprisingly, our simple but unusually deep supervised artificial neural network (NN) outperformed all previous methods on the then-famous machine learning benchmark MNIST. That is, by 2010, when compute was roughly 100 times more expensive than today, both our feedforward NNs and our earlier recurrent NNs (e.g., CTC-LSTM for connected handwriting recognition) were able to beat all competing algorithms on important problems of that time. In the 2010s, this deep learning revolution quickly spread from Europe to America and Asia.
Just one decade ago, many thought that deep NNs could not learn much without unsupervised pre-training, a technique I introduced in 1991 [UN0-UN3] that was later also championed by others, e.g., [UN4-5] [VID1] [T20]. In fact, it was claimed [VID1] that “nobody in their right mind would ever suggest” applying plain gradient descent through backpropagation [BP1] (see also [BPA-C] [BP2-6] [R7]) to train feedforward NNs (FNNs) with many layers of neurons.
However, in March 2010, our team with my outstanding Romanian postdoc Dan Ciresan [MLP1] showed that deep FNNs can indeed be trained by plain backpropagation for important applications. This required neither unsupervised pre-training nor Ivakhnenko’s incremental layer-wise training of 1965 [DEEP1-2]. By the standards of 2010, our supervised NN had many layers. It set a new performance record [MLP1] on the then-famous and widely used image recognition benchmark MNIST [MNI]. This was achieved by greatly accelerating traditional multilayer perceptrons on highly parallel graphics processing units (GPUs), going beyond the important GPU work of Oh & Jung (2004) [GPUNN]. A reviewer called this a “wake-up call to the machine learning community.”
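To make this concrete, here is a minimal sketch of the general recipe, written in modern PyTorch rather than the hand-tuned GPU code of [MLP1]: a plain deep multilayer perceptron trained by supervised backpropagation, with no pre-training of any kind. The layer widths, learning rate, and training step below are illustrative assumptions only; they are not the architecture of [MLP1], and the sketch omits the GPU-specific optimizations and the deformed training images that made the original approach fast and accurate.

    # A deep fully connected net trained by plain backpropagation (SGD),
    # with no unsupervised pre-training. All sizes are illustrative.
    import torch
    import torch.nn as nn

    sizes = [784, 1000, 500, 250, 10]   # hypothetical layer widths, not those of [MLP1]
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(fan_in, fan_out), nn.Tanh()]
    mlp = nn.Sequential(*layers[:-1])   # drop the final Tanh; the last layer outputs class logits

    optimizer = torch.optim.SGD(mlp.parameters(), lr=0.01)   # plain gradient descent
    loss_fn = nn.CrossEntropyLoss()

    def train_step(images, labels):
        # One supervised step: forward pass, backpropagation, weight update.
        optimizer.zero_grad()
        logits = mlp(images.view(images.size(0), -1))   # flatten 28x28 MNIST-style inputs
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
        return loss.item()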
Our results set the stage for the recent decade of deep learning [DEC]. In February 2011, our team extended the approach to deep Convolutional NNs (CNNs) [GPUCNN1]. This greatly improved earlier work [GPUCNN]. The so-called DanNet [GPUCNN1] [R6] broke several benchmark records. In May 2011, DanNet was the first deep CNN to win a computer vision competition [GPUCNN5] [GPUCNN3]. In August 2011, it was the first to win a vision contest with superhuman performance [GPUCNN5]. Our team kept winning vision contests in 2012 [GPUCNN5]. Subsequently, many researchers adopted this technique. By May 2015, we had the first extremely deep FNNs with more than 100 layers [HW1] (compare [HW2] [HW3]).
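As a rough illustration of such an architecture (not the exact DanNet of [GPUCNN1]), the following sketch stacks convolution and max-pooling stages and tops them with fully connected layers; the filter counts, kernel sizes, and output dimension are invented for the example.

    # Illustrative deep CNN with alternating convolution and max-pooling
    # stages plus fully connected layers; hyperparameters are assumptions,
    # not those of DanNet [GPUCNN1].
    import torch.nn as nn

    cnn = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=5), nn.Tanh(), nn.MaxPool2d(2),   # 28x28 -> 12x12
        nn.Conv2d(32, 64, kernel_size=5), nn.Tanh(), nn.MaxPool2d(2),  # 12x12 -> 4x4
        nn.Flatten(),
        nn.Linear(64 * 4 * 4, 200), nn.Tanh(),
        nn.Linear(200, 10),   # e.g., 10 digit classes
    )
    # cnn.cuda()  # the decisive speed-up came from running such nets on GPUs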
The original successes required a precise understanding of the inner workings of GPUs [MLP1] [GPUCNN1]. Today, convenient software packages shield the user from such details. Compute is roughly 100 times cheaper than a decade ago, and many commercial NN applications are based on what started in 2010 [MLP1] [DL1-4] [DEC].
In this context it should be mentioned that right before the 2010s, our team had already achieved another breakthrough in supervised deep learning with the more powerful recurrent NNs (RNNs), whose basic architectures were introduced over half a century earlier [MC43] [K56]. My PhD student Alex Graves won three connected handwriting competitions (French, Farsi, Arabic) at ICDAR 2009, the famous conference on document analysis and recognition. He used a combination of two methods developed in my research groups at TU Munich and the Swiss AI Lab IDSIA: supervised LSTM RNNs (1990s-2005) [LSTM0-6], which overcome the famous vanishing gradient problem analyzed by my PhD student Sepp Hochreiter in 1991 [VAN1], and Connectionist Temporal Classification (2006) [CTC]. CTC-trained LSTM was the first RNN to win international contests. Compare Sec. 4 of [MIR] and Sec. A & B of [T20].
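To make the CTC-LSTM combination concrete, here is a minimal sketch using a modern library (PyTorch's nn.LSTM and nn.CTCLoss): a bidirectional LSTM reads a sequence of frame features and CTC matches its per-frame outputs to an unsegmented label sequence. The feature and alphabet sizes are hypothetical and do not reproduce the ICDAR 2009 systems.

    # Sketch of a CTC-trained LSTM for unsegmented sequence labeling,
    # e.g., connected handwriting. All sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    num_features, num_labels = 40, 28   # per-frame features; e.g., 26 letters + space + CTC blank
    lstm = nn.LSTM(num_features, 128, bidirectional=True)
    classifier = nn.Linear(2 * 128, num_labels)
    ctc_loss = nn.CTCLoss(blank=0)

    def loss_on_batch(frames, targets, frame_lens, target_lens):
        # frames: (time, batch, num_features); targets: concatenated label indices.
        hidden, _ = lstm(frames)                             # LSTM copes with long time lags
        log_probs = classifier(hidden).log_softmax(dim=-1)   # (time, batch, num_labels)
        return ctc_loss(log_probs, targets, frame_lens, target_lens)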
That is, by 2010, both our supervised FNNs and our supervised RNNs were able to outperform all other methods on important problems. In the 2010s, this supervised deep learning revolution quickly spread from Europe to North America and Asia, with enormous impact on industry and daily life [DL4] [DEC]. However, it should be mentioned that the conceptual roots of deep learning reach back deep into the previous millennium [DEEP1-2] [DL1-2] [MIR] (Sec. 21 & Sec. 19) [T20] (e.g., Sec. II & D).
Finally, let me emphasize that the supervised deep learning revolution of the 2010s did not really kill all variants of unsupervised learning. Many are still important. For example, pre-trained language models are now heavily used in the context of transfer learning, e.g., [TR2]. And our active & generative unsupervised NNs since 1990 [AC90-AC20] are still used to endow agents with artificial curiosity [MIR] (Sec. 5 & Sec. 6); see also GANs [AC20] [R2] [T20] (Sec. XVII), a special case of our adversarial NNs [AC90b]. Unsupervised learning still has a bright future!
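As a toy illustration of the adversarial principle behind such generative NNs, here is a minimal GAN-style training step: a generator learns to produce samples that a discriminator cannot distinguish from real data, while the discriminator learns to tell them apart. All names, architectures, and hyperparameters below are invented for the example and do not correspond to any particular system in [AC90b] or [AC20].

    # Toy adversarial training step: generator G vs. discriminator D.
    # Everything here is an illustrative assumption, not a specific system.
    import torch
    import torch.nn as nn

    latent_dim, data_dim = 16, 784
    G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))
    D = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    def adversarial_step(real):
        n = real.size(0)
        fake = G(torch.randn(n, latent_dim))
        # The discriminator is trained to separate real data from G's outputs ...
        opt_d.zero_grad()
        d_loss = bce(D(real), torch.ones(n, 1)) + bce(D(fake.detach()), torch.zeros(n, 1))
        d_loss.backward()
        opt_d.step()
        # ... while the generator is trained to make D label its outputs as real.
        opt_g.zero_grad()
        g_loss = bce(D(fake), torch.ones(n, 1))
        g_loss.backward()
        opt_g.step()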