NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
In the present manuscript the authors propose Greedy InfoMax, a greedy algorithm which allows unsupervised learning in deep neural networks with state-of-the-art performance. Specifically, the algorithm leverages implicit label information that is encoded temporally in the streaming data. Importantly, the present work rests on the shoulders and success of Contrastive Predictive Coding, but dispenses with end-to-end training entirely. Getting greedy layer-wise unsupervised learning to perform at such levels is quite impressive and will without doubt have an important impact on the community. The work is original, and the quality of the writing and figures is quite high.

What I would have liked to see is a more in-depth description of the precise data generation process. To understand approximately what was done, I had to go back two generations of papers. It would be useful to make the manuscript a little more self-contained in that regard.

When comparing to the results of Nokland and Eidnes, it might be worth noting that their similarity loss (although computed from labels) does not fully use the label information, but merely uses the information of which samples belong to the same class and which ones belong to different classes. I feel this is similar in spirit to the present approach, where this grouping is done in time, as sketched below. It might be worth mentioning this where Nokland and Eidnes is discussed.

The introductory paragraph on the implausibility of backprop could be moderated a little. It seems to be one of the open questions at the moment whether and how the brain backpropagates information. The stronger and less controversial point is rather the inaccessibility of label information.

Finally, this might not be feasible for computational reasons, but it would be nice to report error margins on the accuracy results in Tables 1 and 2 for different initial conditions of the network.

UPDATE: Thanks for the detailed answers to the reviewers. I think this is an important contribution. I am looking forward to the error bars and the improved description of the (non-trivial) data generation/preprocessing steps. Also, any insights and discussion on the underlying connections to the similarity matching loss case of Nokland et al. would be welcome.
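To make the analogy concrete: the contrastive objective inherited from Contrastive Predictive Coding (and, as I understand it, applied module-locally in Greedy InfoMax) is the InfoNCE loss, sketched here with illustrative notation rather than the authors' exact symbols,

$$ \mathcal{L}_k \;=\; -\,\mathbb{E}\!\left[ \log \frac{\exp\!\left(z_{t+k}^{\top} W_k\, c_t\right)}{\sum_{z_j \in X} \exp\!\left(z_j^{\top} W_k\, c_t\right)} \right], $$

where $z_{t+k}$ is the encoding of the sample $k$ steps ahead (the positive), $X$ contains $z_{t+k}$ together with encodings of randomly drawn negative samples, $c_t$ is the current context representation, and $W_k$ is a learned bilinear map. Only the grouping of samples into "temporally nearby" versus "randomly drawn" enters this loss, just as only same-class versus different-class membership enters the similarity loss of Nokland and Eidnes.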
Reviewer 2
Quality: I view this as a significant result. It is interesting that a competitive self-supervised learning scheme exists which relies on only limited backpropagation, and this understanding might be a step towards biologically plausible learning rules.

Clarity: I found this paper and its methods to be reasonably clear, but I had several questions about the techniques. In particular, I don't understand how the authors implemented an autoregressive model without backpropagation through time.

Originality and Significance: This work seems to be original and significant. Neural systems receive large amounts of unlabeled data and need to draw semantic features from them in a biologically plausible way, and this might be a step towards understanding how this is accomplished. In addition, these schemes might have benefits for distributed training.
Reviewer 3
The authors propose a biologically plausible self-supervised learning method. They build on previously published work and information theory, and they support their ideas with experiments.