Knowledge Distillation (KD) aims to have a low-capacity model, called the student, learn from a high-capacity one, termed the teacher, so that the student's performance improves. Previous KD methods typically train a student by minimizing a task-related loss and the KD loss simultaneously, using a loss weight hyper-parameter to balance the two terms. In this work, we propose to first transfer the backbone knowledge from a teacher to the student, and then learn only the task head of the student network. Such a training decomposition alleviates the need for a loss weight, which can be hard to define. This allows our method to be easily applied to different datasets or tasks with strong stability. Importantly, the decomposition enables the core of our method, Stage-by-Stage Knowledge Distillation (SSKD), which facilitates progressive feature mimicking from teacher to student. Extensive experiments on CIFAR-100 and ImageNet suggest that SSKD significantly narrows the performance gap between student and teacher, outperforming state-of-the-art approaches. We also demonstrate the generalization ability of SSKD on object detection on the COCO dataset. On both tasks, SSKD shows significant improvements.
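The contrast between the conventional weighted objective and the proposed decomposition can be sketched in pure Python. This is an illustrative toy with scalar stand-ins for losses and features, assuming nothing beyond the abstract; all function names and the squared-error losses are hypothetical placeholders, not the paper's implementation.

```python
# Toy sketch: single-stage KD with a loss weight vs. the two-stage
# decomposition described in the abstract. All names are illustrative
# assumptions; real losses would be, e.g., cross-entropy and a feature
# distance over tensors.

def task_loss(pred, target):
    # squared error as a stand-in for the task-related loss
    return (pred - target) ** 2

def kd_loss(student_feat, teacher_feat):
    # squared distance as a stand-in for a feature-mimicking KD loss
    return (student_feat - teacher_feat) ** 2

def single_stage_objective(pred, target, s_feat, t_feat, loss_weight):
    # conventional KD: one objective balancing two terms with a
    # hand-tuned loss weight hyper-parameter
    return task_loss(pred, target) + loss_weight * kd_loss(s_feat, t_feat)

def two_stage_objectives(pred, target, s_feat, t_feat):
    # decomposition: stage 1 mimics backbone features only (no loss
    # weight needed); stage 2 trains the task head on the task loss alone
    stage1 = kd_loss(s_feat, t_feat)
    stage2 = task_loss(pred, target)
    return stage1, stage2
```

The point of the decomposition is visible in the signatures: `two_stage_objectives` has no `loss_weight` argument, so there is no balancing hyper-parameter to tune per dataset or task.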