Knowledge Distillation (KD) aims to improve a low-capacity model, called the student, by having it learn from a high-capacity one, termed the teacher. Previous KD methods typically train the student by minimizing a task-related loss and the KD loss simultaneously, relying on a loss-weight hyper-parameter to balance the two terms. In this work, we propose to first transfer the backbone knowledge from the teacher to the student, and then learn only the task head of the student network. This training decomposition removes the need for the loss weight, which can be hard to tune, and allows our method to be applied to different datasets and tasks with strong stability. Importantly, the decomposition enables the core of our method, Stage-by-Stage Knowledge Distillation (SSKD), which facilitates progressive feature mimicking from teacher to student. Extensive experiments on CIFAR-100 and ImageNet show that SSKD significantly narrows the performance gap between student and teacher, outperforming state-of-the-art approaches. We also demonstrate the generalization ability of SSKD on object detection on the COCO dataset; on both tasks SSKD yields significant improvements.
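The sketch below (not the authors' released code) contrasts the conventional KD objective, which balances a task loss and a KD loss with a weight `lam`, against the decomposed scheme described in the abstract: first mimic the teacher's backbone features stage by stage, then train only the task head. The stage lists, loss choices (MSE feature mimicking, temperature-scaled KL), and freezing strategy are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def conventional_kd_step(student, teacher, x, y, lam=0.5, T=4.0):
    """One step of standard KD: task loss + lam-weighted KD loss.
    `lam` must be tuned per dataset/task, which the decomposition avoids."""
    s_logits = student(x)
    with torch.no_grad():
        t_logits = teacher(x)
    task_loss = F.cross_entropy(s_logits, y)
    kd_loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=1),
        F.softmax(t_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return task_loss + lam * kd_loss

def stagewise_feature_mimicking(student_stages, teacher_stages, loader, optimizer):
    """Phase 1: train the student backbone stage by stage to match the
    teacher's intermediate features. `*_stages` are assumed to be lists of
    nn.Module blocks whose outputs have matching shapes."""
    for k in range(len(student_stages)):
        for x, _ in loader:
            with torch.no_grad():
                t_feat = x
                for t_stage in teacher_stages[: k + 1]:
                    t_feat = t_stage(t_feat)
            s_feat = x
            for s_stage in student_stages[: k + 1]:
                s_feat = s_stage(s_feat)
            loss = F.mse_loss(s_feat, t_feat)  # feature-mimicking loss only
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def train_task_head(backbone, head, loader, optimizer):
    """Phase 2: freeze the distilled backbone and fit only the task head
    with the ordinary task loss; no loss weight is needed."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    for x, y in loader:
        loss = F.cross_entropy(head(backbone(x)), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In this reading, the two phases never share an objective, so no hyper-parameter balances the distillation and task terms.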