Example results on MS-COCO and NUS-WIDE "with" and "without" knowledge distillation using our proposed framework. The texts on the right are the top-3 predictions, where correct ones are shown in blue and incorrect in red. The green bounding boxes in images are the top-10 proposals detected by the weakly-supervised detection model.


Multi-label image classification (MLIC) is a fundamental but challenging task towards general visual understanding. Existing methods found the region-level cues (e.g., features from RoIs) can facilitate multi-label classification. Nevertheless, such methods usually require laborious object-level annotations (i.e., object labels and bounding boxes) for effective learning of the object-level visual features. In this paper, we propose a novel and efficient deep framework to boost multi-label classification by distilling knowledge from weakly-supervised detection task without bounding box annotations. Specifically, given the image-level annotations, (1) we first develop a weakly-supervised detection (WSD) model, and then (2) construct an end-to-end multi-label image classification framework augmented by a knowledge distillation module that guides the classification model by the WSD model according to the class-level predictions for the whole image and the object-level visual features for object RoIs. The WSD model is the teacher model and the classification model is the student model. After this cross-task knowledge distillation, the performance of the classification model is significantly improved and the efficiency is maintained since the WSD model can be safely discarded in the test phase. Extensive experiments on two large-scale datasets (MS-COCO and NUS-WIDE) show that our framework achieves superior performances over the state-of-the-art methods on both performance and efficiency.



Correct predictions are shown in blue and incorrect in red.



The proposed framework works with two steps: (1) we first develop a WSD model as teacher model (called T-WDet) with only image-level annotations y; (2) then the knowledge in T-WDet is distilled into the MLIC student model (called S-Cls) via feature-level distillation from RoIs and prediction-level distillation from the whole image, where the former is conducted by optimizing the loss in Eq. (3) while the latter is conducted by optimizing the losses in Eq. (5) and Eq. (10).

In this paper, we propose a novel and efficient deep framework to boost MLIC by distilling the unique knowledge from WSD into classification with only image-level annotations.

Specifically, our framework works with two steps:

Ablation Study

Overall Ablation


Region Proposal





The improvements of S-Cls model over each class/concept on MS-COCO (upper figure) and NUS-WIDE (lower figure) after knowledge distillation with our framework. "*k" indicates the number (divided by 1000) of images including this class/concept. The classes/concepts in horizontal axis are sorted by the number "*k" from large to small.


Please refer to the GitHub repository for more details.


Yongcheng Liu, Lu Sheng, Jing Shao, Junjie Yan, Shiming Xiang and Chunhong Pan, “Multi-Label Image Classification via Knowledge Distillation from Weakly-Supervised Detection”, in ACM International Conference on Multimedia (MM), 2018. [ACM DL] [arXiv]

  author = {Yongcheng Liu and    
            Lu Sheng and    
            Jing Shao and   
            Junjie Yan and   
            Shiming Xiang and   
            Chunhong Pan},   
  title = {Multi-Label Image Classification via Knowledge Distillation from Weakly-Supervised Detection},   
  booktitle = {ACM International Conference on Multimedia},    
  pages = {700--708},  
  year = {2018}