DMCP: Differentiable Markov Channel Pruning for Neural Networks

CVPR 2020

Yujie Wang
Quanquan Li
Junjie Yan



Abstract

Recent works imply that channel pruning can be regarded as searching for an optimal sub-structure of the unpruned network. However, existing works based on this observation require training and evaluating a large number of structures, which limits their application. In this paper, we propose a novel differentiable method for channel pruning, named Differentiable Markov Channel Pruning (DMCP), to efficiently search for the optimal sub-structure. Our method is differentiable and can be directly optimized by gradient descent with respect to the standard task loss and a budget regularization (e.g., a FLOPs constraint). In DMCP, we model channel pruning as a Markov process, in which each state represents retaining the corresponding channel during pruning, and transitions between states denote the pruning process. With the transitions optimized, the Markov process implicitly selects the proper number of channels in each layer. To validate the effectiveness of our method, we perform extensive experiments on ImageNet with ResNet and MobileNetV2. Results show that our method achieves consistent improvements over state-of-the-art pruning methods under various FLOPs settings.
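To make the Markov formulation concrete, below is a minimal sketch of how the marginal probability of retaining each channel can be derived from learnable architecture parameters. The sigmoid parameterization of the transitions, and the class and variable names, are our assumptions for illustration, not taken verbatim from the paper.

```python
import torch

# Channel k can only be retained if channels 1..k-1 are retained, so the
# marginal probability of keeping channel k is the product of transition
# probabilities p_1 * p_2 * ... * p_k (the Markov property).

class MarkovChannelGate(torch.nn.Module):
    def __init__(self, num_channels: int):
        super().__init__()
        # One learnable architecture parameter per channel transition
        # (assumed parameterization).
        self.alpha = torch.nn.Parameter(torch.zeros(num_channels))

    def marginal_keep_prob(self) -> torch.Tensor:
        # Transition probabilities p_k = sigmoid(alpha_k); the cumulative
        # product gives P(channel k is retained).
        p = torch.sigmoid(self.alpha)
        return torch.cumprod(p, dim=0)

    def expected_channels(self) -> torch.Tensor:
        # Expected number of retained channels. Because this is
        # differentiable, a FLOPs/budget regularizer built on top of it
        # can be optimized directly by gradient descent.
        return self.marginal_keep_prob().sum()

gate = MarkovChannelGate(num_channels=64)
print(gate.expected_channels())  # differentiable "soft" channel count
```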



Method

Framework

Channel pruning is first formulated as a Markov process parameterized by architecture parameters, which can be optimized in an end-to-end manner. The training procedure of DMCP is then divided into two stages: in stage 1, the unpruned network is updated by our proposed variant of the sandwich rule; in stage 2, the architecture parameters are wrapped into the unpruned network and updated, as shown in Figure (a). After the optimization, we propose two ways to sample the pruned network. A sketch of the two-stage schedule is given below.
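The following is a hedged sketch of the two training stages under our reading of the procedure; `sample_subnet` and `expected_flops` are hypothetical helpers, and the loss weighting is an assumption, not the paper's exact formulation.

```python
# Stage 1: train the unpruned network's weights with a sandwich-rule
# variant (largest, smallest, and randomly sampled widths each step).
def train_step_stage1(model, x, y, task_loss, weight_opt, sample_subnet):
    weight_opt.zero_grad()
    for mode in ["max", "min", "random", "random"]:
        # Each sub-network shares weights with the unpruned model, so
        # gradients from all widths accumulate into the same parameters.
        subnet = sample_subnet(model, mode)
        task_loss(subnet(x), y).backward()
    weight_opt.step()

# Stage 2: update only the architecture parameters with the task loss
# plus a budget regularizer pulling expected FLOPs toward the target.
def train_step_stage2(model, x, y, task_loss, arch_opt,
                      expected_flops, target_flops, lam=0.1):
    arch_opt.zero_grad()
    budget_reg = (expected_flops(model) / target_flops - 1.0) ** 2
    loss = task_loss(model(x), y) + lam * budget_reg
    loss.backward()
    arch_opt.step()
```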



Main Experimental Results on ImageNet

| Group | Model | FLOPs | Top-1 (%) | Δ Top-1 |
|---|---|---|---|---|
| MobileNetV2 | Uniform 1.0x | 300M | 72.3 | - |
| MobileNetV2 | Uniform 0.75x | 210M | 70.1 | -2.2 |
| MobileNetV2 | Uniform 0.5x | 97M | 64.8 | -7.5 |
| MobileNetV2 | Uniform 0.35x | 59M | 60.1 | -12.2 |
| MobileNetV2 | MetaPruning | 217M | 71.2 | -0.8 |
| MobileNetV2 | MetaPruning | 87M | 63.8 | -8.2 |
| MobileNetV2 | MetaPruning | 43M | 58.3 | -13.7 |
| MobileNetV2 | AMC | 211M | 70.8 | -1.0 |
| MobileNetV2 | AutoSlim* | 300M | 74.2 | +2.4 |
| MobileNetV2 | AutoSlim* | 211M | 73.0 | +1.2 |
| MobileNetV2 | DMCP | 300M | 73.5 | +1.2 |
| MobileNetV2 | DMCP | 211M | 72.2 | -0.1 |
| MobileNetV2 | DMCP | 97M | 67.0 | -5.3 |
| MobileNetV2 | DMCP | 87M | 66.1 | -6.2 |
| MobileNetV2 | DMCP | 59M | 62.7 | -9.6 |
| MobileNetV2 | DMCP | 43M | 59.1 | -13.2 |
| MobileNetV2 | DMCP* | 300M | 74.6 | +2.3 |
| MobileNetV2 | DMCP* | 211M | 73.5 | +1.2 |
| ResNet-18 | Uniform 1.0x | 1.8G | 70.1 | - |
| ResNet-18 | FPGM | 1.04G | 68.4 | -1.9 |
| ResNet-18 | DMCP | 1.04G | 69.2 | -0.9 |
| ResNet-50 | Uniform 1.0x | 4.1G | 76.6 | - |
| ResNet-50 | Uniform 0.85x | 3.0G | 75.3 | -1.3 |
| ResNet-50 | Uniform 0.75x | 2.3G | 74.6 | -2.0 |
| ResNet-50 | Uniform 0.5x | 1.1G | 71.9 | -4.7 |
| ResNet-50 | Uniform 0.25x | 278M | 63.5 | -13.1 |
| ResNet-50 | FPGM | 2.4G | 75.6 | -0.6 |
| ResNet-50 | SFP | 2.4G | 74.6 | -2.0 |
| ResNet-50 | MetaPruning | 3.0G | 76.2 | -0.4 |
| ResNet-50 | MetaPruning | 2.3G | 75.4 | -1.2 |
| ResNet-50 | MetaPruning | 1.1G | 73.4 | -3.2 |
| ResNet-50 | AutoSlim* | 3.0G | 76.0 | -0.6 |
| ResNet-50 | AutoSlim* | 2.0G | 75.6 | -1.0 |
| ResNet-50 | AutoSlim* | 1.1G | 74.0 | -2.6 |
| ResNet-50 | DMCP | 2.8G | 76.7 | +0.1 |
| ResNet-50 | DMCP | 2.2G | 76.2 | -0.4 |
| ResNet-50 | DMCP | 1.1G | 74.0 | -2.6 |
| ResNet-50 | DMCP | 278M | 66.4 | -10.0 |

We compare our method with various pruning methods, including the reinforcement-learning-based AMC, the evolution-based MetaPruning, the one-shot method AutoSlim, and the traditional channel pruning methods SFP and FPGM. All methods are evaluated on MobileNetV2, ResNet-18, and ResNet-50; for each model type, we train a set of baseline models for comparison. The table shows that our method outperforms all other methods under the same settings, demonstrating its superiority. Note that AMC, MetaPruning, and our method train the pruned model from scratch with the standard hard-label loss.

AutoSlim, in contrast, adopts an in-place distillation method in which the pruned network shares weights with the unpruned network and mimics its output; a sketch is given below. Groups marked with * indicate that the pruned model is trained with in-place distillation. For a fair comparison with AutoSlim, we also train our pruned model with the same method. Results show that this training scheme can further boost performance, and our method surpasses AutoSlim at the various FLOPs settings.
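A minimal sketch of an in-place distillation step under our assumptions (the exact loss and schedule used by AutoSlim may differ): the full-width network is trained on hard labels, and the weight-shared pruned sub-network is trained to match its softened output.

```python
import torch
import torch.nn.functional as F

def inplace_distill_step(unpruned, pruned, x, y, optimizer):
    """One step of in-place distillation; `pruned` shares weights
    with `unpruned` (both names are placeholders for illustration)."""
    optimizer.zero_grad()
    # Full-width network is supervised by the hard labels.
    full_logits = unpruned(x)
    F.cross_entropy(full_logits, y).backward()
    # Its detached predictions serve as soft targets for the sub-network.
    soft_targets = F.softmax(full_logits.detach(), dim=1)
    kd_loss = F.kl_div(F.log_softmax(pruned(x), dim=1),
                       soft_targets, reduction="batchmean")
    kd_loss.backward()
    optimizer.step()
```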