DMCP: Differentiable Markov Channel Pruning for Neural Networks

CVPR 2020

Yujie Wang
Quanquan Li
Junjie Yan



Abstract

Recent works imply that channel pruning can be regarded as searching for an optimal sub-structure of the unpruned network. However, existing works based on this observation require training and evaluating a large number of structures, which limits their application. In this paper, we propose a novel differentiable method for channel pruning, named Differentiable Markov Channel Pruning (DMCP), to efficiently search for the optimal sub-structure. Our method is differentiable and can be directly optimized by gradient descent with respect to the standard task loss and a budget regularization (e.g., a FLOPs constraint). In DMCP, we model channel pruning as a Markov process, in which each state represents retaining the corresponding channel during pruning, and transitions between states denote the pruning process. With the transitions optimized, the Markov process implicitly selects the proper number of channels in each layer. To validate the effectiveness of our method, we perform extensive experiments on ImageNet with ResNet and MobileNetV2. Results show that our method achieves consistent improvements over state-of-the-art pruning methods under various FLOPs settings.
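To make the Markov formulation concrete, below is a minimal sketch of how the marginal probability of retaining each channel can be derived from learnable architecture parameters. The sigmoid parameterization of the transitions, and the class and variable names, are our assumptions for illustration, not taken verbatim from the paper.

```python
import torch

# Channel k can only be retained if channels 1..k-1 are retained, so the
# marginal probability of keeping channel k is the product of transition
# probabilities p_1 * p_2 * ... * p_k (the Markov property).

class MarkovChannelGate(torch.nn.Module):
    def __init__(self, num_channels: int):
        super().__init__()
        # One learnable architecture parameter per channel transition
        # (assumed parameterization).
        self.alpha = torch.nn.Parameter(torch.zeros(num_channels))

    def marginal_keep_prob(self) -> torch.Tensor:
        # Transition probabilities p_k = sigmoid(alpha_k); the cumulative
        # product gives P(channel k is retained).
        p = torch.sigmoid(self.alpha)
        return torch.cumprod(p, dim=0)

    def expected_channels(self) -> torch.Tensor:
        # Expected number of retained channels. Because this is
        # differentiable, a FLOPs/budget regularizer built on top of it
        # can be optimized directly by gradient descent.
        return self.marginal_keep_prob().sum()

gate = MarkovChannelGate(num_channels=64)
print(gate.expected_channels())  # differentiable "soft" channel count
```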



Method

Framework

Channel pruning is first formulated as a Markov process parameterized by architecture parameters, which can be optimized in an end-to-end manner. The training procedure of DMCP is then divided into two stages: in stage 1, the unpruned network is updated by our proposed variant of the sandwich rule; in stage 2, the architecture parameters are wrapped into the unpruned network and updated, as shown in Figure (a). After the optimization, we propose two ways to sample the pruned network. A sketch of the two-stage schedule is given below.
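The following is a hedged sketch of the two training stages under our reading of the procedure; `sample_subnet` and `expected_flops` are hypothetical helpers, and the loss weighting is an assumption, not the paper's exact formulation.

```python
# Stage 1: train the unpruned network's weights with a sandwich-rule
# variant (largest, smallest, and randomly sampled widths each step).
def train_step_stage1(model, x, y, task_loss, weight_opt, sample_subnet):
    weight_opt.zero_grad()
    for mode in ["max", "min", "random", "random"]:
        # Each sub-network shares weights with the unpruned model, so
        # gradients from all widths accumulate into the same parameters.
        subnet = sample_subnet(model, mode)
        task_loss(subnet(x), y).backward()
    weight_opt.step()

# Stage 2: update only the architecture parameters with the task loss
# plus a budget regularizer pulling expected FLOPs toward the target.
def train_step_stage2(model, x, y, task_loss, arch_opt,
                      expected_flops, target_flops, lam=0.1):
    arch_opt.zero_grad()
    budget_reg = (expected_flops(model) / target_flops - 1.0) ** 2
    loss = task_loss(model(x), y) + lam * budget_reg
    loss.backward()
    arch_opt.step()
```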



Main Experimental Results on ImageNet

| Group | Model | FLOPs | Top-1 (%) | Δ Top-1 |
|---|---|---|---|---|
| MobileNetV2 | Uniform 1.0x | 300M | 72.3 | - |
| MobileNetV2 | Uniform 0.75x | 210M | 70.1 | -2.2 |
| MobileNetV2 | Uniform 0.5x | 97M | 64.8 | -7.5 |
| MobileNetV2 | Uniform 0.35x | 59M | 60.1 | -12.2 |
| MobileNetV2 | MetaPruning | 217M | 71.2 | -0.8 |
| MobileNetV2 | MetaPruning | 87M | 63.8 | -8.2 |
| MobileNetV2 | MetaPruning | 43M | 58.3 | -13.7 |
| MobileNetV2 | AMC | 211M | 70.8 | -1.0 |
| MobileNetV2 | AutoSlim* | 300M | 74.2 | +2.4 |
| MobileNetV2 | AutoSlim* | 211M | 73.0 | +1.2 |
| MobileNetV2 | DMCP | 300M | 73.5 | +1.2 |
| MobileNetV2 | DMCP | 211M | 72.2 | -0.1 |
| MobileNetV2 | DMCP | 97M | 67.0 | -5.3 |
| MobileNetV2 | DMCP | 87M | 66.1 | -6.2 |
| MobileNetV2 | DMCP | 59M | 62.7 | -9.6 |
| MobileNetV2 | DMCP | 43M | 59.1 | -13.2 |
| MobileNetV2 | DMCP* | 300M | 74.6 | +2.3 |
| MobileNetV2 | DMCP* | 211M | 73.5 | +1.2 |
| ResNet-18 | Uniform 1.0x | 1.8G | 70.1 | - |
| ResNet-18 | FPGM | 1.04G | 68.4 | -1.9 |
| ResNet-18 | DMCP | 1.04G | 69.2 | -0.9 |
| ResNet-50 | Uniform 1.0x | 4.1G | 76.6 | - |
| ResNet-50 | Uniform 0.85x | 3.0G | 75.3 | -1.3 |
| ResNet-50 | Uniform 0.75x | 2.3G | 74.6 | -2.0 |
| ResNet-50 | Uniform 0.5x | 1.1G | 71.9 | -4.7 |
| ResNet-50 | Uniform 0.25x | 278M | 63.5 | -13.1 |
| ResNet-50 | FPGM | 2.4G | 75.6 | -0.6 |
| ResNet-50 | SFP | 2.4G | 74.6 | -2.0 |
| ResNet-50 | MetaPruning | 3.0G | 76.2 | -0.4 |
| ResNet-50 | MetaPruning | 2.3G | 75.4 | -1.2 |
| ResNet-50 | MetaPruning | 1.1G | 73.4 | -3.2 |
| ResNet-50 | AutoSlim* | 3.0G | 76.0 | -0.6 |
| ResNet-50 | AutoSlim* | 2.0G | 75.6 | -1.0 |
| ResNet-50 | AutoSlim* | 1.1G | 74.0 | -2.6 |
| ResNet-50 | DMCP | 2.8G | 76.7 | +0.1 |
| ResNet-50 | DMCP | 2.2G | 76.2 | -0.4 |
| ResNet-50 | DMCP | 1.1G | 74.0 | -2.6 |
| ResNet-50 | DMCP | 278M | 66.4 | -10.0 |

We compare our method with various pruning methods, including the reinforcement-learning-based AMC, the evolution-based MetaPruning, the one-shot method AutoSlim, and the traditional channel pruning methods SFP and FPGM. All methods are evaluated on MobileNetV2, ResNet-18, and ResNet-50; for each model type, we train a set of baseline models for comparison. The table shows that our method outperforms all other methods under the same settings, demonstrating its superiority. Note that AMC, MetaPruning, and our method train the pruned model from scratch with the standard hard-label loss.

AutoSlim, in contrast, adopts an in-place distillation method in which the pruned network shares weights with the unpruned network and mimics its output; a sketch is given below. Groups marked with * indicate that the pruned model is trained with in-place distillation. For a fair comparison with AutoSlim, we also train our pruned model with the same method. Results show that this training scheme can further boost performance, and our method surpasses AutoSlim at the various FLOPs settings.
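A minimal sketch of an in-place distillation step under our assumptions (the exact loss and schedule used by AutoSlim may differ): the full-width network is trained on hard labels, and the weight-shared pruned sub-network is trained to match its softened output.

```python
import torch
import torch.nn.functional as F

def inplace_distill_step(unpruned, pruned, x, y, optimizer):
    """One step of in-place distillation; `pruned` shares weights
    with `unpruned` (both names are placeholders for illustration)."""
    optimizer.zero_grad()
    # Full-width network is supervised by the hard labels.
    full_logits = unpruned(x)
    F.cross_entropy(full_logits, y).backward()
    # Its detached predictions serve as soft targets for the sub-network.
    soft_targets = F.softmax(full_logits.detach(), dim=1)
    kd_loss = F.kl_div(F.log_softmax(pruned(x), dim=1),
                       soft_targets, reduction="batchmean")
    kd_loss.backward()
    optimizer.step()
```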