Cross-supervised Crowd Counting via Multi-scale Channel Attention

Authors

  • Kexin Yan Shenyang Jianzhu University, School of Computer Science and Engineering, Liaoning, China; Liaoning Province Big Data Management and Analysis Laboratory of Urban Construction, Liaoning, China; Shenyang Branch of National Special Computer Engineering Technology Research Center, Liaoning, China
  • Fangjun Luan Shenyang Jianzhu University, School of Computer Science and Engineering, Liaoning, China; Liaoning Province Big Data Management and Analysis Laboratory of Urban Construction, Liaoning, China; Shenyang Branch of National Special Computer Engineering Technology Research Center, Liaoning, China
  • Shuai Yuan Shenyang Jianzhu University, School of Computer Science and Engineering, Liaoning, China; Liaoning Province Big Data Management and Analysis Laboratory of Urban Construction, Liaoning, China; Shenyang Branch of National Special Computer Engineering Technology Research Center, Liaoning, China
  • Guoqi Liu Shenyang Jianzhu University, School of Computer Science and Engineering, Liaoning, China; Liaoning Province Big Data Management and Analysis Laboratory of Urban Construction, Liaoning, China; Shenyang Branch of National Special Computer Engineering Technology Research Center, Liaoning, China

DOI:

https://doi.org/10.5755/j01.itc.53.3.35805

Keywords:

Crowd counting, Multi-scale, Channel attention, Transformer, Computer vision

Abstract

Crowd images exhibit large scale variation, and people in high-density regions overlap and occlude one another. Traditional CNNs with fixed-size convolution kernels, and Transformers that lack 2D locality and channel adaptation, struggle to cope with these challenges. While Transformers have a global receptive field suited to long-sequence tasks, CNNs exhibit better generalization and 2D locality. To combine the advantages of both approaches, this paper proposes a dual-branch multi-scale attention network (DBMSA-Net). First, we propose a multi-scale channel attention convolution module to extract features at different scales while enhancing channel adaptation. Next, local features are augmented using a feed-forward neural network that is better suited to visual tasks. Then, an efficient lightweight multi-scale regression head is employed to predict density maps. Finally, progressive cross-head supervision is introduced as a loss function to dynamically supervise instance-label noise and mitigate its effect. Extensive experiments on three crowd counting datasets (ShanghaiTech Part A, ShanghaiTech Part B, UCF-QNRF) validate the effectiveness of the proposed method, and the results show that DBMSA-Net outperforms state-of-the-art methods.
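
The general idea of combining multi-scale feature extraction with channel attention can be illustrated with a minimal NumPy sketch. This is not the authors' DBMSA-Net module: the abstract does not specify its design, so everything below is an assumption made for illustration. A mean (box) filter stands in for learned convolutions at each kernel size, a squeeze-and-excitation style gate with random weights stands in for the learned channel attention, and the function names (`box_filter`, `channel_attention`, `multi_scale_channel_attention`) are hypothetical.

```python
import numpy as np


def box_filter(x, k):
    """Zero-padded k x k mean filter over a (C, H, W) array.

    Assumption: a stand-in for a learned k x k convolution (k must be odd).
    """
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, w = x.shape
    out = np.zeros_like(x, dtype=float)
    # Sum the k*k shifted views, then divide to get the local mean.
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / (k * k)


def channel_attention(feat, w1, w2):
    """Squeeze-and-excitation style gate (an assumed design, not from the paper)."""
    squeeze = feat.mean(axis=(1, 2))               # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)         # ReLU bottleneck -> (C // r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid weights -> (C,)
    return feat * gate[:, None, None], gate


def multi_scale_channel_attention(x, kernel_sizes=(1, 3, 5), reduction=2, seed=0):
    """Sum channel-reweighted features from several receptive-field sizes."""
    rng = np.random.default_rng(seed)
    c = x.shape[0]
    out = np.zeros_like(x, dtype=float)
    gates = []
    for k in kernel_sizes:
        # Random weights replace trained parameters in this sketch.
        w1 = rng.standard_normal((c // reduction, c)) * 0.1
        w2 = rng.standard_normal((c, c // reduction)) * 0.1
        branch, gate = channel_attention(box_filter(x, k), w1, w2)
        out += branch
        gates.append(gate)
    return out, np.stack(gates)


# Toy feature map: 4 channels, 8 x 8 spatial grid.
x = np.random.default_rng(1).random((4, 8, 8))
y, gates = multi_scale_channel_attention(x)
print(y.shape, gates.shape)
```

Each branch sees the same input at a different receptive-field size and learns (here, randomly draws) its own per-channel weights, so channels that respond well at one scale can be emphasized independently of the others. In a trained network the filters and gate weights would be learned end to end.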

Published

2024-09-25

Section

Articles