We present a robust facial landmark detection network based on multiscale attention residual blocks (MARBNet) for effectively predicting facial landmarks. MARBNet consists of three modules. First, the coarse feature extraction module obtains coarse features through convolution, batch normalization, ReLU activation, and max pooling. The fine feature extraction module is composed of 33 multiscale attention residual blocks (MARBs). Each MARB consists of a 1x1 convolution layer, a 3x3 convolution layer, a 1x1 convolution layer, two multiscale convolution modules (MulRes), and a channel attention module (CAM). MulRes extracts complementary features at different scales, gathering more feature information under different receptive fields and avoiding excessive loss of key information from the input image. CAM lets the network attend more to high-frequency information across channels, effectively preventing information loss and thereby improving facial landmark detection. The output module consists of two 1x1 convolution layers: one outputs the landmark heatmap scores and landmark coordinate offsets, and the other outputs the nearest-neighbor landmark offsets. Experimental results on the WFLW and 300W datasets show that our method outperforms existing algorithms on the normalized mean square error metric.
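To illustrate the channel attention idea described above, here is a minimal NumPy sketch of a squeeze-and-excitation style channel attention module. The abstract does not specify CAM's internal design, so the pooling choice, reduction ratio `r`, and weight names `w1`/`w2` are assumptions for illustration only, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Reweight channels of a (C, H, W) feature map.

    Hypothetical squeeze-and-excitation style CAM:
    global-average-pool each channel, pass through two small
    linear layers, squash to (0, 1), and scale the channels.
    """
    squeeze = feat.mean(axis=(1, 2))                      # (C,) per-channel statistic
    excite = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))  # (C,) attention weights
    return feat * excite[:, None, None]                   # broadcast over H, W

# Toy usage with random weights (C channels, reduction ratio r)
rng = np.random.default_rng(0)
C, r = 8, 2
feat = rng.standard_normal((C, 4, 4))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
out = channel_attention(feat, w1, w2)
```

The output has the same shape as the input, so the module can be dropped into a residual block without changing the surrounding layer shapes.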
We present a two-dimensional human pose estimation network constrained by human structure information (HSINet). HSINet effectively fuses features of different scales and explicitly integrates human structure information to improve the precision of key point localization. The architecture comprises three modules: the feature extraction module, the encoding module, and the decoding module. The feature extraction module adopts the architecture of High-Resolution Net (HRNet); in contrast to HRNet, we remove redundant layers and strengthen the combination of global and local features with a Gated Attention Unit (GAU). The encoding module encodes the feature maps produced by the feature extraction module: each feature map corresponds to one joint and is represented by two feature vectors along the x and y axes, and graph convolution during encoding introduces constraints based on human structure information. The decoding module then decodes these encoded feature maps into precise key point coordinates. Experimental results on the COCO dataset show that our method improves key point detection precision while effectively reducing the number of parameters.
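The encode/decode stages can be sketched as follows: a per-joint 2-D feature map is collapsed into one vector per axis, and the key point coordinate is recovered from each vector's peak. This is a minimal illustrative NumPy sketch assuming marginal sums and argmax decoding; HSINet's actual encoding (including the graph convolution constraint) is not specified at this level of detail.

```python
import numpy as np

def encode_heatmap(heatmap):
    """Collapse a 2-D joint feature map into x- and y-axis vectors.

    Assumed encoding: marginal sums over rows give per-column (x)
    scores, and over columns give per-row (y) scores.
    """
    vec_x = heatmap.sum(axis=0)  # (W,) scores along the x axis
    vec_y = heatmap.sum(axis=1)  # (H,) scores along the y axis
    return vec_x, vec_y

def decode_coords(vec_x, vec_y):
    """Decode a key point as the argmax of each axis vector."""
    return int(np.argmax(vec_x)), int(np.argmax(vec_y))

# Toy feature map with a single peak at (x=5, y=3)
hm = np.zeros((8, 8))
hm[3, 5] = 1.0
vx, vy = encode_heatmap(hm)
x, y = decode_coords(vx, vy)
print((x, y))  # (5, 3)
```

Representing each joint as two 1-D vectors instead of a full 2-D heatmap is what allows the later layers to be smaller, consistent with the parameter reduction reported above.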