Estimating human body pose and shape from a single-view image has been highly successful, but most existing methods require models with large numbers of parameters that are difficult to run on low-performance devices. Lightweight networks struggle to extract sufficient information for human pose and shape estimation, making accurate prediction challenging. In this paper, we propose a lightweight model for predicting the shape and pose parameters of a parametric human body model. Our method comprises a lightweight multi-stage encoder based on Lite-HRNet and ShuffleNet, and a decoder composed of cascaded MLPs that follow the human kinematic tree. It achieves performance comparable to HMR while being only one-ninth the size, and runs at 19.2 inferences per second on the Qualcomm Snapdragon 888+.
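The decoder can be pictured as a chain of small MLPs ordered along the kinematic tree, each joint conditioned on its parent's prediction. Below is a minimal PyTorch sketch of this idea; the joint layout, feature size, rotation parameterization, and MLP widths are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Hypothetical parent table for a small 6-joint chain (index -> parent, -1 = root).
PARENTS = [-1, 0, 1, 2, 1, 1]
FEAT_DIM, ROT_DIM = 256, 6  # encoder feature size; 6D rotation per joint (assumed)

class KinematicDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlps = nn.ModuleList()
        for parent in PARENTS:
            # Each joint's MLP sees the global image feature plus its
            # parent's predicted rotation (the root sees only the feature).
            in_dim = FEAT_DIM + (0 if parent < 0 else ROT_DIM)
            self.mlps.append(nn.Sequential(
                nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, ROT_DIM)))

    def forward(self, feat):
        rots = []
        for j, parent in enumerate(PARENTS):
            x = feat if parent < 0 else torch.cat([feat, rots[parent]], dim=-1)
            rots.append(self.mlps[j](x))
        return torch.stack(rots, dim=1)  # (batch, joints, ROT_DIM)

decoder = KinematicDecoder()
print(decoder(torch.randn(2, FEAT_DIM)).shape)  # torch.Size([2, 6, 6])
```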
KEYWORDS: Feature extraction, Image restoration, Visual process modeling, Image compression, Object detection, Machine vision, Education and training, Semantics, Image segmentation, Human vision and color perception
Latent representation features in deep learning (DL) exhibit excellent potential for visual data applications. For example, in traffic monitoring and video surveillance, the features simultaneously serve image analysis for machine vision and image reconstruction for human viewing. However, existing deep features that serve both machine and human receivers are typically combinations of separate, task-specific features. Because these features are extracted from different branches of collaboration frameworks, the inherent relations between machine and human vision remain insufficiently explored. Therefore, to obtain a single set of representative, generic features, we propose a dynamic groupwise splitting network that explores and extracts generic features for the two receivers based on image content. First, we analyze the characteristics of the latent features and adopt intermediate features as the base features. Then, a feature classification and transformation mechanism based on image content is proposed to enhance the base features for subsequent image reconstruction and analysis. Consequently, an end-to-end model with multi-model cascading and multistage training realizes both machine and human vision tasks. Extensive experiments show that our human–machine vision collaboration framework has high practical value and performance.
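One way to picture the content-based groupwise splitting is a small gating network that scores each channel of the base features and softly routes channels toward a human-vision (reconstruction) branch or a machine-vision (analysis) branch. The sketch below is a hedged illustration of that idea; the gating design and layer sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GroupwiseSplit(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Per-channel gate conditioned on global image content (pooled features).
        self.gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.recon_branch = nn.Conv2d(channels, channels, 3, padding=1)
        self.task_branch = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, base):                      # base: (B, C, H, W)
        g = self.gate(base.mean(dim=(2, 3)))      # (B, C) soft group assignment
        g = g[:, :, None, None]
        recon_feat = self.recon_branch(base * g)        # human-vision group
        task_feat = self.task_branch(base * (1.0 - g))  # machine-vision group
        return recon_feat, task_feat

m = GroupwiseSplit()
r, t = m(torch.randn(2, 64, 32, 32))
print(r.shape, t.shape)
```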
Objective quality assessment plays a vital role in the evaluation and optimization of panoramic video. However, most current methods consider only the structural distortion caused by the projection format and ignore the effect of clarity on quality evaluation. For this reason, we propose a new objective quality assessment method for panoramic video. First, the source image and the distorted image are down-sampled to obtain five sets of images at different scales. Second, WS-SSIM is calculated at each scale. Finally, according to the degree of influence of each scale on subjective evaluation, different coefficients are assigned to the corresponding WS-SSIM scores, and the overall score is computed as their weighted sum. Experiments on the database established in our laboratory demonstrate its effectiveness through comparison with existing methods.
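The pipeline reduces to: down-sample both images over five scales, compute a spherically weighted SSIM at each scale, and combine the scores with per-scale weights. Below is a simplified sketch; the SSIM implementation is basic and the per-scale weights are illustrative assumptions (the paper derives them from subjective data).

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim_map(x, y, C1=(0.01 * 255) ** 2, C2=(0.03 * 255) ** 2, win=8):
    mx, my = uniform_filter(x, win), uniform_filter(y, win)
    vx = uniform_filter(x * x, win) - mx * mx
    vy = uniform_filter(y * y, win) - my * my
    cxy = uniform_filter(x * y, win) - mx * my
    return ((2 * mx * my + C1) * (2 * cxy + C2)) / ((mx * mx + my * my + C1) * (vx + vy + C2))

def ws_ssim(ref, dist):
    # Spherical weights for an equirectangular frame: cos(latitude) per row.
    h = ref.shape[0]
    lat = (np.arange(h) + 0.5) / h * np.pi - np.pi / 2
    w = np.cos(lat)[:, None] * np.ones_like(ref)
    s = ssim_map(ref.astype(np.float64), dist.astype(np.float64))
    return float((s * w).sum() / w.sum())

def multiscale_ws_ssim(ref, dist, weights=(0.05, 0.1, 0.2, 0.3, 0.35)):
    score = 0.0
    for wk in weights:                            # five scales; weights assumed
        score += wk * ws_ssim(ref, dist)
        ref, dist = ref[::2, ::2], dist[::2, ::2]  # 2x down-sampling per scale
    return score

ref = np.random.rand(128, 256) * 255
dist = ref + np.random.randn(128, 256) * 5
print(multiscale_ws_ssim(ref, dist))
```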
Virtual reality (VR) refers to a technology that allows people to experience a virtual world in an artificial environment. As one of the most important forms of VR media content, panoramic video can provide viewers with 360-degree free viewing angles. However, the acquisition, stitching, transmission, and playback of panoramic video may damage video quality and seriously affect the viewer's quality of experience. Therefore, how to improve display quality and provide users with a better visual experience has become a hot topic in this field. When watching videos, people pay attention to salient areas, especially in panoramic videos, where viewers can freely choose regions of interest. Considering this characteristic, saliency information should be utilized when performing quality assessment. In this paper, we use two cascaded networks to compute the quality score of panoramic video without a reference video. First, a saliency prediction network computes the saliency map of the image, and the patches with higher saliency are selected from this map. In this way, we can exclude areas of the panoramic image that contribute nothing positive to the quality assessment task. Then, we input the selected salient patches into the quality assessment network and obtain the final image quality score. Experimental results show that, owing to its special network structure, the proposed method achieves more accurate quality scores for panoramic videos than state-of-the-art works.
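The two-stage pipeline can be sketched as: a saliency map ranks fixed-size patches, the most salient patches are cropped, and a no-reference quality network scores them, with the final score averaged over the selected patches. In the minimal sketch below, the patch size, the number of selected patches, and the stand-in quality network are assumptions for illustration.

```python
import torch
import torch.nn as nn

PATCH, TOP_K = 32, 8

def select_salient_patches(frame, saliency):          # frame: (C, H, W)
    C, H, W = frame.shape
    patches, scores = [], []
    for y in range(0, H - PATCH + 1, PATCH):
        for x in range(0, W - PATCH + 1, PATCH):
            patches.append(frame[:, y:y + PATCH, x:x + PATCH])
            scores.append(saliency[y:y + PATCH, x:x + PATCH].mean())
    order = torch.argsort(torch.stack(scores), descending=True)[:TOP_K]
    return torch.stack([patches[int(i)] for i in order])

# Stand-in quality network (the paper trains a dedicated no-reference model).
quality_net = nn.Sequential(nn.Conv2d(3, 8, 3, 2, 1), nn.ReLU(),
                            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

frame = torch.rand(3, 128, 256)
saliency = torch.rand(128, 256)                       # from a saliency network
patches = select_salient_patches(frame, saliency)     # (TOP_K, 3, PATCH, PATCH)
print(quality_net(patches).mean().item())             # predicted quality score
```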
Collaborative intelligence is a new strategy for deploying deep neural network models on AI-based mobile devices: part of the model runs on the device to extract features, and the rest runs in the cloud. In this case, feature data rather than the raw image is transmitted to the cloud, and the uploaded features must generalize well enough to complete multiple tasks. To this end, we design an encoder-decoder network that produces intermediate deep features of an image and propose a method that enables these features to serve different tasks. Finally, we apply a lossy compression method to the intermediate deep features to improve transmission efficiency. Experimental results show that the features extracted by our network can support input reconstruction and object detection simultaneously. Moreover, with the deep-feature compression method proposed in our work, the reconstructed images are good both visually and in quantitative quality metrics, and object detection accuracy remains high.
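A hedged sketch of the split: a device-side encoder produces intermediate features, which are quantized (here, simple uniform 8-bit quantization stands in for the paper's lossy feature codec) before upload; the cloud side dequantizes them and feeds the shared features to both a reconstruction decoder and an analysis head. All module shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                        nn.Conv2d(16, 32, 3, 2, 1))           # runs on device
decoder = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),
                        nn.ConvTranspose2d(16, 3, 4, 2, 1))   # cloud: reconstruction
task_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(32, 10))                  # cloud: analysis

def compress(feat):
    # Uniform 8-bit quantization per tensor; a real codec would add entropy coding.
    lo, hi = feat.min(), feat.max()
    q = torch.round((feat - lo) / (hi - lo + 1e-8) * 255).to(torch.uint8)
    return q, lo, hi

def decompress(q, lo, hi):
    return q.float() / 255 * (hi - lo) + lo

image = torch.rand(1, 3, 64, 64)
feat = encoder(image)
payload = compress(feat)               # transmitted instead of the raw image
feat_hat = decompress(*payload)
print(decoder(feat_hat).shape, task_head(feat_hat).shape)
```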
Video quality assessment (VQA) technology has attracted a lot of attention in recent years due to the increasing demand for video streaming services. Existing VQA methods are designed to predict video quality in terms of the mean opinion score (MOS) calibrated by humans in subjective experiments. However, they cannot predict the satisfied user ratio (SUR) of an aggregated viewer group. Furthermore, they provide little guidance for video coding parameter selection, e.g., the quantization parameter (QP) of a set of consecutive frames, in practical video streaming services. To overcome these shortcomings, the just-noticeable-difference (JND) based VQA methodology has been proposed as an alternative. It has been observed experimentally that the JND location is a normally distributed random variable. In this work, we explain this distribution by proposing a user model that takes both subject variability and content variability into account. This model is built upon a user's capability to discern the quality difference between video clips encoded with different QPs. Moreover, it analyzes video content characteristics to account for inter-content variability. The proposed user model is validated on the data collected in the VideoSet. It is demonstrated that the model can flexibly predict the SUR distribution of a specific user group.
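The statistical idea admits a compact illustration: if the first JND location on the QP axis is normally distributed across a viewer group, the satisfied user ratio at a coding QP is the probability that a viewer's JND has not yet been reached. The mean and spread below are illustrative values, not fitted VideoSet parameters.

```python
from scipy.stats import norm

def sur(qp, jnd_mean, jnd_std):
    """Fraction of viewers who cannot yet notice degradation at this QP."""
    return 1.0 - norm.cdf(qp, loc=jnd_mean, scale=jnd_std)

# Example: a content whose first JND is centered at QP 34 with spread 3
# (subject and content variability combined).
for qp in (28, 32, 36, 40):
    print(qp, round(sur(qp, jnd_mean=34.0, jnd_std=3.0), 3))
```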
Video-based facial expression recognition has become increasingly important for many real-world applications. Although numerous efforts have been made for single sequences, balancing the complex distribution of intra- and interclass variations between sequences remains a great difficulty in this area. We propose the adaptive (N+M)-tuplet clusters loss function and optimize it jointly with the softmax loss during the training phase. The variations introduced by personal attributes are alleviated using similarity measurements of multiple samples in the feature space, with far fewer comparisons than conventional deep metric learning approaches, which enables metric calculations for large-scale data applications (e.g., videos). Both spatial and temporal relations are explored by a unified framework consisting of an Inception-ResNet network with long short-term memory and a two-branch fully connected layer structure. Our proposed method has been evaluated on three well-known databases, and the experimental results show that it outperforms many state-of-the-art approaches.
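The tuplet-cluster idea can be sketched as follows: M positive embeddings form a cluster whose center attracts the positives and repels N negatives, requiring far fewer comparisons than exhaustive triplet mining. The loss below is a deliberate simplification of the paper's adaptive (N+M)-tuplet clusters loss, written only to convey the structure; margins and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def tuplet_cluster_loss(positives, negatives, margin=0.5):
    """positives: (M, D) same-class embeddings; negatives: (N, D)."""
    center = positives.mean(dim=0, keepdim=True)     # cluster center
    pos_d = (positives - center).pow(2).sum(dim=1)   # pull positives inward
    neg_d = (negatives - center).pow(2).sum(dim=1)   # push negatives away
    return pos_d.mean() + F.relu(margin - neg_d).mean()

pos = F.normalize(torch.randn(4, 128), dim=1)   # M = 4 positives
neg = F.normalize(torch.randn(6, 128), dim=1)   # N = 6 negatives
print(tuplet_cluster_loss(pos, neg).item())     # combined with softmax in training
```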
KEYWORDS: 3D image processing, Image compression, Statistical analysis, Video coding, 3D displays, Visualization, Error analysis, Computer programming, Prototyping, Imaging systems
Three-dimensional (3-D) holoscopic imaging, also known as integral imaging, light field imaging, or plenoptic imaging, can provide natural and fatigue-free 3-D visualization. However, a large amount of data is required to represent 3-D holoscopic content, so efficient coding schemes for this particular type of image are needed. We propose a 3-D holoscopic image coding scheme with kernel-based minimum mean square error (MMSE) estimation. In the proposed scheme, each coding block is predicted by an MMSE estimator under statistical modeling. To capture the statistical behavior of the signal, kernel density estimation (KDE) is utilized to estimate the probability density function of the statistical model. As bandwidth estimation (BE) is a key issue in KDE, we also propose a BE method based on the kernel trick. The experimental results demonstrate that the proposed scheme achieves better rate-distortion performance and better visual rendering quality.
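The core of kernel-based MMSE prediction can be illustrated compactly: given previously coded (context, value) pairs, the conditional mean E[y | x] is estimated nonparametrically with a Gaussian kernel (a Nadaraya-Watson estimate). In the sketch below, a Silverman-style rule of thumb stands in for the paper's kernel-trick bandwidth estimation; that substitution and the synthetic data are assumptions.

```python
import numpy as np

def mmse_predict(ctx_train, y_train, ctx_query, h=None):
    if h is None:
        # Silverman-style rule of thumb (stand-in for the paper's BE method).
        h = 1.06 * ctx_train.std() * len(ctx_train) ** (-0.2) + 1e-8
    d2 = ((ctx_query[None, :] - ctx_train) ** 2).sum(axis=1)
    w = np.exp(-0.5 * d2 / h ** 2)                 # Gaussian kernel weights
    return (w @ y_train) / (w.sum() + 1e-12)       # estimate of E[y | x]

rng = np.random.default_rng(0)
ctx = rng.normal(size=(500, 4))                    # neighbouring-pixel contexts
y = ctx @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(scale=0.1, size=500)
print(mmse_predict(ctx, y, np.array([0.5, -0.2, 0.1, 0.3])))
```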
KEYWORDS: Distortion, Volume rendering, Image compression, Video coding, Quantization, Video, 3D video compression, 3D image processing, Video compression, Communication engineering
In multi-view plus depth (MVD) 3D video coding, texture maps and depth maps are coded jointly. The depth maps provide the scene geometry information and are used to render virtual views at the terminal through a Depth-Image-Based Rendering (DIBR) technique. Distortion in the coded texture maps and depth maps induces distortion in the synthesized virtual view. Besides the coding efficiency of the texture and depth maps, the bit allocation between them also has a great effect on virtual view quality. In this paper, the virtual view distortion is divided into texture-map-induced distortion and depth-map-induced distortion, and models of each are derived respectively. Based on the depth-map-induced virtual view distortion model, the rate-distortion optimization (RDO) of depth map coding is modified, increasing depth map coding efficiency. Meanwhile, we also propose a rate-distortion (R-D) model to solve the joint bit allocation problem. Experimental results demonstrate the high accuracy of the proposed virtual view distortion model. The R-D performance of the proposed algorithm is close to that of the full search algorithm, which gives the best R-D performance, while the coding complexity of the proposed algorithm is lower. Compared with a fixed texture-to-depth bit ratio (5:1), the proposed algorithm achieves an average gain of 0.3 dB. The proposed algorithm also has high rate control accuracy, with an average error of less than 1%.
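A hedged sketch of the joint bit-allocation idea: model the virtual view distortion as the sum of a texture-induced and a depth-induced term, each a function of its bit rate, and search for the texture/depth split that minimizes the model under a total-rate constraint. The inverse-rate distortion terms and their coefficients below are illustrative assumptions, not the paper's fitted models.

```python
import numpy as np

def virtual_view_distortion(r_tex, r_dep, a=120.0, b=40.0):
    # D_v = D_texture(R_t) + D_depth(R_d); inverse-rate terms assumed here.
    return a / r_tex + b / r_dep

def allocate(total_rate, step=0.01):
    ratios = np.arange(step, 1.0, step)            # candidate texture shares
    d = [virtual_view_distortion(r * total_rate, (1 - r) * total_rate)
         for r in ratios]
    best = ratios[int(np.argmin(d))]               # model-based "full search"
    return best * total_rate, (1 - best) * total_rate

r_t, r_d = allocate(total_rate=1000.0)             # e.g., kbps
print(f"texture {r_t:.0f}, depth {r_d:.0f}, ratio {r_t / r_d:.2f}")
```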