Special Section on Color in Texture and Material Recognition

Facial expression recognition in the wild based on multimodal texture features

Bo Sun, Liandong Li, Guoyan Zhou, Jun He

Beijing Normal University, College of Information Science and Technology, No. 19, XinJieKouWai Street, Beijing 100875, China

J. Electron. Imaging. 25(6), 061407 (Jun 22, 2016). doi:10.1117/1.JEI.25.6.061407
History: Received February 11, 2016; Accepted June 1, 2016

Open Access

Abstract. Facial expression recognition in the wild is a very challenging task. We describe our work on static and continuous facial expression recognition in the wild. We evaluate the recognition results of gray deep features and color deep features, and explore the fusion of multimodal texture features. For continuous facial expression recognition, we design two temporal–spatial dense scale-invariant feature transform (SIFT) features and combine multimodal features to recognize expressions from image sequences. For static facial expression recognition based on video frames, we extract dense SIFT and several deep convolutional neural network (CNN) features, including those of our proposed CNN architecture. We train linear support vector machine and partial least squares classifiers for these features on the static facial expression in the wild (SFEW) and acted facial expression in the wild (AFEW) datasets, and we propose a fusion network to combine all the extracted features at the decision level. Our final results are 56.32% on the SFEW testing set and 50.67% on the AFEW validation set, which are much better than the baseline recognition rates of 35.96% and 36.08%.


With the development of artificial intelligence and affective computing, facial expression recognition has shown promise for human–computer interfaces, online education, entertainment, intelligent environments, and so on. In past years, much research has been done on data collected in strictly controlled laboratory settings with frontal faces, perfect illumination, and posed expressions. When the application environment turns into a real-world scenario, methods using a single (monomial) feature such as local binary patterns (LBP)1 or bag of visual words2 cannot achieve promising results. In addition, unlike in lab-controlled datasets, human heads in a real environment can appear at any position in an image, with all sorts of angles and poses. For most automatic facial expression recognition methods, the first step is therefore to locate and extract the face from the whole scene. Traditionally, this step combines the Viola–Jones face detector with a Haar-cascade eye detector.3 Recently, methods such as the mixture of parts (MoPS)4 and the supervised descent method5 have provided robust face detection under various head rotations.

To explore facial expression recognition in the real world, we perform experiments on three public datasets: acted facial expression in the wild (AFEW), static facial expression in the wild (SFEW), and facial expression recognition (FER). The AFEW database6 consists of short video clips extracted from popular Hollywood movies. Each clip shows a film actor and has been labeled with one of seven basic facial expression categories, namely Anger, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise. The AFEW set has 711 training videos, 371 validation videos, and 539 test videos. Only the labels of the training and validation sets are known; the specific numbers are shown in Table 1. The SFEW database7 is derived from the same movies as the AFEW set, but consists of static frames. Both datasets are very challenging for traditional facial expression recognition methods due to the complicated movie scenes, which is reflected in the low baseline recognition rates of 36.08% and 35.96%. The SFEW set consists of 891, 427, and 372 RGB color images for training, validation, and testing, respectively. Samples of the expression data are shown in Fig. 1. The FER-2013 dataset8 was created using the Google image search application programming interface to search for face images matching a set of 184 emotion-related keywords such as “blissful” and “enraged.” It has 28,709 gray images for training and 7178 images for validation. On the FER dataset, human accuracy is 68±5%.

Fig. 1: Samples of facial expression data from SFEW. The expressions shown, from top-left to bottom-right, are anger, disgust, fear, happiness, neutral, sadness, and surprise. The images differ considerably in illumination and character pose.

Table 1: The number of samples for each expression in the AFEW, SFEW, and FER datasets.

In our proposed method, openly available tools such as MoPS4 and IntraFace5 are used for face detection and alignment. For facial expression, we employ the descriptors of LBP,1 local phase quantization (LPQ),9 histogram of oriented gradients (HOG),10 and dense scale-invariant feature transform (SIFT).2 We also design a deep convolutional neural network (CNN)11 for feature learning and compare the recognition results between gray data and color data. We then propose a fusion network for classification, which fuses the different features at the decision level and achieves promising recognition performance. We also compare its results with those of other state-of-the-art fusion methods.

The rest of this paper is organized as follows: Sec. 2 reviews related work. The facial image extraction process is described in Sec. 3. Section 4 details the deep features and handcrafted features we explore. Section 5 defines the proposed feature fusion network. Section 6 presents the experiments, including the feature configurations and the recognition results on the three datasets. Finally, conclusions are given in Sec. 7.

Much research has focused on facial expression recognition. Ekman and Friesen12 defined the facial action coding system action units for manual facial expression analysis. Zhao and Pietikainen1 proposed a volume local texture feature, LBP-TOP, and achieved remarkable facial expression recognition results in the laboratory. Kahou et al.13 used convolutional neural networks and deep belief networks and achieved the top performance in the EmotiW 2013 challenge. Liu et al.14 extracted facial expression features on a Grassmannian manifold and later combined Riemannian manifold features with a deep convolutional neural network in Ref. 15. Yao et al.16 combined a CNN model with facial action unit aware features and obtained state-of-the-art results for facial expression recognition in videos. Kim et al.17 explored several CNN architectures and data preprocessing methods. Yu and Zhang18 used data perturbation to augment the training data. Liu et al.19 proposed a boosted deep belief network for facial expression recognition and obtained promising results on several laboratory-recorded datasets. Ng et al.20 explored transfer learning for deep models, including VGG and AlexNet.21

Since no single feature descriptor can handle facial expression recognition in the wild alone, fusion methods can be used to combine multimodal features. Sikka et al.22 explored fusion with general multiple kernel learning (GMKL) and multi-label multiple kernel learning. Chen et al.23 used the SimpleMKL method to combine visual and acoustic features. Kim et al.17 proposed a committee machine method to combine 108 CNN models. Kahou et al.24 proposed a voting matrix and used random search to tune the fusion weight parameters; in Ref. 25, they used a multilayer perceptron to combine neural networks at the feature level. Gönen and Alpaydın26 reviewed many kinds of multiple kernel methods for general pattern recognition problems. Bucak et al.27 reviewed the state of the art in multiple kernel learning (MKL), focusing on applications to object recognition.

We follow the face extraction and tracking method of Sikka et al.2 and Dhall et al.28 For continuous facial expression recognition, the mixture of tree-structured part models (MoPS)4 face detector is used to locate the face in the first frame of a video. Then, the IntraFace toolkit, which implements the supervised descent method,5 tracks 49 facial landmarks through the remaining frames using a parameterized appearance model. All frames of the AFEW dataset are aligned to a base face through an affine transformation and cropped to 128×128 pixels.

For static facial expression recognition, the MoPS and OpenCV29 detectors are used for SFEW and FER, respectively. Facial landmarks generated by MoPS are used to align the faces for handcrafted feature extraction. For deep CNN features, which are robust to face pose, only coarse face alignment is performed by keeping the center of the facial landmark points or bounding boxes at the middle of the image. All face images are resized to 48×48 pixels for deep feature learning; for handcrafted features, the image size is set to 128×128. Because illumination and brightness vary widely in the SFEW dataset, we evaluate min–max normalization as an image preprocessing step.
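The min–max normalization step can be sketched as below; this is a minimal per-image implementation, and the target range and per-image (rather than per-dataset) scope are assumptions, as they are not specified in the text.

```python
import numpy as np

def min_max_normalize(face, eps=1e-8):
    """Rescale a gray face image to the full [0, 255] range.
    Minimal sketch of min-max preprocessing; the output range and the
    per-image scope are assumptions."""
    face = face.astype(np.float32)
    lo, hi = face.min(), face.max()
    return (face - lo) / (hi - lo + eps) * 255.0
```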

Feature Learning

The deep CNN11 is a popular type of model in the computer vision community. We deploy two kinds of CNN architectures: AlexNet and the regions CNN (RCNN). AlexNet21 is an eight-layer deep model designed for object recognition on the ImageNet dataset,30 using the rectified linear unit as its activation function. The AlexNet model has five convolutional layers and three fully connected layers, and it introduces a data augmentation strategy, local response normalization, and dropout to avoid over-fitting. The RCNN31 is a deep learning architecture that combines object detection with object recognition: it detects objects in a scene and then uses CNN features for classification. Both models are pretrained on the ImageNet dataset.

Based on AlexNet, we design a deep CNN architecture for facial expression recognition; the whole architecture is shown in Fig. 2. First, the facial images are cropped at the four corners and the center and flipped, yielding 10 patches of 40×40 pixels. The first convolutional layer filters the 40×40 input patch with 64 kernels of size 5×5. The second convolutional layer takes as input the response-normalized and max-pooled output of the first convolutional layer and filters it with 64 kernels of size 3×3×64. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers: the third convolutional layer has 128 kernels of size 3×3×64 connected to the (normalized, pooled) outputs of the second convolutional layer, and the fourth and fifth convolutional layers both have 128 kernels of size 3×3×128. The fully connected (FC) layers have 1024 neurons each. Rectified linear unit activations are applied to the output of every convolutional and fully connected layer. During training, a softmax regression layer is used as the output; for feature extraction, we use the last FC layer as the output. In our experiments, we visualize the activations of the first convolutional layers of AlexNet and of our proposed CNN, shown in Fig. 3. Some feature maps of AlexNet are not activated in the expression recognition task, which is reasonable because AlexNet is trained on the ImageNet dataset, so its features carry much information unrelated to human facial expression.
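For concreteness, the architecture described above can be sketched as follows; kernel counts and sizes follow the text, while pooling windows, strides, padding, normalization parameters, and the number of FC layers before the softmax are assumptions, since they are not specified. This is a sketch in PyTorch rather than the Caffe implementation actually used.

```python
import torch
import torch.nn as nn

class ExpressionCNN(nn.Module):
    """Sketch of the proposed CNN (Fig. 2); hyperparameters not stated in
    the text (pooling, stride, padding, LRN size) are assumptions."""
    def __init__(self, in_channels=1, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=5),     # conv1: 64 kernels, 5x5
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),   # conv2: 64 kernels, 3x3x64
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # conv3: 128 kernels, 3x3x64
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), # conv4: 128 kernels, 3x3x128
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), # conv5: 128 kernels, 3x3x128
            nn.ReLU(inplace=True),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),  # last FC: used as the deep feature
        )
        self.output = nn.Linear(1024, num_classes)         # softmax applied via the loss

    def forward(self, x):                    # x: (N, 1, 40, 40) cropped patches
        feat = self.fc(self.features(x))     # 1024-dimensional feature vector
        return self.output(feat)
```

Training would use softmax cross-entropy on the output layer, and the 1024-dimensional activation of the last FC layer would be taken as the deep feature fed to the SVM and PLS classifiers.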

Fig. 2: Deep CNN architecture for feature learning.

Fig. 3: Learned features of the first convolutional layers: (a) AlexNet and (b) our proposed CNN.

Handcrafted Features

For images of the SFEW dataset, we extract LBP, dense SIFT, and deep CNN features. For video clips of the AFEW dataset, we extract volume features such as LBP-TOP and LPQ-TOP, and pool the dense SIFT, HOG, and deep CNN features over the image sequence of each video. In addition, we design two temporal–spatial features: SIFT-TOP and SIFT-LBP. The pipeline for extracting these handcrafted features is as follows: on the face images extracted from a video, alignment through facial landmark points and spatial pyramid matching (SPM) are performed, and the features are then encoded after extraction. The pipeline is shown in Fig. 4.

Fig. 4: Pipeline of handcrafted feature extraction. The dashed box indicates that the temporal–spatial representation is used only for the AFEW dataset.

Image descriptors

The LBP descriptor is an efficient representation of facial image texture and has been successfully applied to facial expression recognition.1 It can be written as

$$ d_p = \sum_{i=1}^{K} 2^{\,i-1}\, I(O_p, N_i). \tag{1} $$

In Eq. (1), I(O_p, N_i) is the Boolean comparison between a pixel O_p and its i'th neighboring pixel N_i, of which there are K in total. The resulting binary labels form the local binary pattern code d_p at each pixel p of the image.
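As an illustration of Eq. (1), a minimal 8-neighbor LBP implementation is sketched below; the neighbor ordering and border handling are assumptions, and the paper additionally keeps only the 59 uniform patterns when building histograms (Sec. 4.2.2).

```python
import numpy as np

def lbp_codes(img):
    """Basic 8-neighbor LBP of Eq. (1): compare each pixel O_p with its
    K = 8 neighbors N_i and pack the Boolean results into an integer code.
    The code uses weights 2^i for i = 0..K-1, equivalent to 2^(i-1) for i = 1..K."""
    img = img.astype(np.float32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # offsets of the 8 neighbors, ordered clockwise from the top-left (an assumption)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int32)
    for i, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (neighbor >= center).astype(np.int32) * (1 << i)  # I(O_p, N_i) * 2^i
    return codes  # a histogram of these codes gives the LBP feature
```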

The LPQ9 descriptor is calculated by computing the short-term Fourier transform on local image windows. The descriptor uses phase information computed locally in a window at every image position. The phases of the four low-frequency coefficients are decorrelated and uniformly quantized in an eight-dimensional space.

The HOG10 descriptor is computed by dividing the image window into small spatial regions, with each region accumulating a local one-dimensional histogram of gradient directions or edge orientations over its pixels. The combined histogram entries form the representation.

The dense SIFT feature32 applies the SIFT descriptor on a dense grid of locations at a fixed scale and orientation. The SIFT descriptor associates with each grid location a signature that describes its appearance compactly and robustly. The dense SIFT feature, which characterizes appearance information, is often used for categorization tasks.

Feature encoding and pyramid matching

For the LBP and LPQ descriptors, histograms of all binary codewords form the final image features. Note that only the statistics of the 59 uniform local binary patterns1 are considered. For the dense SIFT descriptor, the bag-of-words model has shown remarkable performance on facial expression recognition.22 First, we extract multiscale dense SIFT descriptors32 from 100 randomly picked image samples. Then, 800 clustering centers are constructed using an approximate K-means clustering algorithm; the number 800 is used throughout the experiments. The dense SIFT descriptors of the whole dataset are then encoded using locality-constrained linear coding (LLC),33 which guarantees the sparsity and locality of the coded words.
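A sketch of the codebook construction and LLC encoding is given below. MiniBatchKMeans stands in for the approximate K-means clustering used here, and the number of nearest codewords and the regularizer follow the approximated LLC of Wang et al.,33 so their exact values are assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(descriptors, n_words=800, seed=0):
    """Codebook of 800 visual words over sampled dense-SIFT descriptors."""
    km = MiniBatchKMeans(n_clusters=n_words, random_state=seed).fit(descriptors)
    return km.cluster_centers_                      # (800, 128)

def llc_encode(descriptors, codebook, knn=5, beta=1e-4):
    """Approximated LLC: reconstruct each descriptor from its knn nearest
    codewords with a sum-to-one local least-squares solution."""
    n = descriptors.shape[0]
    codes = np.zeros((n, codebook.shape[0]))
    dist = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nn_idx = np.argsort(dist, axis=1)[:, :knn]      # knn nearest codewords
    for i in range(n):
        B = codebook[nn_idx[i]] - descriptors[i]    # shift basis to the origin
        G = B @ B.T + beta * np.eye(knn)            # regularized local Gram matrix
        w = np.linalg.solve(G, np.ones(knn))
        codes[i, nn_idx[i]] = w / w.sum()           # enforce the sum-to-one constraint
    return codes  # codes are then pooled per image (or per SPM cell)
```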

In our experiments, we apply spatial pyramid matching34 to the handcrafted descriptors. Experimental results show that spatial pyramid matching improves recognition accuracy by adding spatial information to the final features. The numbers of pyramid layers for LBP, LPQ, and dense SIFT are 4, 4, and 5, respectively.

Temporal–Spatial Representation

For continuous facial expression recognition, the image features have to be extended to the temporal–spatial domain. After obtaining the image features of all frames of a video clip, max pooling is usually used to aggregate the frame features into one video feature. Although this gives decent performance, it loses much of the detailed temporal information in a video. Based on analysis of our experiments, we add temporal information by extracting LBP descriptors on the XT and YT planes of a video (where T denotes the time domain) and combining them with the dense SIFT feature of the XY plane (i.e., the image space); we call this representation SIFT-LBP, shown in Fig. 5. The LBP descriptors of the XT and YT frames are encoded into XT and YT histograms after spatial pyramid matching, and the bag of multiscale dense SIFT features is extracted from every XY frame following the pipeline described in Sec. 4.2.2. We also explore directly extracting dense SIFT features on the three orthogonal planes XY, XT, and YT (SIFT-TOP). Our experiments show that the new temporal–spatial descriptor, SIFT-LBP, has better performance. We also explore using deep learned features for temporal–spatial representation by taking the maximum pooling value of the CNN feature vectors over all frames; unfortunately, the recognition result on the AFEW dataset is unsatisfactory.

Fig. 5: Our proposed SIFT-LBP temporal–spatial representation for video.
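The assembly of the SIFT-LBP video representation can be summarized as below; this is a schematic sketch in which the per-frame encoders are assumed to be those described above, and the concatenation order is an assumption.

```python
import numpy as np

def sift_lbp_video_feature(xy_sift_codes, xt_lbp_hist, yt_lbp_hist):
    """Combine per-frame XY dense-SIFT codes with XT/YT LBP histograms.
    xy_sift_codes: (n_frames, d_sift) LLC codes, one row per frame.
    xt_lbp_hist, yt_lbp_hist: SPM-pooled LBP histograms of the XT/YT planes.
    Max pooling over frames follows Sec. 4.3; concatenation order is assumed."""
    xy_pooled = xy_sift_codes.max(axis=0)          # max pooling over time
    return np.concatenate([xy_pooled, xt_lbp_hist, yt_lbp_hist])
```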

Classifiers
Support vector machine

The features we extract are all linearly separable under ideal conditions, so we use linear support vector machines (SVMs) as the basic classifiers. Given a training set of l data points (x_i, y_i), i = 1, …, l, x_i ∈ R^n, y_i ∈ {−1, +1}, the support vector classifier solves the following unconstrained optimization problem:35

$$ \min_{\theta}\ \frac{1}{2}\theta^{T}\theta + C\sum_{i=1}^{l}\xi(\theta; x_i, y_i), \tag{2} $$
where C is the penalty parameter and ξ(θ; x_i, y_i) = max(1 − y_i θ^T x_i, 0)^2 is the loss function. For a testing sample x, the SVM predicts it as positive if θ^T x > 0 and negative otherwise. Here, we use the SVM decision value D_SVM = θ^T x as the input to the subsequent fusion process. As the SVM is a binary classifier, we follow a one-versus-rest strategy, which separates the data points of one category from all the others, one category at a time.
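A minimal sketch of this linear SVM stage with scikit-learn is shown below; LinearSVC's squared hinge loss matches the loss in Eq. (2), and the one-versus-rest decision values serve as D_SVM for the fusion network. The function name and the fixed C are illustrative.

```python
from sklearn.svm import LinearSVC

def svm_decision_values(X_train, y_train, X_val, C=1.0):
    """One-versus-rest linear SVM with squared hinge loss, as in Eq. (2).
    Returns the (n_samples, n_classes) decision values used by the
    fusion network; C would be tuned by cross validation (Sec. 6.2)."""
    clf = LinearSVC(C=C, loss="squared_hinge")   # multiclass handled one-vs-rest
    clf.fit(X_train, y_train)
    return clf.decision_function(X_val)          # D_SVM = theta^T x per class
```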

Partial least squares regression

Partial least squares (PLS) regression is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of minimum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables into a new space. Following Ref. 14, given a feature set X ∈ R^n with label Y, the PLS classifier decomposes these variables into

$$ X = U_x V_x^{T} + r_x, \qquad Y = U_y V_y^{T} + r_y, \tag{3} $$

where U_x and U_y contain the extracted score vectors, V_x and V_y are orthogonal loading matrices, and r_x and r_y are residuals. PLS finds the optimal weights w_x and w_y that maximize the covariance, such that

$$ [\mathrm{cov}(u_x, u_y)]^{2} = \max_{|w_x| = |w_y| = 1} [\mathrm{cov}(X w_x, Y w_y)]^{2}. \tag{4} $$

The regression coefficients β are then obtained as

$$ \beta = X^{T} U_y \left(U_x^{T} X X^{T} U_x\right)^{-1} U_x^{T} Y. \tag{5} $$

The PLS decision value is estimated as D_PLS = Xβ. As in Sec. 5.1, we follow a one-versus-rest strategy for multiclass classification.
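A PLS stage can be sketched with scikit-learn's PLSRegression as below. Fitting a single PLS model against one-hot class targets and taking the regression output Xβ as the decision value is our reading of the description above, standing in for the one-versus-rest scheme, so these details are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def pls_decision_values(X_train, y_train, X_val, n_components=20, n_classes=7):
    """PLS regression against one-hot labels; the continuous predictions
    (X * beta) serve as D_PLS for the fusion network. n_components plays
    the role of the PLS dimension n tuned in the experiments."""
    Y = np.eye(n_classes)[y_train]                # one-hot encoding (labels 0..n-1 assumed)
    pls = PLSRegression(n_components=n_components)
    pls.fit(X_train, Y)
    return pls.predict(X_val)                     # (n_samples, n_classes) decision values
```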

Fusion Network

As different features have different discriminative abilities for specific emotions,36 we propose a fusion network, shown in Fig. 6, to combine the results of the individual classifiers.

Fig. 6: Layers of the proposed fusion network.

Given m features and n classes, the SVM or PLS classifiers generate m×n decision values, denoted a(j,k) = θ_jk^T x_j, j = 1, …, m, k = 1, …, n. These values are used as the input to the fusion network. For an input a, we use the hypothesis function h_W(a)

$$ h_W(a^{(i)}) = \begin{bmatrix} p(y^{(i)}=1 \mid a^{(i)}; W) \\ p(y^{(i)}=2 \mid a^{(i)}; W) \\ \vdots \\ p(y^{(i)}=n \mid a^{(i)}; W) \end{bmatrix} = \frac{1}{\sum_{k=1}^{n} e^{\sum_{j=1}^{m} W_{jk}^{T} a_{jk}^{(i)}}} \begin{bmatrix} e^{\sum_{j=1}^{m} W_{j1}^{T} a_{j1}^{(i)}} \\ e^{\sum_{j=1}^{m} W_{j2}^{T} a_{j2}^{(i)}} \\ \vdots \\ e^{\sum_{j=1}^{m} W_{jn}^{T} a_{jn}^{(i)}} \end{bmatrix} \tag{6} $$
to estimate P(y = k | a), the probability of the class label y taking each of the n possible values. Here, W denotes the m×n weight matrix. The final output is thus an n-dimensional vector of n class probabilities, and the final prediction uses a max-win strategy to choose the most likely label.

We use a loss function J(W) for optimization. Gradient descent is applied to obtain the optimized values of W by updating W ← W − α∇_W J(W) at every iteration, where α is the step size:

$$ J(W) = -\frac{1}{L}\left[\sum_{i=1}^{L}\sum_{k=1}^{n} 1\{y^{(i)} = k\}\log\frac{e^{\sum_{j=1}^{m} W_{jk}^{T} a_{jk}^{(i)}}}{\sum_{k'=1}^{n} e^{\sum_{j=1}^{m} W_{jk'}^{T} a_{jk'}^{(i)}}}\right] + \frac{\lambda}{2}\sum_{j=1}^{m}\sum_{k=1}^{n} W_{jk}^{2}, \tag{7} $$

$$ \nabla_{W_k} J(W) = -\frac{1}{L}\sum_{i=1}^{L}\left[a^{(i)}\left(1\{y^{(i)} = k\} - p(y^{(i)} = k \mid a^{(i)}; W)\right)\right] + \lambda W_k, \tag{8} $$
where L is the number of training examples, λ is the L2-norm regularization parameter, and 1{·} is the indicator function: 1{a true statement} = 1 and 1{a false statement} = 0.
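Under this reading, the fusion network is softmax regression over the stacked m×n decision values; a minimal numpy sketch is given below, where the learning rate, iteration count, and 0-indexed labels are assumptions.

```python
import numpy as np

def train_fusion_network(A, y, n_classes, lam=1e-3, lr=0.1, iters=500):
    """A: (L, m*n) stacked SVM/PLS decision values a(j,k), one row per sample.
    Learns W by gradient descent on the loss of Eq. (7) and returns W plus a
    predict function implementing Eq. (6) with the max-win rule."""
    L, d = A.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]                          # one-hot ground truth
    for _ in range(iters):
        scores = A @ W                                # sum_j W_jk * a_jk per class
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)             # Eq. (6): class probabilities
        grad = -(A.T @ (Y - P)) / L + lam * W         # Eq. (8)
        W -= lr * grad                                # gradient descent update
    def predict(A_new):
        return np.argmax(A_new @ W, axis=1)           # max-win strategy
    return W, predict
```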

In the experiments, we fuse the decision values of the SVM and PLS classifiers and find that this kind of fusion network performs better than an SVM-only fusion network.

Deep Feature Learning of Color and Gray Images

For deep feature learning, we employ the Caffe37 implementation, which is commonly used in recent work. To pretrain the CNN model with our proposed architecture, we use expression images from the FER dataset. The base learning rate is set to 0.005 and is divided by 10 every 10,000 iterations. In each iteration, 256 samples are used for stochastic gradient optimization. After 200 epochs of training, our proposed CNN reaches 67.82% on the FER validation set. We then fine-tune the model on the SFEW set with a base learning rate of 0.001. After 300 epochs of fine-tuning, the validation accuracy converges. The experimental results are shown in Table 2. The RGB color CNN model with min–max normalization achieves a slightly better recognition result.

Table 2: Comparison results of the proposed CNN model on color and gray image data.
Results of Monomial Features

We extract the features listed in Sec. 4 and apply the SVM and PLS classifiers; the results are shown in Tables 3 and 4. On the SFEW dataset, based on comparison experiments, we extract the last pooling layer's activations as the AlexNet and RCNN features; for our proposed CNN, the last fully connected layer's output is extracted. Using the SVM and PLS classifiers further improves the recognition results of the CNN model. On the AFEW dataset, since each frame produces a CNN, a dense SIFT, and a HOG feature vector, information from all frames of a video is combined using a pooling strategy, which takes the maximum or mean value of the feature vectors over all frames; experimentally, max pooling gives better results for dense SIFT and HOG. All SVM classifiers use linear kernels. Classification models are trained on the training set, and parameters are tuned on the validation set through fivefold cross validation over the range from 2^{-10} to 2^{10}. The results show that our proposed CNN feature and the SIFT-LBP feature perform well on the SFEW and AFEW datasets, respectively.
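The parameter search described above can be reproduced with a standard grid search; a sketch (using scikit-learn and limited to the SVM cost C) is given below, with the function name and classifier settings as illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def tune_svm_cost(X, y):
    """Fivefold cross validation of the SVM cost C over 2^-10 ... 2^10,
    as described in the text; returns the classifier refit with the best C."""
    grid = {"C": 2.0 ** np.arange(-10, 11)}
    search = GridSearchCV(LinearSVC(), grid, cv=5)
    return search.fit(X, y).best_estimator_
```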

Table 3: Recognition accuracies on SFEW. C is the cost parameter of the SVM and n is the PLS dimension. P denotes the activation value of the last pooling layer, and FC denotes the activation value of the last FC layer.
Table 4: Recognition accuracies on AFEW.

For the AFEW and SFEW datasets, we use a four-layer SPM for the LBP and LPQ features: each image is partitioned into l×l segments at scales l = 1, 2, 4, and 8. For example, the dimension of SPM LBP-TOP is 15,045 (85 regions × 3 planes × 59 uniform patterns). Too many SPM layers lead to a larger dimension that is harder to optimize for classification; because dense SIFT uses LLC coding, a five-layer SPM achieves the best performance for it.

Fusion Results of Multimodal Features

Our proposed fusion network is then used to combine the classification results of these features. We train the fusion network on the validation set, and the L2-norm parameter λ is chosen through cross validation on the validation set. Fusion results are shown in Tables 5 and 6; they show that our proposed method is better on both the validation and testing sets. We compare the fusion network with GMKL,38 SimpleMKL,39 and three other works17,18,20 on the SFEW set, and our fusion network outperforms the other methods on the validation set. As the test labels of the AFEW and SFEW datasets are not publicly released, we cannot obtain final test results for all of our methods; nevertheless, the proposed fusion network performs well and is robust under cross validation. Note that some features perform better when classified by PLS, so the fusion network combining PLS and SVM achieves better results than the one using only SVM.

Table 5: Fusion results on the SFEW dataset. The SVM fusion network fuses SVM results only. In the fusion network, the AlexNet and RCNN features are classified by PLS.
Table 6: Fusion results on the AFEW dataset. In the fusion network, LPQ-TOP is classified by PLS.

In this paper, we design several texture features for automatic human facial expression recognition in the real world. For each feature, we train individual SVM and PLS classifiers, which have different discriminative abilities for facial expression classification, and we propose a fusion network to exploit these feature characteristics. The method is evaluated on the AFEW and SFEW datasets and achieves very promising results. In the future, we will explore further temporal–spatial representation methods to improve continuous facial expression recognition and investigate component analysis methods to reduce the feature dimensions.

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61501035 and KJZXCJ2016042), the Fundamental Research Funds for the Central Universities of China (2014KJJCA15), and the National Education Science Twelfth Five-Year Plan Key Issues of the Ministry of Education (DCA140229).

References

1. Zhao G. and Pietikainen M., "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 915–928 (2007).
2. Sikka K. et al., "Exploring bag of words architectures in the facial expression domain," in Computer Vision–ECCV 2012. Workshops and Demonstrations, pp. 250–259, Springer, Berlin Heidelberg (2012).
3. Valstar M. F. et al., "The first facial expression recognition and analysis challenge," in IEEE Int. Conf. on Automatic Face & Gesture Recognition and Workshops (FG 2011), pp. 921–926, IEEE (2011).
4. Zhu X. and Ramanan D., "Face detection, pose estimation, and landmark localization in the wild," in 2012 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 2879–2886, IEEE (2012).
5. Xiong X. and De la Torre F., "Supervised descent method and its applications to face alignment," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 532–539, IEEE (2013).
6. Dhall A. et al., "Collecting large, richly annotated facial-expression databases from movies," IEEE Multimedia 19(3), 34–41 (2012).
7. Dhall A. et al., "Static facial expression analysis in tough conditions: data, evaluation protocol and benchmark," in IEEE Int. Conf. on Computer Vision Workshops (ICCV Workshops), pp. 2106–2112, IEEE (2011).
8. Goodfellow I. J. et al., "Challenges in representation learning: a report on three machine learning contests," in Neural Information Processing, Lee M. et al., Eds., pp. 117–124, Springer, Berlin Heidelberg (2013).
9. Päivärinta J., Rahtu E. and Heikkilä J., "Volume local phase quantization for blur-insensitive dynamic texture classification," in Image Analysis, Heyden A. and Kahl F., Eds., pp. 360–369, Springer, Berlin Heidelberg (2011).
10. Dalal N. and Triggs B., "Histograms of oriented gradients for human detection," in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR 2005), Vol. 1, pp. 886–893, IEEE (2005).
11. LeCun Y., Bengio Y. and Hinton G., "Deep learning," Nature 521(7553), 436–444 (2015).
12. Ekman P. and Rosenberg E. L., Eds., What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS), Oxford University Press (1997).
13. Kahou S. E. et al., "Combining modality specific deep neural networks for emotion recognition in video," in Proc. of the 15th ACM Int. Conf. on Multimodal Interaction, pp. 543–550, ACM (2013).
14. Liu M. et al., "Partial least squares regression on Grassmannian manifold for emotion recognition," in Proc. of the 15th ACM Int. Conf. on Multimodal Interaction, pp. 525–530, ACM (2013).
15. Liu M. et al., "Combining multiple kernel methods on Riemannian manifold for emotion recognition in the wild," in Proc. of the 16th Int. Conf. on Multimodal Interaction, pp. 494–501, ACM (2014).
16. Yao A. et al., "Capturing AU-aware facial features and their latent relations for emotion recognition in the wild," in Proc. of the 2015 ACM Int. Conf. on Multimodal Interaction, pp. 451–458, ACM (2015).
17. Kim B. K. et al., "Hierarchical committee of deep convolutional neural networks for robust facial expression recognition," J. Multimodal User Interfaces 10(2), 173–189 (2016).
18. Yu Z. and Zhang C., "Image based static facial expression recognition with multiple deep network learning," in Proc. of the 2015 ACM Int. Conf. on Multimodal Interaction, pp. 435–442, ACM (2015).
19. Liu P. et al., "Facial expression recognition via a boosted deep belief network," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2014), pp. 1805–1812, IEEE (2014).
20. Ng H. W. et al., "Deep learning for emotion recognition on small datasets using transfer learning," in Proc. of the 2015 ACM Int. Conf. on Multimodal Interaction, pp. 443–449, ACM (2015).
21. Krizhevsky A., Sutskever I. and Hinton G. E., "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105 (2012).
22. Sikka K. et al., "Multiple kernel learning for emotion recognition in the wild," in Proc. of the 15th ACM Int. Conf. on Multimodal Interaction, pp. 517–524, ACM (2013).
23. Chen J. et al., "Emotion recognition in the wild with feature fusion and multiple kernel learning," in Proc. of the 16th Int. Conf. on Multimodal Interaction, pp. 508–513, ACM (2014).
24. Kahou S. E. et al., "Combining modality specific deep neural networks for emotion recognition in video," in Proc. of the 15th ACM Int. Conf. on Multimodal Interaction, pp. 543–550, ACM (2013).
25. Kahou S. E. et al., "Recurrent neural networks for emotion recognition in video," in Proc. of the 2015 ACM Int. Conf. on Multimodal Interaction, pp. 467–474, ACM (2015).
26. Gönen M. and Alpaydın E., "Multiple kernel learning algorithms," J. Mach. Learn. Res. 12, 2211–2268 (2011).
27. Bucak S. S., Jin R. and Jain A. K., "Multiple kernel learning for visual object recognition: a review," IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1354–1369 (2014).
28. Dhall A. et al., "Video and image based emotion recognition challenges in the wild: EmotiW 2015," in Proc. of the 2015 ACM Int. Conf. on Multimodal Interaction, pp. 423–426, ACM (2015).
29. Bradski G., "The OpenCV library," Dr. Dobb's Journal 25(11), 120–126 (2000).
30. Russakovsky O. et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vision 115(3), 211–252 (2015).
31. Girshick R. et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2014), pp. 580–587, IEEE (2014).
32. Vedaldi A. and Fulkerson B., "VLFeat: an open and portable library of computer vision algorithms," in Proc. of the Int. Conf. on Multimedia, pp. 1469–1472, ACM (2010).
33. Wang J. et al., "Locality-constrained linear coding for image classification," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2010), pp. 3360–3367, IEEE (2010).
34. Lazebnik S., Schmid C. and Ponce J., "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in 2006 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR'06), Vol. 2, pp. 2169–2178, IEEE (2006).
35. Fan R. E. et al., "LIBLINEAR: a library for large linear classification," J. Mach. Learn. Res. 9, 1871–1874 (2008).
36. Sun B. et al., "Combining multimodal features with hierarchical classifier fusion for emotion recognition in the wild," in Proc. of the 16th Int. Conf. on Multimodal Interaction, pp. 481–486, ACM (2014).
37. Jia Y. et al., "Caffe: convolutional architecture for fast feature embedding," in Proc. of the ACM Int. Conf. on Multimedia, pp. 675–678, ACM (2014).
38. Varma M. and Babu B. R., "More generality in efficient multiple kernel learning," in Proc. of the 26th Annual Int. Conf. on Machine Learning, pp. 1065–1072, ACM (2009).
39. Rakotomamonjy A. et al., "SimpleMKL," J. Mach. Learn. Res. 9, 2491–2521 (2008).

Bo Sun received his BSc degree in computer science from Beihang University, China, and his MSc and PhD degrees from Beijing Normal University, China. He is currently a professor in the Department of Computer Science and Technology, Beijing Normal University. His research interests include pattern recognition, natural language processing, and information systems. He is a member of ACM and a senior member of China Society of Image and Graphics.

Liandong Li received his BSc degree in computer science and technology from Beijing Normal University, 2011. He is currently working toward the PhD in computer application technology at Beijing Normal University. His research interests include machine learning, computer vision, and emotion analysis.

Guoyan Zhou received her BSc degree in computer science and technology from Beijing Normal University, 2013. Currently, she is working toward the MSc degree in computer application technology at Beijing Normal University. Her research interests include machine learning and computer vision.

Jun He received her BSc degree in optical engineering and her PhD in physical electronics from Beijing Institute of Technology, Beijing, China, in 1998 and 2003, respectively. Since 2003, she has been with the College of Information Science and Technology, Beijing Normal University, Beijing, China, where she became a lecturer in 2003 and an associate professor in 2010. Her research interests include image processing applications and pattern recognition.

© The Authors. Published by SPIE under a Creative Commons Attribution 3.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.
