Regular Articles

Improving video foreground segmentation with an object-like pool

Author Affiliations
Xiaoliu Cheng, Huawei Liu

Chinese Academy of Sciences, Shanghai Institute of Microsystem and Information Technology, Wireless Sensor Network Laboratory, No. 865 Changning Road, Changning District, Shanghai 200050, China

University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China

Wei Lv, Xing You, Baoqing Li, Xiaobing Yuan

Chinese Academy of Sciences, Shanghai Institute of Microsystem and Information Technology, Wireless Sensor Network Laboratory, No. 865 Changning Road, Changning District, Shanghai 200050, China

J. Electron. Imaging. 24(2), 023034 (Apr 23, 2015). doi:10.1117/1.JEI.24.2.023034
History: Received January 9, 2015; Accepted March 27, 2015

Open Access

Abstract.  Foreground segmentation in video frames is quite valuable for object and activity recognition, but existing approaches often demand training data or an initial annotation, which is expensive and inconvenient. We propose an automatic and unsupervised method of foreground segmentation for an unlabeled and short video. The pixel-level optical flow and binary mask features are converted into normalized probabilistic superpixels, which makes them suitable for building the superpixel-level conditional random field that labels the foreground and background. We exploit the fact that the appearance and motion features of a moving object are generally coherent in time and space to construct an object-like pool and a background-like pool from the previously segmented results. The continuously updated pools serve as the "prior" knowledge for the current frame and provide a reliable way to learn the features of the object. Experimental results demonstrate that our approach outperforms current methods, both qualitatively and quantitatively.


Video foreground segmentation plays a prerequisite role in a variety of visual applications such as safety surveillance1 and intelligent transportation.2 Existing algorithms usually use supervised or semisupervised methods and achieve satisfactory results. However, their performance is still limited on unsupervised and short videos, because supervised methods usually demand many training examples that are expensive to label manually. Furthermore, the training examples cannot cover all conditions, so new examples must be collected and the model retrained to improve generalization. Some semisupervised methods require an accurate object-region annotation only for the first frame, then exploit region-tracking methods to segment the remaining frames. However, many visual applications like safety surveillance demand intelligent and unattended operation, which makes the initial annotation impractical. Moreover, the available video frames may sometimes be insufficient, since objects near the camera can move rapidly into and out of the visual field.

There has been a substantial amount of work related to foreground segmentation. Classical segmentation methods that operate at the pixel level are often based on local features like textons,3 augmented by Markov random field or graph-cut based methods to refine the results.4,5 Furthermore, some newer methods of this type treat meaningful superpixels, rather than rigid pixels, as the basic units to obtain better results,6–10 because superpixels are efficient in practice, more robust to noise than pixels, and represent objects well. For instance, Tian et al.6 propose two superpixel-based data terms and smoothness terms defined on the spatiotemporal superpixel neighborhood with a shape cue to implement the segmentation. Their method can handle video sequences of arbitrary length, although it requires the first frame to be labeled manually. Shu et al.9 apply a superpixel-based bag-of-words model to iteratively refine the output of a generic detector, then exploit an online-learning appearance model to train a support vector machine and extract the exact objects using a conditional random field (CRF). However, it requires a mass of varied examples to train the classifier, and it is not well adapted to short videos.

Perhaps the work most closely related to ours is that of Schick et al.8 They convert the traditional pixel-based segmentation into a probabilistic superpixel representation and integrate structure information and similarities into a Markov random field (MRF) to improve the segmentation. Their probabilistic superpixel Markov random field (PSP-MRF) method improves the shape of the object in the given foreground segmentation, reduces noisy regions, and improves recall, precision, and F-measure. However, it depends heavily on the binary mask (see Sec. 3.3): if the given binary mask is poor because of a cluttered background, the performance declines rapidly. In addition, it does not make full use of local features and environmental information to achieve more robust results.

To improve the performance of unsupervised and short video segmentation, we propose an online unsupervised learning approach inspired by Ref. 9. The intuition is that the appearance and motion features of the moving object vary slowly frame by frame in a typical video. Based on this temporal and spatial coherence, we can exploit the segmented result of the previous frame to provide valuable cues for the current segmentation.

This paper aims to segment the moving foreground from an unlabeled and short video in an unsupervised way without prior knowledge. An overview of our approach is illustrated in Fig. 1. The main contributions of our work are as follows: (1) The pixel-level optical flow and binary mask features are converted into normalized probabilistic superpixels, which fit well into the CRF. (2) Owing to the temporal and spatial coherence of the appearance and motion features of the moving object, we leverage the previously segmented result to build an object-like pool and a background-like pool, which serve as the "prior" knowledge for the current segmentation. The continuously updated pools provide a reliable and continuous way to learn the features of the object. The proposed algorithm has been validated on several challenging videos from the Change Detection 2014 dataset, and experimental results demonstrate that our approach outperforms the other methods in both accuracy and robustness, even when the basic features suffer from great interference.

Fig. 1

The overview of our approach: (a) input sequential frames, (b) moving region, (c) binary mask, (d) superpixel-level optical flow, (e) foreground likelihood, (f) segmented results, (g) object-like pool, and (h) background-like pool.

The rest of this paper is organized as follows: Sec. 2 presents our detailed approach. Experimental results are given in Sec. 3 and conclusions are discussed in Sec. 4.

Since we have no prior knowledge about the unlabeled video, we actually know nothing about the object at first: we do not know its type, size, moving direction, and so on. Similarly, the scenario is also unpredictable: it may suffer from swaying trees, illumination change, bad weather, shadows, and so on. Therefore, an unsupervised and efficient approach should be developed because of the limited information in the short video.

First, the optical flow field is used as the initial detector to extract the moving region, which is actually a coarse bounding box. Second, the pixel-level optical flow and binary mask features are converted into normalized probabilistic superpixels. Combining these with the foreground likelihood generated by the object-like pool and background-like pool, we build a superpixel-based CRF model to provide a natural way to learn the conditional distribution over the class labeling. Afterward, a graph-cut based method is adopted to achieve the foreground segmentation. Last, an exception-handling mechanism is applied to avoid error accumulation in the case of abnormal events.

Superpixel Segmentation

Superpixels11,12 have become a significant tool in computer vision. They group pixels into meaningful subregions rather than treating them as rigid units, which greatly reduces the complexity of image-processing tasks. Moreover, superpixels are uniform in color and space and adhere well to object contours. They have become the basic building blocks of many computer vision algorithms, such as object segmentation,9 depth estimation,13 and object tracking.14 As a kind of middle-level feature, superpixels both increase the speed and improve the quality of the segmented results.

Simple linear iterative clustering (SLIC)15 is an efficient method of superpixel segmentation, which is also simple to implement and easy to apply in practice. In this paper, we set a proper size of superpixels (8×8 in all the experiments) and segment the image with the SLIC algorithm. Then we acquire the table of the labeled superpixels, the seeds of the superpixels, and the number of the superpixels. Specifically, the table shows the label values of all the pixels and the maximum value represents the total number of the final superpixels. Note that the exact number of the segmented superpixels is usually not equal to the given number because some small superpixels are integrated into the larger ones. The seeds of the superpixels are used to judge the neighbor information since the labeled values of superpixels are not in order.
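The bookkeeping described above — the label table, the seeds, and the superpixel count — can be sketched in a few lines of numpy. This is an illustrative helper of our own (`superpixel_stats`, assuming 0-based labels), not part of the SLIC implementation:

```python
import numpy as np

def superpixel_stats(labels):
    """Given a pixel-level superpixel label table (as produced by SLIC),
    return the number of superpixels and the seed (centroid) of each one."""
    n = int(labels.max()) + 1            # labels run 0..n-1, so max+1 is the count
    seeds = np.zeros((n, 2))
    for k in range(n):
        ys, xs = np.nonzero(labels == k)
        seeds[k] = ys.mean(), xs.mean()  # the centroid serves as the seed
    return n, seeds

# toy 4x4 label table containing four 2x2 superpixels
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 3, 3],
                   [2, 2, 3, 3]])
n, seeds = superpixel_stats(labels)
```

The seeds are what the neighbor tests later in the pipeline rely on, since the label values themselves carry no spatial order.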

Probabilistic Superpixels

The pixel-level processing is vulnerable to unpredictable noise and also suffers from a heavy calculation burden. In order to achieve robust and efficient segmentation, we operate at the superpixel level in the following steps. According to Ref. 8, a probabilistic superpixel gives the probability that its pixels belong to a certain class, so it fits well into probabilistic frameworks like CRF, as we will show later.

Although we have no prior knowledge, the pixel-level optical flow and binary mask can be converted into probabilistic superpixels to measure the foreground likelihood. Let $B$ be the pixel-level binary mask and $sp$ a superpixel with pixels $p \in sp$ and size $|sp|$. The likelihood of a superpixel to belong to the object, based on the binary mask, is defined as8

\[
L_{\mathrm{binary}}(sp) = \frac{1}{|sp|} \sum_{p \in sp} B(p). \tag{1}
\]

The optical flow of each superpixel is represented by the average optical flow of its inside pixels. Then the likelihood of a superpixel $sp$ (with optical flow vector $\mathbf{v}_{sp}$) to form the foreground, based on optical flow, is defined as

\[
L_{\mathrm{flow}}(sp) = \cos\langle \mathbf{v}_{sp}, \mathbf{r} \rangle \cdot \frac{\|\mathbf{v}_{sp}\|}{\|\mathbf{r}\|}, \tag{2}
\]

where $\langle \mathbf{v}_{sp}, \mathbf{r} \rangle$ denotes the angle between $\mathbf{v}_{sp}$ and $\mathbf{r}$. The reference optical flow vector $\mathbf{r}$ is defined as the mean optical flow of all the superpixels in the moving region. Finally, the superpixel-level optical flow and binary mask are normalized to represent the foreground and background probabilities by the following equations:

\[
P_{\mathrm{fg}} = \alpha \cdot L_{\mathrm{flow}} + (1 - \alpha) \cdot L_{\mathrm{binary}}, \tag{3}
\]
\[
P_{\mathrm{bg}} = 1 - P_{\mathrm{fg}}, \tag{4}
\]

where $\alpha \in (0, 1)$ represents the tradeoff between the features of the binary mask and the optical flow.
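Eqs. (1)–(4) can be sketched as follows. This is an illustrative implementation under our own naming (`probabilistic_superpixels`) and a toy uniform-flow input, not the paper's code:

```python
import numpy as np

def probabilistic_superpixels(labels, binary_mask, flow, alpha=0.5):
    """Convert the pixel-level binary mask and optical flow field into
    normalized per-superpixel probabilities, following Eqs. (1)-(4).
    `flow` has shape (H, W, 2); `labels` is a SLIC-style label table."""
    n = int(labels.max()) + 1
    L_binary = np.zeros(n)
    L_flow = np.zeros(n)
    r = flow.reshape(-1, 2).mean(axis=0)          # reference vector: mean flow
    for k in range(n):
        inside = labels == k
        L_binary[k] = binary_mask[inside].mean()  # Eq. (1)
        v = flow[inside].mean(axis=0)             # mean flow of the superpixel
        cos = v @ r / (np.linalg.norm(v) * np.linalg.norm(r) + 1e-12)
        L_flow[k] = cos * np.linalg.norm(v) / (np.linalg.norm(r) + 1e-12)  # Eq. (2)
    P_fg = alpha * L_flow + (1 - alpha) * L_binary  # Eq. (3)
    return P_fg, 1.0 - P_fg                          # Eq. (4)

labels = np.array([[0, 0], [1, 1]])
mask = np.ones((2, 2))
flow = np.ones((2, 2, 2))   # uniform motion: every pixel moves by (1, 1)
P_fg, P_bg = probabilistic_superpixels(labels, mask, flow)
```

With uniform flow and a full mask, every superpixel agrees with the reference vector, so the foreground probability is essentially 1.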

Superpixel-Based Conditional Random Field

CRF16 is a class of statistical modeling methods widely applied in computer vision. After superpixel segmentation, the foreground objects are usually oversegmented and consist of more than one superpixel. Therefore, it is essential to cluster and label the superpixels based on their features. Fortunately, CRF provides a natural way to incorporate superpixel-based features into a single unified model3 to learn the conditional distribution over the class labeling.

Let $G(S, E)$ be the adjacency graph of the superpixels $sp_i$ ($sp_i \in S$) in a frame, where $E$ is the set of edges formed between pairs of adjacent superpixels in the eight-connected neighborhood. Let $P(c \mid G; w)$ be the conditional probability10 of the set of class assignments $c$ given the adjacency graph $G(S, E)$ and a weight $w$:

\[
-\log[P(c \mid G; w)] = \sum_{sp_i \in S} \Psi(c_i \mid sp_i) + w \sum_{(sp_i, sp_j) \in E} \Phi(c_i, c_j \mid sp_i, sp_j), \tag{5}
\]

where $\Psi(\cdot)$ and $\Phi(\cdot)$ represent the unary potential and the pairwise edge potential, respectively.

The unary potential $\Psi(\cdot)$ defines the cost of labeling superpixel $sp_i$ with label $c_i$, and it is represented as

\[
\Psi(c_i \mid sp_i) = -\log[P_{\mathrm{fg}}(c_i, sp_i)]. \tag{6}
\]

The relationship between two adjacent superpixels $sp_i$ and $sp_j$ is modeled by the pairwise potential4 $\Phi(\cdot)$:

\[
\Phi(c_i, c_j \mid sp_i, sp_j) = [c_i \neq c_j] \exp(-\beta \|c_i - c_j\|^2), \tag{7}
\]
\[
\beta = \left( 2 \left\langle \|c_i - c_j\|^2 \right\rangle_{(sp_i, sp_j) \in E} \right)^{-1}, \tag{8}
\]

where $[\cdot]$ denotes the indicator function with values 0 or 1, $\|c_i - c_j\|^2$ is the squared L2 norm of the color difference between two adjacent nodes in LAB color space, and $\langle \cdot \rangle$ is the expectation operator.

The conditional probability can be optimized by graph cuts.17 Once the CRF model has been built, we minimize Eq. (5) with multilabel graph cuts18–20 based on an optimization library10 using the swap algorithm. This is quite efficient since the CRF model is defined on the superpixel-level graph.
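As a concrete illustration of the pairwise term, the following sketch computes the contrast parameter of Eq. (8) from the adjacent color differences and then evaluates Eq. (7). The `make_pairwise` helper and the edge-list layout are our assumptions, not the authors' code:

```python
import numpy as np

def make_pairwise(colors, edges):
    """Build the pairwise edge term of the superpixel CRF, Eqs. (7)-(8).
    `colors` holds one LAB color vector per superpixel; `edges` lists
    index pairs of adjacent superpixels."""
    sq = lambda i, j: float(np.sum((colors[i] - colors[j]) ** 2))
    # Eq. (8): beta is the inverse of twice the mean squared color difference
    beta = 1.0 / (2.0 * np.mean([sq(i, j) for i, j in edges]))

    def phi(ci, cj, i, j):
        # Eq. (7): a cost is paid only when adjacent labels disagree,
        # and it is largest where the colors are most similar
        return float(ci != cj) * np.exp(-beta * sq(i, j))
    return phi

# three superpixels: the first two are close in LAB, the third is distant
colors = np.array([[50.0, 0.0, 0.0], [52.0, 1.0, 0.0], [80.0, 20.0, 5.0]])
phi = make_pairwise(colors, edges=[(0, 1), (1, 2), (0, 2)])
```

Cutting the graph between two similar superpixels is thus expensive, which encourages label boundaries to follow color edges.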

Pools Construction

Now the superpixels are classified into two clusters: foreground and background. In order to learn the features of the object from the segmented result, the superpixels belonging to the foreground and background are separately selected to construct the object-like pool $o^{t-1}$ and the background-like pool $bg^{t-1}$:

\[
o^{t-1} = \{ sp_i \}, \quad sp_i \in \text{foreground}, \tag{9}
\]
\[
bg^{t-1} = \{ sp_j \}, \quad sp_j \in \text{background}, \tag{10}
\]
where $o^{t-1}$ and $bg^{t-1}$ are the independent object-like and background-like pools generated from the segmented result of the $(t-1)$'th frame. The color distribution and optical flow of each superpixel within the pools have already been recorded. Based on the temporal and spatial coherence of appearance and motion features, the real object in the next frame should be similar to the previously segmented foreground in both color and optical flow. Therefore, the two pools can be regarded as the "prior" knowledge for the object in the next frame. By comparing the features of the "new" superpixels in the current frame with those of the "old" superpixels in the two pools, we assign each "new" superpixel a likelihood of belonging to the foreground.
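The pool construction of Eqs. (9)–(10) is a simple partition of the previous frame's superpixels by their labels. A minimal sketch, with a hypothetical per-superpixel record carrying the histogram and flow:

```python
def build_pools(superpixels, is_foreground):
    """Split the segmented superpixels of frame t-1 into the object-like
    pool and the background-like pool, Eqs. (9)-(10). Each entry is
    assumed to carry the superpixel's color histogram and optical flow."""
    object_pool = [sp for sp, fg in zip(superpixels, is_foreground) if fg]
    background_pool = [sp for sp, fg in zip(superpixels, is_foreground) if not fg]
    return object_pool, background_pool

# hypothetical superpixel records with their foreground flags
sps = ["sp0", "sp1", "sp2", "sp3"]
obj, bg = build_pools(sps, [True, False, True, False])
```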

Foreground Likelihood

Based on the segmented result of the previous frame, the object-like pool $o^{t-1}$ formed by the $(t-1)$'th frame is available. As discussed above, $o^{t-1}$ can be regarded as the "prior" knowledge for the current frame $t$, from which the key features of the object can be learned. Let $sp_i^t$ be the $i$'th superpixel in frame $t$, with optical flow vector $\mathbf{v}_i^t$, and let $sp_k^{t-1}$ ($sp_k^{t-1} \in o^{t-1}$) be one of its nearest $M_k$ neighbors. The similarity of $sp_i^t$ to the object is denoted as

\[
S_o^t(sp_i^t) = \frac{1}{M_k} \sum_{sp_k^{t-1} \in N(sp_i^t)} H(sp_k^{t-1}) \cdot H(sp_i^t)^{\mathrm{T}} \cdot \exp\!\left[-\frac{D(sp_k^{t-1}, sp_i^t)}{\eta}\right], \tag{11}
\]

where $H(\cdot)$ is the histogram distribution, $D(\cdot)$ is the Euclidean distance between the optical flow vectors, and $\eta$ is the expectation of $D(\cdot)$.

Similarly, we repeat the aforementioned procedure with the background-like pool $bg^{t-1}$ and obtain the background similarity $S_{bg}^t$, so the likelihood of a certain superpixel in frame $t$ belonging to the foreground is

\[
L_{\mathrm{fg}}^t = \frac{S_o^t}{S_o^t + S_{bg}^t}. \tag{12}
\]

The comprehensive probability of the superpixels to form the foreground is represented as

\[
P_{\mathrm{fg}} = \beta \cdot L_{\mathrm{flow}} + \gamma \cdot L_{\mathrm{fg}} + (1 - \beta - \gamma) \cdot L_{\mathrm{binary}}, \tag{13}
\]

where $\beta$ and $\gamma$ weight the three features, with $\beta, \gamma \in (0, 1)$ and $(\beta + \gamma) \in (0, 1)$.

Then we jump to Sec. 2.3, where Pfg is calculated by Eq. (13) instead of Eq. (3). Just as before, a new superpixel-based CRF model is built and a new segmentation is implemented by graph cut.
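The pool-based similarity of Eq. (11) and its normalization in Eq. (12) can be sketched as follows; the function names and the dense-array layout of the pool are our own illustrative assumptions:

```python
import numpy as np

def object_similarity(sp_hist, sp_flow, pool_hists, pool_flows):
    """Similarity of one current-frame superpixel to the object-like pool,
    a sketch of Eq. (11): histogram dot products weighted by how close
    the optical flow vectors are, averaged over the pool neighbors."""
    D = np.linalg.norm(pool_flows - sp_flow, axis=1)  # Euclidean flow distances
    eta = D.mean() + 1e-12                            # expectation of D(.)
    sims = (pool_hists @ sp_hist) * np.exp(-D / eta)
    return float(sims.mean())

def foreground_likelihood(S_o, S_bg):
    """Eq. (12): normalize object similarity against background similarity."""
    return S_o / (S_o + S_bg)

# a superpixel identical to its two pool neighbors in histogram and flow
hist = np.array([0.5, 0.5])
pool_hists = np.array([[0.5, 0.5], [0.5, 0.5]])
flow = np.array([1.0, 0.0])
pool_flows = np.array([[1.0, 0.0], [1.0, 0.0]])
S_o = object_similarity(hist, flow, pool_hists, pool_flows)
```

Repeating `object_similarity` against the background-like pool gives $S_{bg}^t$, and `foreground_likelihood` then yields the $L_{fg}^t$ term that is mixed into Eq. (13).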

Exception Handling

The object-like pool works well most of the time, and the segmented results will theoretically improve frame by frame. However, when the previously segmented foreground is mixed with noise, it has a negative effect on the object-like pool. The error then accumulates in the current segmentation based on the inaccurate object-like pool, creating a vicious circle. This is most likely to happen at the first segmentation because the initial result is generally coarse. Therefore, measures should be taken to prevent error accumulation.

Let $R_n^t$ be the mean ratio of the number of superpixels in the object-like pool from frame $(t-n)$ to frame $t$:

\[
R_n^t = \frac{1}{n} \sum_{i=1}^{n} \frac{N_{sp}^{t-i+1}}{N_{sp}^{t-i}}, \tag{14}
\]

where $N_{sp}^t$ represents the number of foreground superpixels in frame $t$. Therefore, $R_1^t$ is the ratio of the foreground superpixels from frame $(t-1)$ to frame $t$. Let $R$ be the set of normal ratios. The state of the object-like pool is then represented as

\[
\mathrm{state} = \begin{cases} \text{normal}\ (R_1^t \in R), & \text{if } R_1^t / R_n^{t-1} \in (1-\lambda,\, 1+\lambda), \\ \text{abnormal}\ (R_1^t \notin R), & \text{otherwise}. \end{cases} \tag{15}
\]

The parameter n (n=3 recommended) denotes the number of previous reference frames, and the parameter λ (λ=0.2 in our experiments) sets the offset of the floor and ceiling bounds.

Once the state of the object-like pool is abnormal, the exception handling is activated. Then, we discard the object-like pool and the background-like pool and reinitialize the foreground likelihood based on Eq. (3) instead of Eq. (13). The exception handling mechanism is quite effective to avoid error accumulation.
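A simplified sketch of the exception check of Eqs. (14)–(15): the newest frame-to-frame ratio of foreground-superpixel counts is compared against the mean ratio over the previous n frame pairs. The `pool_state` helper and the list-of-counts interface are our assumptions:

```python
import numpy as np

def pool_state(fg_counts, n=3, lam=0.2):
    """Return 'normal' or 'abnormal' following the spirit of Eqs. (14)-(15).
    `fg_counts` lists the foreground-superpixel counts of the last n+2
    frames, oldest first."""
    ratios = np.array(fg_counts[1:], dtype=float) / np.array(fg_counts[:-1])
    R_1 = ratios[-1]                # ratio between the two newest frames
    R_n = ratios[:-1][-n:].mean()   # mean ratio over the n previous pairs, Eq. (14)
    # Eq. (15): normal only if the newest ratio stays within the band
    return "normal" if 1 - lam < R_1 / R_n < 1 + lam else "abnormal"
```

A sudden jump in the foreground size relative to its recent trend triggers the abnormal state, after which the pools are discarded and the likelihood is reinitialized from Eq. (3).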

Our algorithm is evaluated on several challenging sequences: "bungalows," "twoPositionPTZCam," "highway," "fall," "snowFall," and "blizzard." They come from the Change Detection 2014 dataset and cover a range of challenges: objects moving out of sight, direction changes, shadows, dynamic background, partial occlusion, bad weather, and similar colors. The proposed algorithm (ours) is compared with a binary mask (BM), ours-shortcut (ours-SC), and the PSP-MRF algorithm.8 Note that the ours-SC algorithm lacks the object-like pool and background-like pool that provide "prior" information for the next segmentation. In addition, only a few sequential frames (fewer than 25 in all the experiments) are chosen to run our unsupervised algorithm, because we do not need a large number of frames to build and update a background model or to serve as training frames. Furthermore, we only consider a single rigid moving object captured by a motionless camera in our experiments.

Qualitative Evaluation

The dataset provides various noises: "bungalows" shows the condition where the moving object is leaving the camera's visual field, so several frames capture only a part of the object. In "twoPositionPTZCam," the object continuously changes its moving direction around the corner. The car in "highway" suffers from shadows cast by the trees above, and "fall" presents the dynamic background of swaying leaves and partial occlusion by the middle tree. In "snowFall," heavy snow is falling in very bad weather. In "blizzard," the small car has a color similar to the snowy background.

Figure 2 shows the qualitative results of ours, ours-SC, PSP-MRF, and ground truth. BM results are not drawn because they are mostly fragmentary which will make the results cluttered. According to the visual evaluation, the PSP-MRF method performs the worst on average because of the incomplete and even fragmentary segmentations. Furthermore, ours-SC achieves better results than PSP-MRF, although it still lacks some detailed components of the object. By learning the object-like pool and background-like pool, our approach outperforms all the compared methods in terms of robustness and completeness.

Fig. 2

Visual segmentation results: (a) bungalows, (b) twoPositionPTZCam, (c) highway, (d) fall, (e) snowFall, and (f) blizzard. The results of ours, ours-SC, PSP-MRF, and ground truth are represented by red, green, blue, and yellow curves, respectively.

Quantitative Evaluation

The performances of different methods are evaluated by two measures: F-measure and percentage of wrong classification (PWC). F-measure is the harmonically weighted balance of precision and recall.21 F-measure and PWC are specifically defined as

\[
F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \tag{16}
\]
\[
\mathrm{PWC} = \frac{\mathrm{FN} + \mathrm{FP}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}, \tag{17}
\]
where TP, TN, FP, and FN are abbreviations for true positive, true negative, false positive, and false negative, respectively. The detailed quantitative performances are shown in Fig. 3. Although ours-SC shows comparatively good results in "snowFall" and "blizzard," it sometimes produces terrible results (see the result of "fall"). We conclude that it is not robust and neither is the PSP-MRF. Above all, the average scores of our method in terms of F-measure and PWC perform the best compared with the others.
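The two measures of Eqs. (16)–(17) reduce to a few lines; the function names are ours:

```python
def f_measure(tp, fp, fn):
    """Eq. (16): harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pwc(tp, tn, fp, fn):
    """Eq. (17): percentage of wrong classification, i.e. the
    fraction of misclassified pixels among all pixels."""
    return (fn + fp) / (tp + tn + fp + fn)
```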

Fig. 3

Performance comparison of different methods: the quantitative results for (a) bungalows, (b) twoPositionPTZCam, (c) highway, (d) fall, (e) snowFall, and (f) blizzard.

Impact of Binary Mask

Binary mask is one of the basic cues which is exploited by PSP-MRF, ours-SC, and ours. Specifically, it makes up the probabilistic superpixels in the PSP-MRF and occupies a weighted part in both ours-SC and ours, so their results are closely related to the binary mask. In the implementation of the binary mask, we use the temporal difference method. Although it is simple and sensitive for detecting changes, it has poor antinoise performance and outputs an incomplete object with “ghosts” (see the rapidly descending magenta line in “bungalows” in Fig. 3).

In Fig. 3, it is easy to see that the blue PSP-MRF line has a certain positive correlation with the magenta BM line. According to Ref. 8, the binary mask directly determines the unary term, which captures the likelihood of superpixels belonging to the foreground. As a result, the performance of PSP-MRF degrades when the binary mask goes bad. The ours-SC method fuses the optical flow and binary mask together, so its performance is partly influenced by the binary mask. Moreover, with the object-like pool and background-like pool, our method is only slightly influenced by the binary mask even when it goes bad (see the red line in "bungalows," "twoPositionPTZCam," and "blizzard" in Fig. 3). Overall, the proposed algorithm is the least sensitive to the performance of the binary mask.

Impact of Optical Flow

Similar to the binary mask discussed previously, optical flow constitutes one of the elements of ours-SC and ours. However, it is vulnerable to noise that may be generated from the illumination change or an area with the same color. For example, in the “fall” dataset of Fig. 3, the reflection of the ground increases the error of the optical flow and the green line goes bad quickly even though the binary mask is not so bad. In contrast, our algorithm remains the best under this condition. Similar to the binary mask, the proposed algorithm is also the least sensitive to the performance of the optical flow.

Effectiveness of Object-Like Pool

To further evaluate the effectiveness of our object-like pool, a comparison is conducted between the method with (ours) and without the object-like pool (ours-SC). According to the performance in Fig. 3, our proposed algorithm achieves the smoothest and highest F-measure curves and the least PWC on average, while the curves of ours-SC fluctuate heavily and perform worse than ours. The reason is that the object-like pool provides a reliable and continuous way to propagate the object against the noise from other features. Besides, the details of the objects with our algorithm can still be improved even when ours-SC has already achieved good results, as with the performances of “snowFall” and “blizzard” as shown in Fig. 3. In brief, the proposed method with an object-like pool achieves more robust and accurate results than the methods without the object-like pool.

Impact of Parameters Selection

To study the sensitivity of parameter selection, different parameters of α, β, and γ are chosen. Taking the typical “bungalows” as an example, we calculate the segmented results based on three groups of parameters and the performance is illustrated in Fig. 4. We call the “bungalows” typical because the last two frames have achieved comparatively satisfying optical flows but terrible binary masks, which are balanced by α, β, and γ. According to the F-measure curves in Fig. 4, the last two points of ours-SC descend quickly with the increasing weight of the binary mask. However, our approach still maintains an excellent performance even while being faced with the awful binary mask. Therefore, our approach is more robust than ours-SC in terms of the parameters.

Fig. 4

Performance comparison of different parameters. (a) High weight for optical flow and low weight for binary mask. (b) Equal weights of optical flow and binary mask. (c) Low weight for optical flow and high weight for binary mask.

Comparison of Computational Complexity

We compare the computational complexity of the different approaches to give a principled account of their time costs. We first establish the notation used.

  1. Let H and W be the height and width of the video frame.
  2. Let h and w be the height and width of the moving region.
  3. Let K be the total number of the superpixels.
  4. Let S be the number of the pixels between two adjacent seeds of the superpixels.
  5. Let T be the iterations of superpixel segmentation in the SLIC method.
  6. Let L be the length of the search range in the SLIC method.
  7. Let N be the number of the neighbors described in Eq. (11).

According to the detailed algorithm of SLIC, its running time is O(whTL²/K). We set T=10 and L=3 for the realization of SLIC in all the experiments, and K is generally larger than 100. Therefore, we have O(whTL²/K) ≈ O(wh) < O(WH). The proposed object-like pool and background-like pool cost O(NK) running time in total, in which we choose N=9 as the nine-connected neighbors in Eq. (11). Since the features of the binary mask and optical flow are defined at the superpixel level, they take at most O(K) ≪ O(WH) running time. The implementation of graph cut costs O(wh/S²)=O(K) running time because S=√(wh/K).

Based on the mentioned inferences, we compare our approach (ours) in terms of computational complexity with ours-SC, PSP-MRF, and BM in Table 1. We find that the computational complexity of all the methods is equal in polynomial time.

Table 1. Computational complexity of different methods.

We proposed a robust and effective method to improve unlabeled short video segmentation based on an object-like pool. Our approach exploits the temporal and spatial coherence of the appearance and motion features of the moving object to generate the foreground likelihood across frames. According to the qualitative and quantitative results, our approach outperforms the other compared methods in both accuracy and robustness, even when the binary mask and optical flow suffer from great interference.

However, the proposed algorithm still has some limitations. Occasionally we need to empirically tune the weighted parameters among different features to produce satisfactory results, so an intelligent and adaptive method to automatically generate weights should be developed. In addition, our method works worse for nonrigid objects than rigid objects because of the conflicting optical flow within them. Therefore, a more generalized algorithm should be proposed to solve this problem in further work.

This work is partly supported by the National Natural Science Foundation of China (14ZR1447200).

Huang  S. C., “An advanced motion detection algorithm with video quality analysis for video surveillance systems,” IEEE Trans. Circuits Syst. Video Technol.. 21, (1 ), 1 –14 (2011). 1051-8215 CrossRef
Mithun  N. C., , Rashid  N. U., and Rahman  S. M. M., “Detection and classification of vehicles from video using multiple time-spatial images,” IEEE Trans. Intell. Transp. Syst.. 13, (3 ), 1215 –1225 (2012). 1524-9050 CrossRef
Shotton  J.  et al., “Textonboost: joint appearance, shape and context modeling for multi-class object recognition and segmentation,” Lec. Notes Comput. Sci.. 3951, , 1 –15 (2006). 0302-9743 CrossRef
Zhang  D., , Javed  O., and Shah  M., “Video object segmentation through spatially accurate and temporally dense extraction of primary object regions,” in  Proc. 2013 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , pp. 628 –635 (2013).
He  X. M., , Zemel  R. S., and Carreira-Perpinan  M. A., “Multiscale conditional random fields for image labeling,” in  Proc. 2004 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition , Vol. 2, pp. 695 –702 (2004).
Tian  Z. Q.  et al., “Video object segmentation with shape cue based on spatiotemporal superpixel neighbourhood,” IET Comput. Vision. 8, (1 ), 16 –25 (2014). 1751-9632 CrossRef
Wang  X. F., and Zhang  X. P., “A new localized superpixel Markov random field for image segmentation,” in  Proc. IEEE Int. Conf. on Multimedia and Expo (ICME 2009) , Vol. 1–3, pp. 642 –645 (2009).
Schick  A., , Bauml  M., and Stiefelhagen  R., “Improving foreground segmentations with probabilistic superpixel Markov random fields,” in  Proc. 2012 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW) , pp. 27 –31 (2012).
Shu  G., , Dehghan  A., and Shah  M., “Improving an object detector and extracting regions using superpixels,” in  Proc. 2013 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , pp. 3721 –3727 (2013).
Fulkerson  B., , Vedaldi  A., and Soatto  S., “Class segmentation and object localization with superpixel neighborhoods,” in  Proc. 2009 IEEE 12th Int. Conf. on Computer Vision (ICCV) , pp. 670 –677 (2009).
Ren  X. F., and Malik  J., “Learning a classification model for segmentation,” in  Proc. Ninth IEEE Int. Conf. on Computer Vision , Vol. 1, pp. 10 –17 (2003).
Levinshtein  A.  et al., “Turbopixels: fast superpixels using geometric flows,” IEEE Trans. Pattern Anal. Mach. Intell.. 31, (12 ), 2290 –2297 (2009). 0162-8828 CrossRef
Yuan  Y., , Fang  J. W., and Wang  Q., “Robust superpixel tracking via depth fusion,” IEEE Trans. Circuits Syst. Video Technol.. 24, (1 ), 15 –26 (2014). 1051-8215 CrossRef
Yang  F., , Lu  H. C., and Yang  M. H., “Robust superpixel tracking,” IEEE Trans. Image Process.. 23, (4 ), 1639 –1651 (2014). 1057-7149 CrossRef
Achanta  R.  et al., “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell.. 34, (11 ), 2274 –2281 (2012). 0162-8828 CrossRef
Sutton  C., , McCallum  A., and Rohanimanesh  K., “Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data,” J. Mach. Learn. Res.. 8, , 693 –723 (2007). 1532-4435 
Boykov  Y., and Funka-Lea  G., “Graph cuts and efficient N-D image segmentation,” Int. J. Comput. Vision. 70, (2 ), 109 –131 (2006). 0920-5691 CrossRef
Kolmogorov  V., and Zabih  R., “What energy functions can be minimized via graph cuts?” IEEE Trans. Pattern Anal. Mach. Intell.. 26, (2 ), 147 –159 (2004). 0162-8828 CrossRef
Boykov  Y., , Veksler  O., and Zabih  R., “Fast approximate energy minimization via graph cuts,” IEEE Trans. Pattern Anal. Mach. Intell.. 23, (11 ), 1222 –1239 (2001). 0162-8828 CrossRef
Boykov  Y., and Kolmogorov  V., “An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision,” IEEE Trans. Pattern Anal. Mach. Intell.. 26, (9 ), 1124 –1137 (2004). 0162-8828 CrossRef
Maddalena  L., and Petrosino  A., “A self-organizing approach to background subtraction for visual surveillance applications,” IEEE Trans. Image Process.. 17, (7 ), 1168 –1177 (2008). 1057-7149 CrossRef

Xiaoliu Cheng received his BS degree in electronic science and technology from Zhengzhou University, Zhengzhou, China, in 2011. From 2011 to 2012, he studied signal processing at the University of Science and Technology of China, Hefei, China. Currently, he is pursuing his PhD degree at the Shanghai Institute of Microsystem and Information Technology (SIMIT), Chinese Academy of Sciences (CAS), Shanghai, China. His research interests include computer vision, machine learning, and wireless sensor networks.

Wei Lv received her MS degree from Harbin Engineering University, Harbin, China, in 2007. She is an assistant researcher at SIMIT, CAS, Shanghai, China. Her research interests include image processing and wireless sensor networks.

Huawei Liu received his MS degree from Harbin Engineering University, Harbin, China, in 2008. He is an assistant researcher at SIMIT, CAS, Shanghai, China. His research interests include sensor signal processing and wireless sensor networks.

Xing You received her PhD from SIMIT, CAS, Shanghai, China, in 2013. She is an assistant professor at SIMIT, CAS, Shanghai, China. Her research interests include video processing and information hiding.

Baoqing Li received his PhD from the State Key Laboratory of Transducer Technology, Shanghai Institute of Metallurgy, CAS, Shanghai, China, in 1999. Currently, he is a professor at SIMIT, CAS, Shanghai, China. His research interests include signal processing, microelectromechanical systems, and wireless sensor networks.

Xiaobing Yuan received his PhD from the Changchun Institute of Optics, Fine Mechanics and Physics, CAS, Changchun, China, in 2000. Currently, he is a professor at SIMIT, CAS, Shanghai, China. His research interests include wireless sensor networks, and information transmission and processing.

© The Authors. Published by SPIE under a Creative Commons Attribution 3.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.

Citation

Xiaoliu Cheng, Wei Lv, Huawei Liu, Xing You, Baoqing Li, et al., "Improving video foreground segmentation with an object-like pool," J. Electron. Imaging 24(2), 023034 (Apr 23, 2015). http://dx.doi.org/10.1117/1.JEI.24.2.023034


Figures

Fig. 1: Overview of our approach: (a) input sequential frames, (b) moving region, (c) binary mask, (d) superpixel-level optical flow, (e) foreground likelihood, (f) segmented results, (g) object-like pool, and (h) background-like pool.
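The conversion of pixel-level features into superpixel-level foreground evidence, as depicted in stages (c)-(e) above, can be sketched as follows. This is a hypothetical simplification in which each superpixel's probability is the mean of its pixels' optical-flow magnitude and binary-mask values; the paper's actual probabilistic-superpixel construction may differ, and the function name `superpixel_likelihood` is an illustrative assumption.

```python
import numpy as np

def superpixel_likelihood(flow_mag, binary_mask, labels):
    """Average pixel-level motion magnitude and mask values within each
    superpixel to obtain per-superpixel foreground probabilities
    (illustrative simplification, not the paper's exact formulation)."""
    n = labels.max() + 1
    flow_prob = np.zeros(n)
    mask_prob = np.zeros(n)
    for s in range(n):
        idx = labels == s          # pixels belonging to superpixel s
        flow_prob[s] = flow_mag[idx].mean()
        mask_prob[s] = binary_mask[idx].mean()
    return flow_prob, mask_prob

# Toy 4x4 frame partitioned into four 2x2 superpixels (labels 0-3).
labels = np.repeat(np.repeat(np.arange(4).reshape(2, 2), 2, axis=0), 2, axis=1)
flow = np.zeros((4, 4)); flow[:2, :2] = 1.0   # motion only in superpixel 0
mask = np.zeros((4, 4)); mask[:2, :2] = 1.0   # binary mask agrees
fp, mp = superpixel_likelihood(flow, mask, labels)
print(fp, mp)  # superpixel 0 carries all the foreground evidence
```

In practice the superpixel labels would come from an oversegmentation such as SLIC (cited in the references), but a fixed block partition suffices to show the aggregation step.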

Fig. 2: Visual segmentation results: (a) bunglows, (b) twoPositionPTZCam, (c) highway, (d) fall, (e) snowfall, and (f) blizzard. The results of ours, ours-SC, PSP-MRF, and ground truth are represented by red, green, blue, and yellow curves, respectively.

Fig. 3: Performance comparison of different methods. Quantitative results on (a) bunglows, (b) twoPositionPTZCam, (c) highway, (d) fall, (e) snowfall, and (f) blizzard.

Fig. 4: Performance comparison of different parameter settings: (a) high weight for optical flow and low weight for binary mask, (b) equal weights for optical flow and binary mask, and (c) low weight for optical flow and high weight for binary mask.
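The effect of weighting optical flow against the binary mask, as compared in Fig. 4, can be illustrated with a toy fusion step. The weights, the threshold, and the function name `fuse` are illustrative assumptions, not the paper's actual values or formulation:

```python
import numpy as np

def fuse(flow_prob, mask_prob, w_flow=0.5, w_mask=0.5, thresh=0.5):
    """Combine the two per-superpixel likelihoods by a weighted sum and
    threshold the score to label foreground (hypothetical sketch)."""
    score = w_flow * flow_prob + w_mask * mask_prob
    return score > thresh

flow_p = np.array([0.9, 0.2, 0.8, 0.1])  # motion evidence per superpixel
mask_p = np.array([0.8, 0.1, 0.3, 0.8])  # mask evidence per superpixel
print(fuse(flow_p, mask_p))              # equal weights, as in Fig. 4(b)
print(fuse(flow_p, mask_p, 0.2, 0.8))    # mask-dominated, as in Fig. 4(c)
```

Shifting weight toward one feature flips the label of superpixels where the two cues disagree (the third and fourth entries here), which is the trade-off the figure examines.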

Tables

Table 1: Computational complexity of different methods.

References

Huang S. C., "An advanced motion detection algorithm with video quality analysis for video surveillance systems," IEEE Trans. Circuits Syst. Video Technol. 21(1), 1-14 (2011).
Mithun N. C., Rashid N. U., and Rahman S. M. M., "Detection and classification of vehicles from video using multiple time-spatial images," IEEE Trans. Intell. Transp. Syst. 13(3), 1215-1225 (2012).
Shotton J. et al., "TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation," Lect. Notes Comput. Sci. 3951, 1-15 (2006).
Zhang D., Javed O., and Shah M., "Video object segmentation through spatially accurate and temporally dense extraction of primary object regions," in Proc. 2013 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 628-635 (2013).
He X. M., Zemel R. S., and Carreira-Perpinan M. A., "Multiscale conditional random fields for image labeling," in Proc. 2004 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, Vol. 2, pp. 695-702 (2004).
Tian Z. Q. et al., "Video object segmentation with shape cue based on spatiotemporal superpixel neighbourhood," IET Comput. Vision 8(1), 16-25 (2014).
Wang X. F. and Zhang X. P., "A new localized superpixel Markov random field for image segmentation," in Proc. IEEE Int. Conf. on Multimedia and Expo (ICME 2009), Vol. 1-3, pp. 642-645 (2009).
Schick A., Bauml M., and Stiefelhagen R., "Improving foreground segmentations with probabilistic superpixel Markov random fields," in Proc. 2012 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 27-31 (2012).
Shu G., Dehghan A., and Shah M., "Improving an object detector and extracting regions using superpixels," in Proc. 2013 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 3721-3727 (2013).
Fulkerson B., Vedaldi A., and Soatto S., "Class segmentation and object localization with superpixel neighborhoods," in Proc. 2009 IEEE 12th Int. Conf. on Computer Vision (ICCV), pp. 670-677 (2009).
Ren X. F. and Malik J., "Learning a classification model for segmentation," in Proc. Ninth IEEE Int. Conf. on Computer Vision, Vol. 1, pp. 10-17 (2003).
Levinshtein A. et al., "TurboPixels: fast superpixels using geometric flows," IEEE Trans. Pattern Anal. Mach. Intell. 31(12), 2290-2297 (2009).
Yuan Y., Fang J. W., and Wang Q., "Robust superpixel tracking via depth fusion," IEEE Trans. Circuits Syst. Video Technol. 24(1), 15-26 (2014).
Yang F., Lu H. C., and Yang M. H., "Robust superpixel tracking," IEEE Trans. Image Process. 23(4), 1639-1651 (2014).
Achanta R. et al., "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2274-2281 (2012).
Sutton C., McCallum A., and Rohanimanesh K., "Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data," J. Mach. Learn. Res. 8, 693-723 (2007).
Boykov Y. and Funka-Lea G., "Graph cuts and efficient N-D image segmentation," Int. J. Comput. Vision 70(2), 109-131 (2006).
Kolmogorov V. and Zabih R., "What energy functions can be minimized via graph cuts?" IEEE Trans. Pattern Anal. Mach. Intell. 26(2), 147-159 (2004).
Boykov Y., Veksler O., and Zabih R., "Fast approximate energy minimization via graph cuts," IEEE Trans. Pattern Anal. Mach. Intell. 23(11), 1222-1239 (2001).
Boykov Y. and Kolmogorov V., "An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision," IEEE Trans. Pattern Anal. Mach. Intell. 26(9), 1124-1137 (2004).
Maddalena L. and Petrosino A., "A self-organizing approach to background subtraction for visual surveillance applications," IEEE Trans. Image Process. 17(7), 1168-1177 (2008).
