Video frame-matching algorithm using dynamic programming

Young-Yoon Lee; Chang-Su Kim; Sang-Uk Lee

doi:10.1117/1.3092367

Published: 1 January 2009

Journal of Electronic Imaging, Vol. 18, Issue 1, 010504 (January 2009). https://doi.org/10.1117/1.3092367

Abstract

We propose a frame-matching algorithm for video sequences that have been modified from their originals through frame removal, insertion, shuffling, and data compression. The proposed matching algorithm defines an effective matching cost function and minimizes the cost using dynamic programming. Experimental results show that the proposed algorithm provides a significantly lower probability of matching errors than conventional algorithms.

1. Introduction

Video matching techniques are motivated by various applications, including video retrieval, surveillance, and watermarking. In video retrieval,^{1, 2} frame matching is used to retrieve video clips, similar to a query clip, from a database. In video surveillance using multiple cameras,³ it is necessary to establish the correspondences in time and space to combine asynchronous multiview sequences. In video watermarking,^{4, 5, 6} temporal synchronization should be established between the original and watermarked videos before the extraction of frame-dependent watermarks.

Video matching techniques have different accuracy requirements, depending on their target applications. In video retrieval,^{1, 2} video matching need not be accurate at the frame level. It is acceptable that a retrieved frame does not match a query frame exactly, as long as they contain similar contents. In contrast, in video watermarking, more accurate matching is required, and two video sequences should be synchronized at the frame level, since the matching is a preprocessing step before the extraction of frame-dependent watermarks.⁸ For instance, in Ref. 6, the pattern of skipped frames is employed as a watermark, and the accurate frame-level synchronization between an original video and its watermarked version is an essential step in watermark detection.

In watermarking applications, Delannay, de Roover, and Macq⁴ used an affine model to align video sequences. An affine model, however, cannot effectively describe irregular frame insertion and removal or frame shuffling, which are employed to attack watermarked video sequences. Cheng⁵ proposed a temporal registration algorithm to correct temporal misalignment, which occurs in frame rate conversion or video capturing. The two-frame integration model was employed to represent temporally overlapping frames in the acquisition procedures.

In video watermarking, the original video can be modified by various temporal attacks, such as frame removal, insertion, and shuffling. To establish frame-level synchronization between original and modified sequences in video watermarking applications, we propose a dynamic programming algorithm that matches each frame in the modified sequence to the corresponding frame in the original sequence. Although Ref. 5 also employed dynamic programming to match two sequences, it used only frame dissimilarity in the matching. In contrast, we introduce the notion of an adaptive unmatched cost in addition to the matching cost to achieve more accurate matching. Moreover, the unmatched cost is used to deal with frame shuffling attacks, as well as frame removal and insertion attacks.

2. Matching Function and Matching Cost

Suppose that a video clip $X = \{X_{1}, \dots, X_{l_{X}}\}$, where $X_{j}$ denotes the $j$th frame, is modified from the original clip $Y = \{Y_{1}, \dots, Y_{l_{Y}}\}$. A frame $X_{j}$ may be different from any frame $Y_{k}$ in the original clip, since the clip $X$ may be a compressed version of $Y$. Moreover, some frames can be removed from or inserted into $X$, and then the resulting frames can be shuffled. We assume that the number of removed frames does not exceed a threshold $ν_{R}$. Similarly, let $ν_{I}$ and $ν_{S}$ denote the maximum numbers of inserted frames and shuffled frames, respectively.

We say that a frame $X_{j}$ in $X$ matches a frame $Y_{k}$ in $Y$ if $X_{j}$ originates from $Y_{k}$. On the other hand, $X_{j}$ is called an unmatched frame if $X_{j}$ does not match any frame in $Y$. Then, the temporal modification of a video can be represented by a function on the space of frame indices. Specifically, we define the matching function $λ$ as

Eq. 1

$$λ (j) = \begin{cases} k, & \text{if } X_{j} \text{ matches } Y_{k}, \\ 0, & \text{if } X_{j} \text{ is an unmatched frame}, \end{cases}$$

where $j$ belongs to the frame index set $\{1, 2, \dots, l_{X}\}$ of $X$. For example, Table 1 shows a matching function when $l_{X} = 9$ and $l_{Y} = 10$. In this case, $X$ is modified from $Y$ by removing $Y_{2}$ and $Y_{6}$, inserting a new frame before $Y_{7}$, and then reversing the order of $Y_{8}$ and $Y_{9}$.

Table 1

A matching function λ(j) that describes the temporal alignment between two sequences.

j	1	2	3	4	5	6	7	8	9
$λ (j)$	1	3	4	5	0	7	9	8	10
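
As a small illustration of how Table 1 encodes the temporal modification, the sketch below reads the inserted, removed, and reordered frames back from the matching function. The helper name `describe_matching` and the list representation (entry `j - 1` holds $λ (j)$) are assumptions made for this example, not part of the original paper.

```python
def describe_matching(lam, l_Y):
    """Summarize a matching function given as a list where lam[j - 1] is λ(j)."""
    # Frames of X with λ(j) = 0 have no counterpart in Y, i.e. they were inserted.
    inserted = [j for j, k in enumerate(lam, start=1) if k == 0]
    matched = [k for k in lam if k > 0]
    # Frames of Y that no frame of X points to were removed.
    removed = [k for k in range(1, l_Y + 1) if k not in matched]
    # Consecutive matched frames whose original indices decrease were reordered.
    reordered = [(a, b) for a, b in zip(matched, matched[1:]) if a > b]
    return inserted, removed, reordered


lam_table1 = [1, 3, 4, 5, 0, 7, 9, 8, 10]
inserted, removed, reordered = describe_matching(lam_table1, l_Y=10)
print(inserted)   # [5]
print(removed)    # [2, 6]
print(reordered)  # [(9, 8)]
```

The output agrees with the description of Table 1: $X_{5}$ is an inserted frame, $Y_{2}$ and $Y_{6}$ are removed, and $Y_{8}$ and $Y_{9}$ appear in reversed order.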

Let $d_{m} (j, k)$ denote the matching cost between two frames $X_{j}$ and $Y_{k}$, which measures the dissimilarity between $X_{j}$ and $Y_{k}$. The mean squared error (MSE) is employed as the matching cost.
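
A minimal sketch of the MSE matching cost follows. It assumes grayscale frames stored as lists of pixel rows; the function name `mse` and this frame representation are illustrative choices, since the paper does not specify color handling or data layout.

```python
def mse(frame_a, frame_b):
    """Mean squared error between two equally sized frames (lists of pixel rows)."""
    if len(frame_a) != len(frame_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(frame_a, frame_b)
    ):
        raise ValueError("frames must have the same dimensions")
    total, count = 0.0, 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


x_j = [[10, 20], [30, 40]]
y_k = [[12, 20], [30, 36]]
print(mse(x_j, y_k))  # 5.0
```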

3. Proposed Matching Algorithm

Given a video clip $X$ and its original $Y$, we attempt to find a matching function $λ (j)$ such that $Y_{λ (j)}$ in $Y$ is identical or similar to $X_{j}$ in $X$ for each $1 ⩽ j ⩽ l_{X}$. The simplest matching can be achieved by the local minimization of matching costs (LMMC) via

Eq. 2

$$λ_{\text{LMMC}} (j) = \underset{1 ⩽ k ⩽ l_{Y}}{\arg \min} \; d_{m} (j, k) \quad \text{for } 1 ⩽ j ⩽ l_{X}.$$

To recover from frame insertion attacks, $X_{j}$ can be declared to be an unmatched frame if the minimum cost $\min_{1 ⩽ k ⩽ l_{Y}} d_{m} (j, k)$ is larger than a threshold. However, in this LMMC approach, matching functions may not be one to one, and estimation results are very sensitive to the threshold.

We propose a globally optimal matching algorithm that minimizes the total matching cost subject to the one-to-one matching constraint. First, we assume that the original video clip is temporally modified by frame removal and insertion attacks only, and find the best matching function $λ^{*}$ under this monotonicity assumption. Later, in the second phase, we detect frame shuffling attacks by further matching the unmatched frames of $λ^{*}$ without imposing the monotonicity constraint.

Assuming frame removal and insertion attacks only, the matching function can be estimated by minimizing the sum of the costs of matched frames and unmatched frames, given by

Eq. 3

$$λ^{*} = \underset{λ ∊ Λ}{\arg \min} \left\{ \sum_{j; λ (j) > 0} d_{m} [j, λ (j)] + \sum_{l; λ (l) = 0} d_{u} (l) \right\},$$

where $Λ = \{λ ∣ λ (j) < λ (j') \text{ if } j < j', λ (j) > 0, λ (j') > 0\}$. Note that a function $λ$ in $Λ$ is monotonically increasing on the reduced domain $\{j : 1 ⩽ j ⩽ l_{X} \text{ and } λ (j) > 0\}$, which excludes the unmatched frame indices. The unmatched cost of the $l$th frame is defined as

Eq. 4

$$d_{u} (l) = \max_{j ∊ [l - δ, l + δ]; λ (j) > 0} \; \min_{1 ⩽ k ⩽ l_{Y}} d_{m} (j, k).$$

Notice that the unmatched cost of the $l$th frame is adaptively determined by the matched costs of the neighboring frames inside a temporal window $[l - δ, l + δ]$, where $δ$ denotes the window size. In general, a large $d_{u} (l)$ indicates that the neighboring frames are heavily compressed or they contain complex textures and motions. Thus, a large $d_{u} (l)$ makes it easier to classify the $l$th frame as a matched frame.
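
The following sketch evaluates Eq. 4 for one frame. Because Eq. 4 refers to the set of matched frames, the sketch takes a provisional set `matched` of 1-based indices with $λ (j) > 0$ as input; how that set is obtained before the optimization is not stated in the text, so this interface, the function name `unmatched_cost`, and the error raised for an empty window are all assumptions.

```python
def unmatched_cost(d, l, delta, matched):
    """Adaptive unmatched cost d_u(l) of Eq. 4 for the 1-based frame index l.

    d[j - 1][k - 1] is d_m(j, k); matched is a set of 1-based indices j that
    are (provisionally) regarded as matched frames.
    """
    l_X = len(d)
    # Clip the temporal window [l - delta, l + delta] to valid frame indices.
    window = [
        j for j in range(max(1, l - delta), min(l_X, l + delta) + 1) if j in matched
    ]
    if not window:
        # Eq. 4 is undefined here; the fallback is left to the caller (assumption).
        raise ValueError("no matched frame inside the window")
    return max(min(d[j - 1]) for j in window)


d = [
    [6, 50, 60],
    [40, 2, 55],
    [70, 80, 90],
    [60, 58, 4],
]
matched = {1, 2, 4}
print(unmatched_cost(d, l=3, delta=1, matched=matched))  # 4
print(unmatched_cost(d, l=3, delta=2, matched=matched))  # 6
```

Widening the window from $δ = 1$ to $δ = 2$ brings $X_{1}$ into the window, whose minimum cost of 6 then sets the unmatched cost.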

In Eq. 3, when $i$ frames, $\max \{l_{X} - l_{Y}, 0\} ⩽ i ⩽ ν_{I}$, are inserted into $X$, there are $\binom{l_{X}}{i}$ possible choices of unmatched frames and $\binom{l_{Y}}{l_{X} - i}$ possible choices of matched frames. Therefore, the complexity of the exhaustive minimization in Eq. 3 is proportional to $\sum_{i = \max \{0, l_{X} - l_{Y}\}}^{ν_{I}} \binom{l_{X}}{i} \binom{l_{Y}}{l_{X} - i}$, which is too demanding in most applications.
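
To give a feel for this count, the sketch below evaluates the sum with Python's `math.comb`. The function name `exhaustive_candidates` and the chosen clip lengths are illustrative assumptions; the first call uses the sizes from Table 1 with $ν_{I} = 3$.

```python
from math import comb


def exhaustive_candidates(l_X, l_Y, nu_I):
    """Number of candidate matching functions examined by exhaustive search."""
    lowest = max(0, l_X - l_Y)
    highest = min(nu_I, l_X)
    return sum(
        comb(l_X, i) * comb(l_Y, l_X - i) for i in range(lowest, highest + 1)
    )


print(exhaustive_candidates(9, 10, 3))  # 22375
print(exhaustive_candidates(100, 100, 3))
```

Even the nine-frame example already yields more than twenty thousand candidates, and the count grows rapidly with the clip length.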

To reduce the complexity, we minimize the total cost function in Eq. 3 based on dynamic programming. The dynamic programming method in this work is similar to that for computing the Levenshtein distance, also called the edit distance, between two text strings.^{10, 11} The edit distance counts the minimum number of letter substitutions, insertions, or removals needed to convert one text string into another. Similarly, the proposed algorithm matches a video sequence to another sequence by considering frame insertions and removals. However, whereas the distance between two letters is binary (identical or not), the distance between two frames should represent the similarity of those two frames. Thus, as compared with letter insertions or removals, it is more difficult to identify frame insertions or removals. To overcome this difficulty, the proposed algorithm uses the MSE as the matching cost and the unmatched cost in Eq. 4, and minimizes the total cost in Eq. 3 to achieve reliable video matching.

Suppose that the first $j$ frames of $X$ are obtained by removing $r$ frames from the first $k$ frames of $Y$ and then inserting $i$ new frames. Note that $k = j - i + r$, since $(j - i)$ frames in $X$ match $(k - r)$ frames in $Y$. Let $s (j; i, r)$ denote the minimum sum of the matching costs for the first $j$ frames in $X$ when $i$ frames are inserted and $r$ frames are removed.

We compute $s (j; i, r)$ recursively. First, we initialize $s (j; i, r) = \infty$ if $j < 1$, $i < 0$, or $r < 0$, with the exception $s (0; 0, 0) = 0$. Then, $s (j; i, r)$ can be obtained by finding the minimum among three cases

Eq. 5

$$s (j; i, r) = \min \left\{ \begin{array}{l} s (j - 1; i, r) + d_{m} (j, k), \\ s (j - 1; i - 1, r) + d_{u} (j), \\ s (j; i, r - 1), \end{array} \right.$$

where $k = j - i + r$. When $X_{j}$ matches $Y_{k}$, the matching cost $d_{m} (j, k)$ is added to $s (j - 1; i, r)$, which is the first term in Eq. 5. When $X_{j}$ is an inserted frame, the unmatched cost $d_{u} (j)$ is added to $s (j - 1; i - 1, r)$. If $Y_{k}$ is a removed frame, $s (j; i, r - 1)$ remains unchanged. In this way, we compute all $s (j; i, r)$ inductively. Finally, the minimum sum of costs for the whole sequence is given by $\min_{\max \{0, l_{X} - l_{Y}\} ⩽ i ⩽ ν_{I}} s (l_{X}; i, l_{Y} - l_{X} + i)$. While computing the partial costs in Eq. 5, we also record the minimum conditions, which we can trace back to get the matching function in Eq. 3.
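
The recursion in Eq. 5 and the traceback can be sketched as below. This is an illustrative implementation under stated assumptions, not the authors' code: the unmatched costs `d_u` are passed in as precomputed values, one-pixel "frames" with squared differences stand in for the MSE, and the removal case is also applied at $j = 0$ so that leading frames of $Y$ can be removed (the text does not spell out this boundary case). The name `dp_match` is hypothetical.

```python
import math


def dp_match(d_m, d_u, nu_I, nu_R):
    """Minimize the total cost of Eq. 3 with the recursion of Eq. 5.

    d_m[j - 1][k - 1] is d_m(j, k) and d_u[j - 1] is d_u(j). Returns the
    minimum cost and a list lam with lam[j - 1] = λ*(j) (0 means unmatched).
    """
    l_X, l_Y = len(d_m), len(d_m[0])
    INF = math.inf
    s = [[[INF] * (nu_R + 1) for _ in range(nu_I + 1)] for _ in range(l_X + 1)]
    move = [[[None] * (nu_R + 1) for _ in range(nu_I + 1)] for _ in range(l_X + 1)]
    s[0][0][0] = 0.0
    for j in range(l_X + 1):
        for i in range(nu_I + 1):
            for r in range(nu_R + 1):
                k = j - i + r
                if (j, i, r) == (0, 0, 0) or not 0 <= k <= l_Y:
                    continue
                options = []
                if j >= 1 and k >= 1:  # X_j matches Y_k
                    options.append((s[j - 1][i][r] + d_m[j - 1][k - 1], "match"))
                if j >= 1 and i >= 1:  # X_j is an inserted frame
                    options.append((s[j - 1][i - 1][r] + d_u[j - 1], "insert"))
                if r >= 1:  # Y_k is a removed frame
                    options.append((s[j][i][r - 1], "remove"))
                if options:
                    s[j][i][r], move[j][i][r] = min(options, key=lambda o: o[0])
    # Pick the best final state s(l_X; i, l_Y - l_X + i).
    best_cost, best_i = INF, None
    for i in range(max(0, l_X - l_Y), nu_I + 1):
        r = l_Y - l_X + i
        if 0 <= r <= nu_R and s[l_X][i][r] < best_cost:
            best_cost, best_i = s[l_X][i][r], i
    if best_i is None:
        raise ValueError("no feasible matching within the given bounds")
    # Trace the recorded decisions back to recover the matching function.
    lam = [0] * l_X
    j, i, r = l_X, best_i, l_Y - l_X + best_i
    while (j, i, r) != (0, 0, 0):
        step = move[j][i][r]
        if step == "match":
            lam[j - 1] = j - i + r
            j -= 1
        elif step == "insert":
            j, i = j - 1, i - 1
        else:
            r -= 1
    return best_cost, lam


# One-pixel frames: Y_2 and Y_5 are removed, and 35 is inserted after 30.
y = [10, 20, 30, 40, 50, 60]
x = [11, 29, 35, 41, 60]
d_m = [[(a - b) ** 2 for b in y] for a in x]
d_u = [4.0] * len(x)  # assumed constant unmatched cost for this toy case
cost, lam = dp_match(d_m, d_u, nu_I=3, nu_R=3)
print(cost, lam)  # 7.0 [1, 3, 0, 4, 6]
```

The table `s` has one entry per state $(j; i, r)$, and each entry is filled from at most three previously computed entries, which is what keeps the search tractable compared with the exhaustive enumeration.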

Next, let us consider frame shuffling attacks, in addition to frame removal and insertion attacks. In Eqs. 3 and 5, we impose the monotonicity constraint that the matching function for matched frames is monotonically increasing, so shuffled frames can be classified as unmatched frames. Therefore, when $X_{j}$ is classified as unmatched, it can be further matched to a frame in $Y$ that is not matched by the function in Eq. 3. More specifically, $λ^{*} (j)$ is refined by

Eq. 6

$$λ^{*} (j) = \underset{k ∊ U}{\arg \min} \; d_{m} (j, k) \quad \text{if } \min_{k ∊ U} d_{m} (j, k) < d_{u} (j),$$

where $U$ denotes the indices of frames in $Y$ that are not matched by Eq. 3. Notice that we reduce the total matching cost by changing the category of $X_{j}$ from an unmatched frame to a shuffled frame when its matching cost is smaller than the unmatched cost.
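
The refinement in Eq. 6 can be sketched as a second pass over the unmatched frames. The function name `refine_shuffled` is hypothetical, and removing an index from $U$ once it has been assigned is an assumption made here to keep the result one to one; the text does not state how $U$ is updated.

```python
def refine_shuffled(lam, d_m, d_u):
    """Second phase (Eq. 6): rematch unmatched frames of X to unmatched frames of Y.

    lam[j - 1] is the first-phase λ*(j), d_m[j - 1][k - 1] is d_m(j, k), and
    d_u[j - 1] is d_u(j). Returns a refined copy of lam.
    """
    l_Y = len(d_m[0])
    refined = list(lam)
    unmatched_Y = set(range(1, l_Y + 1)) - {k for k in lam if k > 0}
    for j, k in enumerate(lam, start=1):
        if k != 0 or not unmatched_Y:
            continue
        k_best = min(sorted(unmatched_Y), key=lambda c: d_m[j - 1][c - 1])
        if d_m[j - 1][k_best - 1] < d_u[j - 1]:
            refined[j - 1] = k_best
            unmatched_Y.remove(k_best)  # assumption: keep the matching one to one
    return refined


# One-pixel frames: Y_2 and Y_3 are swapped, 45 is inserted, and Y_6 is removed.
y = [10, 20, 30, 40, 50, 60]
x = [10, 30, 20, 40, 45, 50]
d_m = [[(a - b) ** 2 for b in y] for a in x]
d_u = [4.0] * len(x)  # assumed constant unmatched cost for this toy case
first_phase = [1, 3, 0, 4, 0, 5]  # a monotonic matching that leaves X_3 and X_5 unmatched
print(refine_shuffled(first_phase, d_m, d_u))  # [1, 3, 2, 4, 0, 5]
```

Here $X_{3}$ is recovered as a shuffled copy of $Y_{2}$, while the inserted frame $X_{5}$ stays unmatched because its best remaining candidate costs more than its unmatched cost.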

4. Experimental Results

The performance of the proposed algorithm is evaluated using five common intermediate format (CIF) sequences, Foreman, Paris, Mobile, Tempete, and Carphone, each of which consists of 100 frames. We also use 33,220 video clips, selected from 20 Korean movies.¹² Each clip consists of 100 frames of resolution $360 \times 240$ and has a frame rate of either 24 or 30 frames per second. Thus, 3,322,500 frames are used in total.

To match a video clip with another clip, we minimize the total cost in Eq. 3. It can be shown that the dynamic programming method requires $O (l_{X} l_{Y})$ recursion steps. We use a personal computer with an Intel Pentium D 3-GHz processor for simulations. To match a video clip of 100 frames to another clip, the dynamic programming method takes about 5.32 ms.

Table 2 shows the matching error probabilities. A matching error probability is defined as the rate at which an estimated matching function differs from the true matching function. The sequences are compressed by the H.264/AVC standard with QP 20 and 35, which yield peak signal-to-noise ratios (PSNRs) of about 42.0 and 30.8 dB on average, respectively. The sequences then go through frame removal attacks (R), frame insertion attacks (I), and frame shuffling attacks (S). The numbers of removed and inserted frames are randomly selected from 1 to 10 and from 1 to 3 ($ν_{R} = 10$ and $ν_{I} = 3$), respectively. An inserted frame is constructed by averaging two adjacent frames in the original sequences. For shuffling attacks, the number of pairs of adjacent frames is randomly selected from 1 to 3 ($ν_{S} = 3$), and each pair of frames swaps places.
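
The temporal attacks described above can be simulated with a short sketch that also returns the ground-truth matching function. Several details are assumptions, since the text does not fix them: the averaged frame is placed between the two frames it averages, the swapped pairs do not overlap, and frames are flat lists of pixel values. The function name `simulate_attacks` is hypothetical, and compression is not modeled.

```python
import random


def simulate_attacks(Y, n_remove, n_insert, n_swap, rng):
    """Apply removal, insertion, and adjacent-swap attacks to the clip Y.

    Returns (X, lam), where lam[j - 1] is the 1-based index in Y of X[j - 1],
    or 0 for an inserted frame.
    """
    l_Y = len(Y)
    removed = set(rng.sample(range(1, l_Y + 1), n_remove))
    insert_after = set(rng.sample(range(1, l_Y), n_insert))
    X, lam = [], []
    for k in range(1, l_Y + 1):
        if k not in removed:
            X.append(Y[k - 1])
            lam.append(k)
        if k in insert_after:
            # Inserted frame: average of the adjacent original frames Y_k and Y_{k+1}.
            X.append([(a + b) / 2 for a, b in zip(Y[k - 1], Y[k])])
            lam.append(0)
    # Swap up to n_swap non-overlapping pairs of adjacent frames.
    positions = list(range(len(X) - 1))
    rng.shuffle(positions)
    used, swaps = set(), 0
    for p in positions:
        if swaps == n_swap:
            break
        if p in used or p + 1 in used:
            continue
        X[p], X[p + 1] = X[p + 1], X[p]
        lam[p], lam[p + 1] = lam[p + 1], lam[p]
        used.update((p, p + 1))
        swaps += 1
    return X, lam


rng = random.Random(0)
Y = [[float(k), float(2 * k)] for k in range(1, 21)]  # 20 two-pixel frames
X, lam = simulate_attacks(Y, n_remove=3, n_insert=2, n_swap=1, rng=rng)
print(len(X), lam)
```

An estimated matching function can then be compared against `lam` to count matching errors.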

Table 2

Comparison of matching error probabilities (%) when a video is modified from its original through frame removal (R), insertion (I), shuffling (S), and data compression attacks.

Algorithm	H.264/AVC QP	R	I	R+I	R+I+S
Cheng⁵	20	0.00	0.46	0.66	N/A
Cheng⁵	35	0.03	1.75	2.12	N/A
LMMC	20	0.27	0.51	0.59	0.59
LMMC	35	1.71	3.43	4.11	4.12
Proposed	20	0.00	0.00	0.00	0.00
Proposed	35	0.00	0.09	0.10	0.17

We compare the performance of the proposed algorithm with those of Cheng’s algorithm⁵ and the LMMC approach. In Ref. 5, frame shuffling attacks are not considered, and the matching fails under these attacks. Since the weights for the two-frame integration model are overparameterized for insertion attacks, matching errors occur even when QP is as low as 20. The LMMC approach is also vulnerable to insertion attacks, since it simply searches for the best matching frame for each individual frame. On the other hand, the proposed algorithm provides much better performance by globally minimizing the total matching cost. For example, when QP is 35 and frame removal and insertion attacks are combined (R+I), the matching error probability of the proposed algorithm (0.10%) is about 41 times lower than that of LMMC (4.11%) and about 21 times lower than that of Cheng’s algorithm (2.12%). As QP increases, the quality of the modified videos degrades and the matching error probabilities get higher. However, the proposed algorithm provides significantly better performance than the conventional algorithms, even in these severe conditions.

5. Conclusion

We propose a temporal alignment algorithm for two video sequences, where one is modified from the other through frame removal, insertion, and shuffling attacks as well as data compression attacks. We define a cost function for global matching errors and develop a dynamic programming algorithm to minimize the cost efficiently. Experimental results show that the proposed algorithm provides a significantly lower probability of matching errors than the conventional algorithms.

Acknowledgment

This work was supported partly by the Ministry of Knowledge Economy, Korea, under the Information Technology Research Center support program supervised by the Institute of Information Technology Advancement (grant number IITA-2008-C1090-0801-0017) and partly by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (number R01-2008-000-20292-0).

References

1. Z. Li, L. Gao, and A. K. Katsaggelos, “Locally embedded linear subspaces for efficient video indexing and retrieval,” 1765–1768 (2006).

2. J. Yuan, W. Wang, J. Meng, Y. Wu, and D. Li, “Mining repetitive clips through finding continuous paths,” 289–292 (2007).

3. E. Grimson, P. Viola, O. Faugeras, T. Lozano-Perez, T. Poggio, and S. Teller, “A forest of sensors,” Proc. DARPA Image Understanding Workshop, 1, 45–50 (1997).

4. D. Delannay, C. de Roover, and B. Macq, “Temporal alignment of video sequences for watermarking systems,” Proc. SPIE, 5020, 481–492 (2003).

5. H. Cheng, “Temporal registration of video sequences,” 489–492 (2003).

6. Y. Y. Lee, C. S. Kim, and S. U. Lee, “Video fingerprinting based on frame skipping,” 2305–2308 (2006).

8. E. T. Lin and E. J. Delp, “Temporal synchronization in video watermarking,” IEEE Trans. Signal Process., 52, 3007–3022 (2004). https://doi.org/10.1109/TSP.2004.833866

10. V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Sov. Phys. Dokl., 10, 707–710 (1966).

11. G. Navarro, “A guided tour to approximate string matching,” ACM Comput. Surv., 33 (1), 31–88 (2001).

12. Y. Y. Lee, “Temporal feature modulation for video watermarking,” Seoul National University, (2008).
