# Evolutionary adaptive eye tracking for low-cost human computer interaction applications

**Yan Shen**

Inha University, 235 Yong-Hyun Dong, Nam Ku, Incheon, Republic of Korea

**Hak Chul Shin**

Inha University, 235 Yong-Hyun Dong, Nam Ku, Incheon, Republic of Korea

**Won Jun Sung**

Inha University, 235 Yong-Hyun Dong, Nam Ku, Incheon, Republic of Korea

**Sarang Khim**

Inha University, 235 Yong-Hyun Dong, Nam Ku, Incheon, Republic of Korea

**Honglak Kim**

Inha University, 235 Yong-Hyun Dong, Nam Ku, Incheon, Republic of Korea

**Phill Kyu Rhee**

Inha University, 235 Yong-Hyun Dong, Nam Ku, Incheon, Republic of Korea

*J. Electron. Imaging*. 22(1), 013031 (Mar 01, 2013). doi:10.1117/1.JEI.22.1.013031

#### Open Access

## Abstract

**Abstract.**
We present an evolutionary adaptive eye-tracking framework aiming for low-cost human computer interaction. The main focus is to guarantee eye-tracking performance without using high-cost devices and strongly controlled situations. The performance optimization of eye tracking is formulated into the dynamic control problem of deciding on an eye tracking algorithm structure and associated thresholds/parameters, where the dynamic control space is denoted by genotype and phenotype spaces. The evolutionary algorithm is responsible for exploring the genotype control space, and the reinforcement learning algorithm organizes the evolved genotype into a reactive phenotype. The evolutionary algorithm encodes an eye-tracking scheme as a genetic code based on image variation analysis. Then, the reinforcement learning algorithm defines internal states in a phenotype control space limited by the perceived genetic code and carries out interactive adaptations. The proposed method can achieve optimal performance by compromising the difficulty in the real-time performance of the evolutionary algorithm and the drawback of the huge search space of the reinforcement learning algorithm. Extensive experiments were carried out using webcam image sequences and yielded very encouraging results. The framework can be readily applied to other low-cost vision-based human computer interactions in solving their intrinsic brittleness in unstable operational environments.

## Introduction

In the last three decades, many eye-tracking approaches have been proposed aimed at diverse objectives and applications.^{1}^{,}^{2} Automatic eye tracking has been employed in many application areas, since it is strongly connected with human perception, attention, and cognitive states. Eye-tracking technology can be employed in unobtrusive and hands-free human computer interaction (HCI) as an effective tool of interaction and communication between computers and people. Even though a lot of effort has been invested in developing the technology, the problem of eye tracking is still very challenging, owing to changing operation environments with varying types of illumination, viewing angle, scale, individual eye shape, and jittering.^{3} Most successful eye-tracking techniques in commercial areas pose either the requirement of a high-cost image capturing device (e.g., camera and lens) or very limited operation within a strongly controlled situation.^{4}^{,}^{5} Recently, much interest in eye-tracking technology is being generated in the areas of HCI for web usability, advertising, smart TV, and mobile applications. However, the high cost or the limitation of conventional eye-tracking techniques can hardly be employed in low-cost commercial applications. Without lighting control and controlled situations, more available and general eye-tracking technology that reduces costs as well as simplifies the complicated initial user setup is necessary for a successful HCI application. Few researchers have tackled the more general vision technology for eye tracking in a loosely controlled environment with low-cost equipment.^{6}^{,}^{7} Most techniques, however, are far from practical in current and coming vision-based HCI applications due to their intrinsic brittleness in changing image-capturing environments.

One promising direction is an adaptive selection of different eye-tracking algorithm structures and the control of associated thresholds/parameters considering environment changes. However, the decision of an optimal algorithm structure with its adaptable thresholds/parameters is a very complicated problem in the area of image processing and computer vision in general.^{8} Some techniques for adaptive thresholds and flexible algorithm structures can be found in evolutionary algorithm (EA) approaches such as genetic algorithm (GA), genetic programming (GP),^{9} and other evolutionary learning methods. The EA that evolves in a manner resembling natural selection can be employed to solve complex problems even if its creator does not fully understand the phenomenon.^{10} However, the evaluation of a real-time system using the EA approach very often encounters a critical difficulty due to a significant amount of computation time coming from repetitive computations for many individuals over several generations. The high computational cost of the EA is a serious limiting factor. Another limitation of most evolutionary algorithms is their ambiguity in discriminating genotype and phenotype. In nature, the fertilized ovum cell undergoes a complex process known as embryogenesis to become a mature phenotype.^{11} The outward, physical manifestation of an organism is anything that is part of an observable structure, function, or behavior of a living organism. The observable characteristics of an organism internally coded, inheritable information is carried by the genetic code, the inherited instruction which carries acquired characters. An organism is usually exposed to sequences of events and reinforcements, and the same genotype encounters diverse environments where different behavioral actions are optimal. Learning is required in the whole lifetime of an individual that enforces a selection pressure, even though a correct phenotype from birth seems to be produced.^{12} Some successful evolutionary approaches combined with learning technologies can be found in robot controlling areas.^{13}

Most parameter control approaches require a precise model that describes interactions with environments. In many cases, however, it is very complicated or impossible to model the interaction for precise real-time task processing. In a real world situation, offline learning with an imprecise model frequently cannot produce an acceptable performance.^{14} The reinforcement learning (RL) approach is one of the promising approaches to solve such problems, since it is based on “learning by experience.” The RL approach is very effective in controlling a system by interacting with its external environment.^{15} A bunch of model-free value-function-based RL algorithms have been studied, such as the $Q$-learning, SARSA, and AC methods.^{12}^{,}^{16} The RL approach can optimize system parameters through interactions with a perceived environment.^{17}^{,}^{18} An RL-based eye-tracking scheme can learn state-action probabilities, provide performance-adaptive functionality without requiring a system model, and assure the convergence to an optimal action. However, a very large amount of repetitive trial and error is necessary in order to find the optimal performance, because of the curse of a huge search space in the visual tracking problem. The enormous amount of learning time required for a low-cost eye tracking task in varying environments prohibits the RL algorithm to be employed, since a huge search space for deciding actions requires a high computation time for learning by experience.

We propose an evolutionary adaptive framework for a high-performance eye tracking scheme for a low-cost vision-based HCI. The framework for efficient and robust eye tracking behaves adaptively by combining the evolutionary and interactive learning paradigms. It devises a performance-adaptive eye tracking scheme where possible configurations of eye tracking are represented as genotype and phenotype control spaces in terms of algorithm structure and thresholds/parameters. The EA explores to find an optimal eye tracking genetic code in the genotype control space where an illumination environment measured by image quality analysis and represented by image context.^{19}^{,}^{20} The EA can evolve relatively long-term behaviors than the reactive behaviors of the RL,^{14} while the RL provides quick interactive adaptation in a real world environment. The EA performs mainly in the genotype learning using a pre-collected training set in an offline fashion. The huge control space of algorithm structures and thresholds/parameters is partitioned in accordance with a perceived image context, and the EA determines a genetic code that fits for the perceived image context. The RL’s main role is to organize the phenotype of the scheme interactively. The RL algorithm defines its internal environmental state from the partitioned genotype control space and mainly performs a quick and interactive adaptation. In this way, the proposed method perceives the changing environment of the real world and learns an effective algorithm structure and associated thresholds/parameters for optimal eye tracking performance. The major contributions of this paper are summarized as follows:

- Instead of posing a high-cost image capturing device or very limited situation, a performance guarantee is achieved by formulating eye tracking into the dynamic control problem of optimizing an algorithm structure with associated thresholds/parameters.
- Tackling the dilemma of the EA and the RL algorithms, i.e., the intrinsic non-real-time property of the EA and the curse of huge search space of the RL algorithm, the framework can efficiently improve the performance of the eye tracking scheme in terms of accuracy and speed. The EA approach can reduce the curse of high dimensionality and the huge number of trials of the RL algorithm, and accelerate the convergence speed of the RL algorithm compared with learning from scratch.
- The proposed framework defines the RL internal environment in connection with the genotype control space instead of an input image as the RL environment in Refs.
^{8}and^{21}, which cannot fully utilize the interactive characteristics of the RL algorithm. While their approach can be thought to take an image sequence as stimuli, our approach associates the RL environment state with thresholds/parameters of the eye tracking scheme. As a result, the proposed framework can explore the consecutive and interactive threshold/parameter space influenced by a changing external environment, and thus fully take advantage of the RL algorithm by providing very flexible and high performance in the eye tracking scheme. - The framework can be readily applied to other vision-based HCI applications in solving their intrinsic brittleness under changing image capturing environments.

This paper is organized as follows. In Sec. 2, the evolutionary adaptive framework is discussed for controlling eye tracking parameter spaces. The genotype control space and evolution are discussed in Sec. 3. The phenotype manifestation using the RL algorithm is given in Sec. 4. In Sec. 5, the performance measures of eye tracking are discussed. Finally, the experimental results and concluding remarks are given in Secs. 6 and 7, respectively.

## Evolutionary Adaptive Framework for Eye Tracking Parameter Control

The rationale behind the proposed eye tracking scheme with evolutionary adaptive capability is the recurrence phenomenon of external stimulus, here image quality variation influenced by the changes of the image-capturing environment mainly from illumination, viewing angle, and individual eye shape. The proposed scheme explores an optimal algorithm structure and associated threshold/parameter space guided by the observation of image variations that mainly affects scheme performance. The proposed scheme consists of three modules, as shown in Fig. 1: the EA module, the eye-tracking module, and the RL module. It continuously perceives and interacts with the external environment, and adapts its algorithm structure, thresholds, and parameters. The proposed framework includes genotype evolution and phenotype adaptation. The first one performs long-term learning based on the EA, and the second one performs interactive phenotype learning using the RL algorithm.

The performance of an eye-tracking scheme is highly dependent upon the choice of algorithms in each step and their associated parameters and thresholds.^{20} It is best formulated into a nontrivial dynamic control problem of deciding on an eye-tracking algorithm structure and associated thresholds and parameters. Considering the primary cause of degrading the eye-tracking performance is image quality variation due to illumination, pose, and eye shape, we introduce two level intelligent control mechanisms to optimize the eye-tracking scheme. The first level control mechanism employs EA for the genotype evolution, which determines a possible optimal scheme configuration for a perceived external environment as an image context, where an external environment measured by k-means clustering and image quality analysis.^{19}^{,}^{20} The second one takes advantage of the RL algorithm for a phenotype manifestation of the genetic code.

The eye area is located using the face location and eye area location methods in ^{20}, and eye tracking is processed based on the eye tracking method in ^{22}. The initial eye centroid is estimated by preprocessing, feature extraction, and a partial Hough transformation. Preprocessing is carried out using an adaptively selected algorithm substructure and its associated thresholds, and feature extraction is performed to produce a contour or edge image. The contour or the edge image is processed by the partial Hough transform with adjusted arc angle parameters of the iris boundary to determine an optimal tracking point. Finally, the Kalman filter is applied, and the next eye centroid is predicted. The above steps are repeated until eye tracking is terminated. The eye-tracking control space, i.e., the possible algorithm structures and the ranges of thresholds/parameters, is determined based on prior knowledge or experiments considering the tradeoff between tracking accuracy and execution time constraints. The estimation steps of the eye centroid, the eye centroid prediction by the Kalman filter,^{23}^{,}^{24} and their control space are discussed in the remainder of this section.

The preprocessing step considered here consists of histogram equalization, Retnix with one threshold,^{25} and end-in contrast stretching with two parameters.^{26} For example, possible algorithm structures of the preprocessing step are histogram equalization, Retnix, Retnix with histogram equalization in a serial combination, or the end-in contrast stretching. The feature extraction step can be binarization, the Canny edge detection algorithm,^{27} binarization and the Canny in parallel combination with an AND operation, or binarization and the Canny in parallel combination with an OR operation, where binarization and the contour in serial combinations is denoted by binarization for simplicity. Feature extraction is selectively carried out using one of the above four algorithm structures with different thresholds. Binarization has one threshold, and the Canny algorithm has two thresholds for edge detection, where the first is related to edge linking, and the second is related to finding the initial segments of the strong edges.

In our eye tracking system, the iris boundary is approximated by a circle since the proposed method aims at real-time performance using low resolution images. The algorithm detects the center of the eye (iris) by calculating the outer boundary of the iris image. We adopt a modified Hough transform method, called partial Hough transform, to extract features that are less affected by noise and eyelid occlusions. Two portions of the circle (partial circles) are matched instead of the entire circle to minimize the occlusion effects of the upper and lower eyelids. A four-dimensional control space for eye centroid tracking includes the angle values indicating the partial iris boundaries of interest, i.e., the partial circles (see Fig. 2). The partial Hough transform algorithm for circle fitting can then be described as follows. The iris outer circle can be represented as

The two stage algorithm formulated in ^{23} is employed: the first stage finds the eye circle center, and the second stage estimates normal directions to the outer iris boundary points on the partial circles intersecting the eye centroid. Let $(x,y)$ be the gradient direction point on the outer iris boundary. A mapping from $(x,y,\varphi )$ triples into the center parameters $(a,b)$ produces a straight line, where $\varphi $ is the angle of the gradient direction. The intersection of many of these lines identifies the coordinates of the eye centroid. The relation between $(x,y,\varphi )$ and $(a,b)$ is given as

The Kalman filter^{28} is a recursive approach for the discrete linear filtering problem by estimating a state process that minimizes the squared error.^{29}^{,}^{30} In this paper, the Kalman filter is employed to approximate the tracking of eye centroids formulated as a Bayesian tracking problem.

Assuming that the motion of the eye centroid has a constant velocity $Ft$, the covariance matrices of the state transition model and the process noise model can be determined as follows:^{29}

^{31}The parameters $r$ and $q$ can be pre-computed by running the filter offline or determining the state-state values.

^{32}However, the noise does not remain constant in eye tracking due to the uncertainties of dynamically changing environments. In this context, the parameters $r$ and $q$ are evolved by the EA as discussed in this session and adapted by the RL method in the next section in order to achieve optimal performance in our eye-tracking scheme.

## Genotype Evolution

In this section, we discuss how to encode the variation of the algorithm structure and associated thresholds and parameters of the eye-tracking scheme into a genetic code so that the genotype control space is explored for evolutionary optimization. The genotype of the tracking scheme is denoted by genetic codes in the genotype control space, each of which represents an algorithm structure and thresholds/parameters, and thus an optimal algorithm configuration that has evolved for a given external environment is denoted by a genetic code. The algorithm structure of the tracking scheme is divided into the preprocessing, feature extraction, partial Hough transform, and Kalman prediction as discussed previously. Given an external environment, an optimal algorithm structure with associated thresholds/parameters is determined in accordance with a perceived image context.

In general, a context can be any information that affects the performance of a system of interest. Image context might be affected by lighting direction, brightness, contrast, and spectral composition. The concept of an image context with image quality analysis technology can improve system performance.^{33} Recently, image quality analysis methods have been successfully applied to diverse applications such as image storage, compression, communication, display, segmentation, and recognition,^{34} as well as used to decide selective preprocessing for an adaptive illumination normalization using adaptive weighting parameters.^{19}^{,}^{20} In this paper, an image quality analysis is adopted for the categorization of the variation of input image quality that is influenced by the external environment. For simplicity of discussion we employ only input image illumination changes as a clue to perceive the changes of an external environment. The Haar wavelet-based Bayesian classification is performed for face and eye area detection^{20} before eye tracking begins.

Each substructure is constructed by combining action units. An algorithm structure can be divided into the substructure sequence, and each substructure is configured by the combination of action units which are the basic building blocks of the system. Let $Z$ be the set of thresholds and/or parameters of an action unit (AU). Each action unit has associated thresholds and/or parameters if they are required, and are denoted as follows:

$\Psi /\theta $ | Algorithm substructure | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Preprocessing | Feature extraction | Partial Hough transform | Kalman | |||||||||||

$\Psi Pre$ | $\theta RX$ | $\theta ECS1$ | $\theta ECS2$ | $\Psi FE$ | $\theta BN$ | $\theta CN1$ | $\theta CN2$ | $\theta PHT1$ | $\theta PHT2$ | $\theta PHT3$ | $\theta PHT4$ | $\theta KF1$ | $\theta KF2$ | |

No. of bits | 2 | 4 | 4 | 4 | 2 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |

$Rmin$ | 181 | 32 | 160 | 68 | 4 | 32 | 148 | 230 | 278 | 0 | 0.005 | 0.05 | ||

$Rmax$ | 245 | 96 | 224 | 188 | 20 | 64 | 180 | 262 | 310 | 32 | 0.015 | 0.15 |

If the feature extraction step is determined to have four types of algorithm structure, it is denoted by its 2-bit vector of the genetic code as shown in Table 2.

Bit vector of $\Psi FE$ | 00 | 01 | 10 | 11 |
---|---|---|---|---|

Feature extraction algorithm substructure | Binarization | Canny | Binarization $||AND$ Canny | Binarization $||OR$ Canny |

Let $\theta A$ be a threshold (parameter) of an action unit $A$. The pivot $\theta Apivot$ is defined as the central value of a threshold range of $\theta A$ in the phenotype control space. Let $\theta A\u2009min$ and $\theta A\u2009max$ denote the minimum and the maximum threshold values of $\theta A$, respectively. The interval of the threshold pivots between adjacent bit vector, $\theta A\alpha $ is calculated by

$K$ (subgenetic code of $\theta BN$) | $\theta BNpivot$ ($K$) binarization pivot | $K$ (subgenetic code of $\theta BN$) | $\theta BNpivot$ ($K$) binarization pivot |
---|---|---|---|

0000 | 68 | 1000 | 132 |

0001 | 76 | 1001 | 140 |

0010 | 84 | 1010 | 148 |

0011 | 92 | 1011 | 156 |

0100 | 100 | 1100 | 164 |

0101 | 108 | 1101 | 172 |

0110 | 116 | 1110 | 180 |

0111 | 124 | 1111 | 188 |

## Phenotype Manifestation Using the RL Algorithm

As discussed in the previous section, the EA evolves the system configuration in the dynamic control space of the algorithm structure and its associated thresholds/parameters in terms of genotype representation so that they are put together into the eye tracking scheme to optimize performance. The role of the EA is to find an appropriate scheme genotype, i.e., a genetic code representing an optimal eye tracking configuration based on a perceived external environment, (i.e., a perceived image context here) and the fitness evaluation. Since real world external environment changes are not fully predictable in the EA evolution step, the RL algorithm is employed to seek a precise phenotype manifestation of eye tracking in terms of thresholds/parameters and the reward. Combining the EA evolution and the RL adaptation capabilities, not only can the curse of huge control space of possible eye tracking configurations in applying the RL algorithm be handled, but also the difficulty of the EA in a real-time adaptation.

In our approach, the dynamic control space of the EA is not the same as that of the RL algorithm, but they are interrelated through the performance optimization in a real world environment. A genetic code, generated by the EA for each image context, defines a related phenotype control space which describes by the ranges of thresholds and/or parameters. For simplicity and real-time consideration, the algorithm structure in a genetic code will not be changed in the phenotype manifestation, and it is also intuitively suitable with a real world evolutionary adaptation phenomenon. The RL algorithm treats the adaptation of the thresholds/parameters as a consecutive decision process, i.e., the RL state transition, to obtain an optimal configuration of eye tracking. Regarding the eye-tracking optimization as the events of consecutive trials for deciding the RL state observed in discrete time, a Markov decision process (MDP) model which is necessary to use the RL algorithm need to be justified. In theory, the RL algorithm cannot be applied exactly if the Markov property is not satisfied. However, the RL application developed based on the Markov assumption is still valid in many cases, to approximate and understand system behavior. Even when a state signal does not satisfy the Markov property, the RL algorithm can be employed in more complex and realistic non-Markov cases, and the RL algorithm has been successfully applied in many cases by approximating the state as a Markov state.^{16}

In general, an MDP approach for a RL algorithm requires five-tuples: states, actions, transition function, rewards and discount factor.^{35} The action can be any decision/behavior that the RL agent needs learn, and a state can be any factor that influences the agent’s decision making. In this paper, each phenotype manifestation, denoted by the phenotype control space of associated thresholds/parameters defines a state in the RL internal environment (see Fig. 1), and interacts with the RL agent to explore an optimal threshold/parameter set maximizing the eye-tracking performance. Here, the action is defined as the move to a next state that decides next phenotype manifestation in the phenotype control space.

The produced action activates a state transition of the internal environment from the current internal state in the phenotype control space into a new state, and the RL agent receives a reinforcement signal from the internal environment. One can notice that the proposed approach is distinguished from Refs. ^{8} and ^{21} where an image itself is modeled as an RL state, and the parameter decision is as an action. Their approach can be thought to take an image sequence as stimuli and to produce a segmented image. They cannot fully utilize the interactive characteristics of the RL algorithm in full, since the agent can alter the state at each time-step by taking actions.^{36} On the other hand, our approach relies on consecutive and interactive threshold/parameter transitions influenced by changing external environments (see Fig. 1), which are modeled as a Markov decision process. In this paper, we will stay with the concepts of environment, agent, reward, and action in a usual RL. However, we use the terms internal environment and internal reward to avoid a possible confusion between external environment and external feedback. On the other hand, the internal environment can be interpreted as an intrinsic motivation used in cognitive science, where intrinsically motivated behavior is emphasized as a vital factor for intellectual growth.^{37} RL action can also be interpreted as internal decision instead of an action as a human’s decision to move his muscles in a certain direction.

It can be interpreted that the next values of thresholds/parameters are mainly decided by the current internal state, i.e., the current values of thresholds/parameters of the tracking scheme. The RL internal states are members of a finite set denoting all possible eye tracing scheme configurations in terms of associated thresholds/parameters in the phenotype control space, which is related and limited by a genetic code influenced by the external environment (see Fig. 1). Thus, the RL internal state transits randomly from one state to another within the limited control space in discrete time-steps.

In the RL algorithm, an episode is defined to separate the agent-environment interaction into subsequences. An episode can be thought as a temporal unit of repeated interactions, and an episode has a standard starting state and a terminal state.^{16} The RL starting state is declared either when the eye-tracking confidence is degraded below a predefined criterion, say the RL starting threshold $\eta $, or an alert signal is received from the external environment through the external feedback (see Fig. 1). The RL terminal state is declared when the eye-tracking confidence reaches a predefined criterion, say RL terminal threshold $\zeta $, or it does not improve any more. The current internal state is the estimation of the time-discounted amount of the reward to be expected starting from that internal state and continuing to act according to its policy.

Considering real world constraints on computation resources, the set of internal states in the phenotype control space limited by a genetic code determined by a perceived image context

Substructure | Preprocessing | Feature extraction | Partial Hough transform | Kalman | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|

Threshold/parameter | $\theta RX$ | $\theta ECS1$ | $\theta ECS2$ | $\theta BN$ | $\theta CN1$ | $\theta CN2$ | $\theta PHT1$ | $\theta PHT2$ | $\theta PHT3$ | $\theta PHT4$ | $\theta KF1$ | $\theta KF2$ |

Discrete range precision | 8 | 8 | 8 | 8 | 8 | 8 | 4 | 4 | 4 | 4 | 16 | 16 |

A finite set of actions is available in each internal state, and $\alpha t\u2208A(st)$ indicates the decision action of changing the thresholds/parameters of the eye tracking scheme at time-step $t$. Let $at=[\zeta t,1,\zeta t,2,\u2026,\zeta t,p]$ be an action at time-step $t$, where $\zeta t,i$ is $\u2212dt,i$, 0, or $+dt,i$ meaning that the negative direction movement of $d$ units, stationary, or positive direction movement of $d$ units in the $i$’th coordinate (index) of the phenotype control space at time-step $t$.

Given threshold/parameter values, the tracking scheme will generate possible eye centroids within a search window. Let $st=[\theta t,1,\theta t,2,\u2026,\theta t,p]$ be a phenotype control vector at time-step $t$, where the dimension of the threshold/parameter space is $p$ and $\theta t,i$ is an RL index of the lookup table representing a threshold or a parameter. The index has the discrete range precision. For example, $\theta BN$ has the discrete range precision “8” in Table 4, meaning that the index value of $\theta BN$ is one of the values in ${0,1,\u2026,7}$. Table 4 illustrates the state vector of the proposed eye tracking scheme with its discrete range precision.

Recall that the pivot $\theta Apivot$ is defined as the central value of a threshold range of $\theta A$ in the phenotype control space. The possible thresholds (parameters) of a RL state are decided in the RL lookup table as illustrated in Table 5 for the binarization threshold. Let $\chi $ be the interval between adjacent RL thresholds (parameters) for $\theta Apivot$. The RL lookup table is generated using the following equation:

The internal reward function which is indirectly influenced by the external feedback defines the goal in our reinforcement learning (see Fig. 1). It maps each internal environment state and action pair to an expected internal reward. The reinforcement learning specifies how the agent alters its threshold and/or parameter decision policy as a result of its experience in eye tracking. The goal is to maximize the total amount of expected internal rewards over the long run. In this context, the immediate reward of the proposed RL algorithm is defined as follows: if the eye tracking is successful using Eq. (29) for each image frame, the high score is returned as the internal reward. Otherwise, the low score is returned as the reward where the high score and the low score are determined experimentally.

The aim of the RL algorithm is to optimize its long-term performance using the feedback of one-step immediate performance of eye tracking. We choose a temporal difference (TD) learning approach since it can provide an online, fully incremental learning capability, and since the TD approach learns from each transition regardless of what subsequent actions are taken.^{16} The TD method also learns directly by experience without any knowledge of environment dynamics such as the model of its reward and next-state probability distributions. Two popular TD algorithms are SARSA and $Q$-learning, and they are different only in the estimation policy, i.e., on-policy and off-policy. Considering the real-time requirement of eye tracking, Watkins’s one-step $Q$-learning^{12} is selected which is relatively faster to converge than SARSA.^{16} The one-step $Q$-learning algorithm estimates $Q*$ which makes the action-value functions, called $Q$-functions, and is an important algorithm of reinforcement learning with its simplicity, effectiveness, and model-free characteristic.^{38} The one-step $Q$-learning method is a common off-policy temporal difference control algorithm. $Q$-learning accumulates optimal actions by evaluating action–value $Q(s,a)$ to construct sate-action table. The action-value updating equation of the one-step $Q$-learning is given as follows:^{12}

^{16}

^{,}

^{36}

The policy is the rules for selecting actions given the values of the possible next states, and the greedy policy is deterministic and takes the action with the maximum $Q$-value for every state, i.e.,

^{36}as follows: the joint action set $A=A1\xd7\cdots \xd7Ai\xd7\cdots \xd7An$, where $Ai$ is the set of actions of the $j$’th agent, the state transition probability function $f:\u2009\u2009S\xd7A\u2192S$, and the reward function $ri:\u2009\u2009S\xd7A\xd7S\u2192R$. The $Q$-learning equation and the greedy policy at time-step $t$ are defined as follows:

^{16}Our multi-agent $Q$ learning algorithm is given in Algorithm 1.

Repeat |

Step 1. Choose $a$ from $s$ using $\epsilon $-greedy policy. |

Step 2. Take the actions, observe an immediate reward $r$ by calculating the internal reward (see Algorithm 2), and observe the new internal state $s\u2032$. |

Step 3. Calculate the following: $Qt+1(s,a)=Qt(s,a)+\alpha [rt+1+\gamma maxa\u2032\u2009Qt(s\u2032,a\u2032)\u2212Qt(s,a)]$ |

Step 4. $s\u2190s\u2032$ |

Until the step limitation per frame $M$ is reached or the success tracking criteria is satisfied [see Eq. (29)]. |

The whole evolutionary adaptive process for low-cost eye tracking is given in Algorithm 2. The framework learns, through repeated interactions with each other, to autonomously optimize eye-tracking performance at run-time. The limited system memory and the real-time delay constraints of the eye-tracking complementary accelerate the learning algorithm that exploits the perceived environmental knowledge about a system’s external dynamics.

Repeat |

Step 1. Perceive the image context. |

Step 2. Find an optimal genetic code for the perceived image context, decide the phenotype control space, and initialize all $Q(s,a)$ values arbitrarily. |

Step 3. Repeat for consecutive image frames |

Step 3.1. Perform $Q$-learning by calling Algorithm 1and decide the optimal thresholds/parameters. |

Step 3.2. Perform the eye tracking until $\tau >\zeta $. |

Step 4. If $\tau >\eta $, go to Step 3. |

Step 5. If $\tau \u2264\delta $ (i.e., the eye tracking confidence $\tau $ is below the EA restarting threshold frequently where $\delta (<\eta )$, go to Step 1. |

Until termination. |

## Performance Measures of Eye Tracking from Image Sequences

This section discusses the offline measures of eye-tracking performance such as recall, precision, harmonic mean, and the real-time measure, called real-time tracking confidence $\tau $. Given $N$ image frame sequence in a video be denoted as $V=(I1,I2,\u2026,IN)$, the sequence of ground truth iris area will be compared with that of tracked iris areas for offline measures. Let $X=(x1,x2,\u2026,xN)$ and $Y=(y1,y2,\u2026,yN)$ be the sequence of ground truth iris areas and that of tracked iris areas, respectively. Owing to the evaluation method from information retrieval research, tracking performance can be measured by recall and precision.^{39} The overlapped area divided by the union of the ground truth and tracked (detected) iris area is used for measuring a successful tracking in an image frame $Ii$.^{40} The inner circle area of the iris is excluded to prohibit the effect of possible light source reflection as shown in Fig. 3.

^{40}a tracking is considered to be correct if its overlap with ground truth bounding box is larger than 50%, i.e., $tm=0.5$. In this paper, $tm$ is decided as 0.6 since we require a more strict constraint for measuring the successful eye tracking.

**F3 :**

Ground truth and overlapped iris areas for measuring a successful tracking, where the inner circle area of the iris is excluded to prohibit the effect of possible light source reflection in the center of iris: (a) the ground truth model of the iris, and (b) the overlapped iris area definition as the union of the ground truth and the detected iris areas.

Regarding the situations where a ground truth iris area cannot be defined due to heavy noise, and a tracked iris area cannot be found due to a shortage of the tracking algorithm, we define $\Vert \xb7\Vert $ operations as follows:

^{41}and the details can be found in Sec. 6.2.

Similarly, the average tracking rate (AR) for the whole video sequence $V=(I1,I2,\u2026,IN)$ is defined as follows:

## Computational Experiments

We have tested the tracking performance of the single agent $Q$-learning and multiagent $Q$-learning algorithms using the evolutionary adaptive eye tracking framework. The EA module decides the phenotype algorithm structure and genetic code using the training set of image sequences, where the phenotype algorithm structure of the preprocessing is determined as the histogram equalization, and that of the preprocessing is determined as binarization. Thus, the tracking algorithm structure is determined as the histogram equalization in the preprocessing, binarization in the feature extraction, the partial Hough transform, and the Kalman filter in the EA module. That is, the phenotype algorithm structure is determined and represented as

The image qualities of the test video set is generated as relatively good, moderate, and bad to investigate the learning adaptation capability of the proposed method, where three image categories are defined as three different image contexts. Here, image quality indicates not only the variation of lighting condition, but also that of the eye movement speed and pose. Some examples are given in Fig. 4. Each image sequence has around 2000 image frames, among which half of the image frames are used for GA training, and the remaining frames were used for the test of eye tracking performance.

**F4 :**

Some examples of image frames used in analyzing the effects of image quality variance of the tracking system: (a) images with good quality illumination, (b) images with moderate quality illumination, and (c) images with bad quality illumination, where yellow circles indicate the successes in eye tracking and red circles indicate the failures.

We created seven videos characterized in terms of the illumination, eye movement speed, and pose movement speed as shown in Table 6, where each video has 1000 frames. In our experiments, the frames with blinking and facial expressions are excluded in the performance evaluation since the aim of our eye tracking is focused on HCI applications where a user’s intentional eye movement with a positive attitude can be assumed. Some portions of the test videos are illustrated in Fig. 5.

Illumination | Eye movement | Pose movement | |
---|---|---|---|

Video 1 | Bad-moderate-good | Moderate | Moderate |

Video 2 | Bad-bad-bad | Slow | Fast |

Video 3 | Good-good-good | Fast | Moderate |

Video 4 | Moderate-bad-good-bad-good | Moderate | Moderate |

Video 5 | Good-good-good | Fast | Moderate |

Video 6 | Good-good-good | Slow | Slow |

Video 7 | Moderate-moderate-moderate | Moderate | Fast |

In this subsection, the effect of the RL parameters and other experimental parameters such as the matching threshold and the real-time matching threshold are analyzed and determined empirically. The RL parameters $\epsilon \u2208(0,1]$ from $\epsilon $-greedy policy,^{16} learning rate $\alpha \u2208(0,1]$ and discounting rate $\gamma \u2208(0,1]$ from the action-value equation of $Q$-learning [see Eq. (28)] were investigated. Figure 6 shows average performance of the $\epsilon $-greedy policy with different values of $\epsilon =0.0$, $\epsilon =0.1$, and $\epsilon =0.15$, where the vertical and horizontal axes are real-time tracking confidence and image frame sequence, respectively. These data are the averages over ten volunteers’ videos, where $\alpha $ and $\gamma $ are set to 0.2 and 0.95, respectively. The parameter $\epsilon $ affects the ratio between exploitation and exploration when a next action is decided. One can notice that $\epsilon =0.15$ can achieve faster and better convergence characteristics in real-time eye-tracking confidence.

Figure 7 shows the effect on the eye-tracking performance with different values of $\alpha =0.2$, 0.1, 0.05, 0.3, and 0.01 by fixing $\epsilon =0.15$ and $\gamma =0.95$. One can notice that $\alpha =0.2$ gives fast convergence characteristics in real-time tracking confidence.

Figure 8 shows the effects of RL discounting rate with different values, i.e., $\gamma =0.9$, 0.95, and 0.99 by setting $\epsilon =0.15$ and $\alpha =0.2$.

Considering the real-time constraint, we test the effect of $Q$-learning step limitation per individual image frame. Figure 9 shows the real-time tracking confidences where the maximum number of $Q$-steps per frame is restricted to 15, 20, and 30. As the iteration is reduced the tracking performance suffers, but the eye-tracking speed can be increased.

Figure 10 shows the visualization of the real-time tracking confidence with different values of $\delta =10$, 20, and 30 by setting $\epsilon =0.15$, $\alpha =0.2$, and $\gamma =0.95$. Since the sufficiently good convergence characteristics of the real-time tracking confidence can be observed starting from $\delta =20$, we selected $\delta =20$ considering the time overhead.

Figure 11 shows the rates of the false positive and false negative with different values of real-time matching threshold $tm\u2032=0.575$, 0.6, 0.625, 0.65, and 0.675 by setting $\epsilon =0.15$, $\alpha =0.2$, and $\gamma =0.95$. We decide $tm\u2032$ as 0.625 since the matching threshold can provide approximately equal values of the false positive and false negative.

The experiments are carried out using the RL parameters of $\epsilon =0.15$, $\alpha =0.2$, and $\gamma =0.95$, and the experimental parameters $\delta =20$ in the following discussions.

Extensive experiments were carried out using various image sequences. The performance was estimated using the seven types of videos gathered from different environments. The EA module decided the pivot of the phenotype control parameters, and the discrete phenotype ranges are determined and used as the RL exploration ranges. While large RL ranges are expected to give high performance, they might require much computation overhead, and the frame processing rate will be decreased, i.e., many input frames are skipped. If the frame rate for eye tracking is reduced too much, the continued variation assumption of illumination might not be valid. The algorithm structures of preprocessing and feature extraction are decided as the binarization and the histogram equalization in the EA module, respectively. The Kalman filter parameters were fixed for the simplicity of analysis. We focused on the adaptation functionalities of the feature extraction threshold, i.e., the binarization threshold and the four partial Hough transform parameters which mainly affect the eye tracking performance. We examined the trade-off the performance and computation overhead by adjusting the maximum number of $Q$-learning steps per an image frame. Table 7 shows the performance comparisons of the proposed method using the single agent $Q$-learning to the nonadaptation and the adaptation using the EA-only method. In the non-adaptation method, the eye-tracking scheme uses predetermined threshold and control parameters. In the EA-only method, the genetic code decided by training data is used as the threshold and the control parameters. In the combination of EA and RL methods, the RL is applied to decide the control threshold/parameters in the phenotype control space using the pivot values determined by the EA. The real-time computation overhead of the EA-only method is almost equal to the non-adaptation method since the EA evolution is carried out before the eye tracking is performed.

Image sequence | Frames $A/B$ | Non-adaptation $C|fps$ | EA only $C|fps$ | $EA+RL(S,15)$$C|fps$ | $EA+RL(S,20)$$C|fps$ | $EA+RL(S,30)$$C|fps$ |
---|---|---|---|---|---|---|

Video 1 | $825/1000$ | $524|63.4$ | $557|62.5$ | $754|33.0$ | $774|31.8$ | $790|29.4$ |

Video 2 | $948/1000$ | $734|63.2$ | $789|63.1$ | $895|45.0$ | $916|43.8$ | $937|43.0$ |

Video 3 | $973/1000$ | $631|63.2$ | $665|62.9$ | $835|27.0$ | $855|24.3$ | $870|19.4$ |

Video 4 | $976/1000$ | $535|57.4$ | $701|57.2$ | $891|39.7$ | $922|37.1$ | $938|33.6$ |

Video 5 | $964/1000$ | $631|64.2$ | $702|63.9$ | $778|20.9$ | $788|17.8$ | $827|16.6$ |

Video 6 | $983/1000$ | $812|54.6$ | $826|53.8$ | $919|39.3$ | $934|37.0$ | $944|35.1$ |

Video 7 | $992/1000$ | $725|62.0$ | $792|61.7$ | $863|30.0$ | $893|27.3$ | $926|25.0$ |

Table 8 shows the trade-off between eye-tracking performance and computation overhead of the proposed multi-agent $Q$-learning approach by changing the computation time constraints. Note that the multiagent $Q$-learning can improve eye-tracking performance while increasing computation overhead.

Image sequence | Frames $A/B$ | $EA+RL(M,15)$$C|fps$ | $EA+RL(M,20)$$C|fps$ | $EA+RL(M,30)$$C|fps$ |
---|---|---|---|---|

Video 1 | $825/1000$ | $810|13.2$ | $813|12.2$ | $817|11.8$ |

Video 2 | $948/1000$ | $937|18.1$ | $941|16.8$ | $945|17.2$ |

Video 3 | $973/1000$ | $893|10.8$ | $898|9.3$ | $903|7.8$ |

Video 4 | $976/1000$ | $951|15.6$ | $953|14.2$ | $956|13.5$ |

Video 5 | $964/1000$ | $843|8.36$ | $845|6.8$ | $858|6.7$ |

Video 6 | $983/1000$ | $951|15.7$ | $959|14.5$ | $957|14.0$ |

Video 7 | $992/1000$ | $942|12.0$ | $952|10.5$ | $957|10.0$ |

The performance evaluations using precision, recall, harmonic means, and average tracking rate are given in Table 9 for the one-step $Q$-learning using a single agent and in Table 10 for the multiagent $Q$-learning, respectively. Table 10 shows that $EA+RL$ with 30 $Q$-learning steps $[EA+EL(S,30)]$ achieved the best performance, however, it requires the highest computation overhead. $EA+RL(S,20)$ achieved moderate performance with moderate computation overhead.

Image sequence | No-adaptation $PR|RE|HM|AR$ | EA –only $PR|RE|HM|AR$ | $EA+RL(S,15)$$PR|RE|HM|AR$ | $EA+RL(S,20)$$PR|RE|HM|AR$ | $EA+RL(S,30)$$PR|RE|HM|AR$ |
---|---|---|---|---|---|

Video 1 | $0.89|0.64|0.73|0.63$ | $0.91|0.64|0.75|0.76$ | $0.92|0.85|0.89|0.91$ | $0.93|0.88|0.90|0.93$ | $0.93|0.89|0.91|0.95$ |

Video 2 | $0.93|0.78|0.85|0.77$ | $0.95|0.79|0.86|0.83$ | $0.93|0.90|0.92|0.94$ | $0.95|0.93|0.93|0.96$ | $0.95|0.94|0.94|0.98$ |

Video 3 | $0.92|0.67|0.75|0.64$ | $0.94|0.67|0.78|0.68$ | $0.95|0.79|0.86|0.85$ | $0.95|0.84|0.89|0.87$ | $0.96|0.84|0.89|0.89$ |

Video 4 | $0.73|0.62|0.70|0.61$ | $0.78|0.64|0.70|0.71$ | $0.75|0.70|0.72|0.91$ | $0.78|0.73|0.75|0.94$ | $0.78|0.74|0.75|0.96$ |

Video 5 | $0.84|0.53|0.65|0.65$ | $0.84|0.55|0.67|0.72$ | $0.79|0.65|0.71|0.80$ | $0.89|0.69|0.78|0.81$ | $0.87|0.72|0.79|0.85$ |

Video 6 | $0.90|0.79|0.84|0.82$ | $0.93|0.80|0.86|0.84$ | $0.94|0.89|0.91|0.93$ | $0.94|0.91|0.90|0.95$ | $0.94|0.93|0.92|0.96$ |

Video 7 | $0.92|0.41|0.56|0.73$ | $0.94|0.42|0.58|0.79$ | $0.96|0.76|0.85|0.86$ | $0.97|0.80|0.87|0.90$ | $0.97|0.84|0.90|0.93$ |

Mean | $0.87|0.63|0.73|0.69$ | $0.90|0.64|0.74|0.76$ | $0.89|0.79|0.84|0.89$ | $0.92|0.83|0.86|0.91$ | $0.91|0.84|0.87|0.93$ |

Image sequence | $EA+RL(M,15)$$PR|RE|HM|AR$ | $EA+RL(M,20)$$PR|RE|HM|AR$ | $EA+RL(M,30)$$PR|RE|HM|AR$ |
---|---|---|---|

Video 1 | $0.93|0.90|0.91|0.98$ | $0.93|0.91|0.92|0.98$ | $0.93|0.92|0.92|0.99$ |

Video 2 | $0.94|0.94|0.94|0.99$ | $0.94|0.94|0.94|0.99$ | $0.95|0.94|0.95|0.99$ |

Video 3 | $0.95|0.86|0.90|0.92$ | $0.95|0.87|0.91|0.92$ | $0.96|0.88|0.92|0.92$ |

Video 4 | $0.77|0.75|0.76|0.97$ | $0.77|0.75|0.76|0.97$ | $0.78|0.76|0.77|0.97$ |

Video 5 | $0.87|0.73|0.80|0.87$ | $0.88|0.75|0.81|0.87$ | $0.88|0.76|0.81|0.89$ |

Video 6 | $0.94|0.91|0.93|0.96$ | $0.94|0.92|0.93|0.97$ | $0.94|0.92|0.93|0.97$ |

Video 7 | $0.96|0.86|0.90|0.94$ | $0.96|0.80|0.91|0.95$ | $0.97|0.89|0.92|0.96$ |

Mean | $0.90|0.85|0.88|0.95$ | $0.91|0.85|0.88|0.95$ | $0.92|0.87|0.89|0.96$ |

The multiagent $Q$-learning could achieve better performance than the one-step $Q$-learning, while adding more computation overhead. In Table 10, the three-agent $Q$-learning with different time constrains are investigated where the total number of $Q$-learning steps per an image frame of each agent is restricted from 15, 20, and 30.

The proposed method can guarantee optimal real-time performance by compromising the trade-off between the performance and the computation overhead. The tradeoffs of execution speed and tracking performance among different system architectures are compared in Fig. 12 using the average tracking rate, precision, recall, and harmonic mean, respectively. The horizontal axes indicate the time overhead using fps. $Q$-learning parameters are set to $\epsilon =0.15$, $\alpha =0.2$, $\gamma =0.95$, and $\delta =20$. Notice that higher performance can be obtained if the time overhead is increased.

**F12 :**

The tradeoffs of the execution speed and the tracking performance among different system architectures where $Q$-learning parameters are set to $\epsilon =0.15$, $\alpha =0.2$, $\gamma =0.95$, and $\delta =20$: (a) the average of real-time eye tracking, (b) the precision, (c) the recall, and (d) the harmonic mean. The horizontal axes indicate the time overhead using fps and AR indicates the average tracking rate, PR the precision, RE the recall, and HM the harmonic mean.

## Concluding Remarks

Combining the EA approach and the RL algorithm, i.e., the difficulty of the EA in a real-time adaptation and the curse of huge search space of the RL algorithm, the proposed method can manage to achieve the performance of the eye-tracking scheme in terms of accuracy and speed. The state-of-the-art of eye-tracking techniques relies on either high cost and high quality cameras/equipment or a strongly controlled situation. To overcome limitations, possible configurations of the eye-tracking scheme are explored in the genotype and phenotype control spaces by the EA and the RL, respectively. The EA is responsible for exploring the genotype control space using a collected training image set, and the RL algorithm organizes an evolved genotype into a reactive phenotype that optimizes the eye-tracking scheme in real-time performance. The EA encodes and stores the learned optimal scheme in terms of a genetic code which denotes an eye-tracking algorithm structure and associated thresholds/parameters. The RL algorithm defines its internal environmental states in a phenotype control space and mainly performs interactive learning and adaptation in real-time.

The main contribution of the proposed method compared to other popular adaptive systems is to provide a design technique that can manage an intrinsic uncertainty of vision based HCI into a tractable computation space for controlling the algorithm structure and associated thresholds/parameters. It can optimize the performance of low-cost eye-tracking schemes in accordance with a changing environment. The proposed method defines the internal RL environment connecting with the genotype control space instead of the input image directly as other RL environment^{21} so that the interactive characteristics of the RL algorithm are fully utilized by exploring the threshold/parameter space influenced by a changing external environment. Since the main focus of the proposed method is to optimize an eye-tracking system with acceptable computation resources, we discussed using a relatively simple eye-tracking method. However, the proposed framework can be extended to other eye tracking methods encountering difficulties in achieving optimal performance due to unstable operational environments. The future direction of our research is to apply the proposed method in a challenging vision-based HCI application for next-generation computing such as eye cursors, eye keyboards, eye mice, and vision-based emotion detection.