This paper proposes a multi-modal sensor fusion algorithm for the estimation of driver drowsiness. Driver
sleepiness is believed to be responsible for more than 30% of passenger car accidents and for 4% of all accident
fatalities. In commercial vehicles, drowsiness is blamed for 58% of single truck accidents and 31% of commercial
truck driver fatalities. This work proposes an innovative automatic sleep-onset detection system. Using multiple
sensors, the driver’s body is studied as a mechanical structure of springs and dampers. The sleep-detection
system consists of highly sensitive tri-axial accelerometers that monitor the driver’s upper body in 3-D. The
subject is modeled as a linear time-variant (LTV) system. An LMS adaptive filter estimation algorithm generates
the transfer function (i.e. weight coefficients) for this LTV system. Separate coefficients are generated for the
awake and asleep states of the subject. These coefficients are then used to train a neural network. Once trained, the
neural network classifies the condition of the driver as either awake or asleep. The system has been tested on a
total of 8 subjects. The tests were conducted on sleep-deprived individuals for the sleep state and on fully awake
individuals for the awake state. When trained and tested on the same subject, the system detected sleep and
awake states of the driver with a success rate of 95%. When the system was trained on three subjects and then tested
on a fourth “unseen” subject, the classification rate dropped to 90%. Furthermore, an attempt was made to
correlate driver posture and sleepiness by observing how car vibrations propagate through a person’s body. Eight
additional subjects were studied for this purpose. The results obtained in this experiment proved inconclusive,
which was attributed to significant differences in the subjects’ habitual postures.
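As an illustration of the modeling step, the sketch below estimates such a weight vector with an LMS adaptive filter. It is a minimal sketch only: the reference/response naming, the filter length, and the synthetic data are assumptions for demonstration, not details taken from the paper; the resulting weight vector would serve as the feature vector passed to the neural-network classifier.

```python
import numpy as np

def lms_weights(reference, response, n_taps=32, mu=0.01):
    """Estimate FIR weights mapping a reference vibration to a body-motion response via LMS."""
    w = np.zeros(n_taps)
    for n in range(n_taps, len(reference)):
        x = reference[n - n_taps:n][::-1]   # most recent samples first
        e = response[n] - w @ x             # instantaneous prediction error
        w += 2 * mu * e * x                 # LMS weight update
    return w

# Hypothetical usage: one weight vector per recording, labelled awake or asleep,
# then used as a feature vector for the classifier (a neural network in the paper).
rng = np.random.default_rng(0)
seat = rng.standard_normal(5000)                     # stand-in for the seat vibration input
body = np.convolve(seat, [0.5, 0.3, 0.1], mode="same") + 0.05 * rng.standard_normal(5000)
w_awake = lms_weights(seat, body)
```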
This paper presents a face detection system that combines audio localization and visual face detection. This audiovisual
face detection system is based on microphone sound localization and image processing algorithms. The
system integrates sound localization by Time Delay of Arrival with the iterative application of Adaptive
Background Segmentation to robustly perform real-time face detection on a stream of webcam images.
Experimental results using an array of 24 microphones and a fixed-view webcam show that the audiovisual face
detection system performs face detection with a success rate of 97.5%, a convergence time of 0.82 seconds, and a
display frame rate of 5.8 Hz on a Pentium IV 2.5 GHz.
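A minimal sketch of an adaptive background segmentation step of the kind mentioned above, using a simple running-average background model; the adaptation rate, threshold, and frame source are illustrative assumptions rather than the paper's actual parameters.

```python
import numpy as np

class AdaptiveBackground:
    """Running-average background model; foreground = pixels far from the model."""
    def __init__(self, alpha=0.05, threshold=25.0):
        self.alpha = alpha          # background adaptation rate
        self.threshold = threshold  # grey-level difference that marks foreground
        self.background = None

    def segment(self, frame):
        frame = frame.astype(np.float32)
        if self.background is None:
            self.background = frame.copy()
        mask = np.abs(frame - self.background) > self.threshold
        # Adapt the background only where the scene appears static.
        self.background[~mask] += self.alpha * (frame - self.background)[~mask]
        return mask

# Hypothetical usage on a stream of greyscale webcam frames:
seg = AdaptiveBackground()
frame = np.zeros((240, 320), dtype=np.uint8)   # placeholder for a captured frame
foreground_mask = seg.segment(frame)
```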
A general method for time delay of arrival (TDOA) estimation based on time-frequency information fusion is analyzed. This technique, of which the generalized cross correlation method and histogram methods are special cases, results in low TDOA estimation error and high computational efficiency. The proposed method relies on a non-linear phase-error selector function, which acts as a reward-and-punish mechanism for the phase error at each frequency. Three different selector function candidates, consisting of cosine, rectangular, and triangular functions, are analyzed using simulations. In the presence of Gaussian noise, the rectangular selector function performs better than the cosine at signal-to-noise ratios (SNRs) above 10 dB, while for lower SNRs the cosine function performs better. With speech noise, the cosine function, which corresponds to the generalized cross correlation technique, has higher anomaly percentages and higher root-mean-square errors than the rectangular function. This suggests that, in general, the rectangular selector function, which can be computed more easily than the cosine selector function, is a superior technique to the generalized cross correlation method for real-time applications.
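A minimal sketch of the selector-function idea under the assumptions stated here: candidate delays are scored by applying a selector function to the wrapped cross-spectrum phase error at each frequency, with the cosine selector playing the role of the generalized cross correlation weighting and the rectangular selector simply counting frequencies whose phase error falls inside a window. The window width, signal lengths, and test data are illustrative.

```python
import numpy as np

def tdoa_selector(x1, x2, fs, max_delay, selector="rectangular", width=0.5):
    """Score candidate TDOAs with a per-frequency phase-error selector function."""
    n = len(x1)
    X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
    phase = np.angle(X1 * np.conj(X2))                    # cross-spectrum phase
    omega = 2 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)
    taus = np.arange(-max_delay, max_delay, 1.0 / fs)     # candidate delays
    scores = []
    for tau in taus:
        err = np.angle(np.exp(1j * (phase - omega * tau)))  # wrapped phase error
        if selector == "cosine":       # plays the role of the generalized cross correlation weighting
            scores.append(np.cos(err).sum())
        else:                          # rectangular: count frequencies with small phase error
            scores.append(np.sum(np.abs(err) < width))
    return taus[int(np.argmax(scores))]

# Hypothetical example: channel 2 delayed by 5 samples relative to channel 1.
rng = np.random.default_rng(1)
fs = 16000
s = rng.standard_normal(4096)
x1, x2 = s, np.roll(s, 5)
print(tdoa_selector(x1, x2, fs, max_delay=20 / fs))   # ≈ 5 / fs
```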
This paper analyzes the beamforming separation effectiveness of several different microphone array configurations. These configurations consist of 4, 8, and 16 microphones placed in a plane for two-dimensional speech separation, with inter-microphone distances varying from 10 cm to 160 cm in logarithmic steps. It is found that the linear and bi-linear arrays yield the largest signal-to-noise ratio (SNR) gain after delay-and-sum beamforming for a speech signal in speech noise. Simulations also show that larger inter-microphone distances result in a higher SNR gain, although the practicality of large inter-microphone distances is limited in applications where the array size is constrained.
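A minimal sketch of a delay-and-sum beamformer of the kind compared above, assuming a far-field source, a uniform linear array, and integer-sample steering delays; the geometry, sampling rate, and noise levels are illustrative assumptions.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Steer a linear array toward `direction` (radians from the array axis) and sum the aligned channels."""
    # Far-field geometric delays relative to the first microphone.
    delays = (mic_positions - mic_positions[0]) * np.cos(direction) / c
    shifts = np.round(delays * fs).astype(int)
    aligned = [np.roll(sig, -shift) for sig, shift in zip(signals, shifts)]
    return np.mean(aligned, axis=0)

# Hypothetical example: 8-microphone linear array with 10 cm spacing, broadside source.
fs = 16000
mics = np.arange(8) * 0.10                      # microphone positions along the array axis (m)
rng = np.random.default_rng(2)
speech = rng.standard_normal(fs)                # stand-in for the target speech signal
signals = [speech + 0.5 * rng.standard_normal(fs) for _ in mics]   # uncorrelated noise per channel
output = delay_and_sum(signals, mics, direction=np.pi / 2, fs=fs)  # target adds coherently, noise does not
```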
A technique to virtually recreate speech signals entirely from the visual lip motions of a speaker is proposed. Using six geometric parameters of the lips obtained from the Tulips1 database, a virtual speech signal is recreated from a 3.6 s audiovisual training segment. It is shown that the virtual speech signal has an envelope that is directly related to the envelope of the original acoustic signal. This reconstructed visual envelope is then used as a basis for robust speech separation when the visual parameters of all the speakers are available. It is shown that, unlike previous signal separation techniques, which required an ideal mixture of independent signals, the mixture coefficients can be estimated very accurately with the proposed technique even in non-ideal situations.
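One plausible way such envelope information could be used to estimate mixture coefficients is sketched below, assuming uncorrelated sources so that the mixture's short-time frame powers are approximately a linear combination of the sources' frame powers; the framing and the least-squares formulation are my assumptions, not the paper's exact procedure.

```python
import numpy as np

def frame_power(x, frame=256):
    """Short-time power envelope of a signal."""
    n = len(x) // frame
    return np.square(x[:n * frame]).reshape(n, frame).mean(axis=1)

def estimate_mixture_coeffs(mixture, sources, frame=256):
    """Least-squares fit of squared mixture coefficients to the short-time frame powers."""
    p_mix = frame_power(mixture, frame)
    P = np.stack([frame_power(s, frame) for s in sources], axis=1)
    a_sq, *_ = np.linalg.lstsq(P, p_mix, rcond=None)
    return np.sqrt(np.clip(a_sq, 0.0, None))    # mixture coefficients assumed non-negative

# Hypothetical example: two amplitude-modulated noise signals standing in for speech,
# whose envelopes play the role of the visually reconstructed envelopes.
rng = np.random.default_rng(3)
t = np.arange(16000) / 16000
s1 = np.sin(2 * np.pi * 3 * t) ** 2 * rng.standard_normal(16000)
s2 = np.cos(2 * np.pi * 5 * t) ** 2 * rng.standard_normal(16000)
mixture = 0.8 * s1 + 0.3 * s2
print(estimate_mixture_coeffs(mixture, [s1, s2]))   # ≈ [0.8, 0.3]
```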
The benefits and problems of a multi-camera object localization system utilizing Spatial Likelihood Functions (SLFs) are explored. This method uses the angular extent of objects perceived by different cameras in order to find the region in which they intersect. This region ideally corresponds to the original location of the objects. It is shown that as long as the number of cameras is greater than the number of objects, an efficient camera fusion algorithm utilizing SLFs can be successfully employed to localize the objects. In certain situations, especially when there are more objects than cameras, false objects appear among the correctly localized objects. Several techniques to identify and remove the false objects are proposed, including a heuristic-based ray tracing approach and other multi-modal techniques. The effectiveness of the camera fusion and false object removal approaches is illustrated in the context of several examples.
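A minimal sketch of the SLF fusion idea on a 2-D grid: each camera contributes a likelihood map that is high inside the angular extent it observes and low elsewhere, and the maps are multiplied so that only regions consistent with every view remain. The grid size, camera poses, and angular widths are illustrative assumptions.

```python
import numpy as np

def camera_slf(grid_x, grid_y, cam_pos, bearing, half_width):
    """Spatial likelihood: high inside the angular extent a camera observes, low outside."""
    angles = np.arctan2(grid_y - cam_pos[1], grid_x - cam_pos[0])
    diff = np.angle(np.exp(1j * (angles - bearing)))    # wrapped angular difference
    return np.where(np.abs(diff) < half_width, 1.0, 1e-3)

# Hypothetical scene: a 10 m x 10 m room, one object near (6, 4), two corner cameras.
xs, ys = np.meshgrid(np.linspace(0, 10, 200), np.linspace(0, 10, 200))
cameras = [((0.0, 0.0), np.arctan2(4.0, 6.0)),          # each camera reports a bearing to the object
           ((10.0, 0.0), np.arctan2(4.0, -4.0))]
fused = np.ones_like(xs)
for pos, bearing in cameras:
    fused *= camera_slf(xs, ys, pos, bearing, half_width=np.deg2rad(3))
iy, ix = np.unravel_index(np.argmax(fused), fused.shape)
print(xs[iy, ix], ys[iy, ix])                           # ≈ (6, 4), the fused object location
```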
In the past several years, many different algorithms have attempted to address the problem of robust multi-source time difference of arrival (TDOA) estimation, which is necessary for sound localization. Different approaches, including generalized cross correlation, multiple signal classification (MUSIC), and the maximum likelihood (ML) approach, have made different trade-offs between robustness and efficiency. The new approach presented here offers a much more efficient yet robust mechanism for TDOA estimation. It iteratively uses small sound signal segments to compute a local cross-correlation based TDOA estimate. All of the local estimates are then combined to form the probability density function of the TDOA. Because the power of a secondary source will be greater than that of the other sources for a certain set of the local signal segments, the TDOAs corresponding to these sources will be associated with peaks in the TDOA probability density function. In this way, the TDOAs of several different sources, along with their signal strengths, can be estimated. A real-time implementation of the proposed approach is used to show its improved accuracy and robustness. The system was consistently able to correctly localize sound sources with SNRs as low as 3 dB.
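A minimal sketch of the segment-wise estimation described above: the two channels are split into short segments, each segment contributes a local cross-correlation TDOA vote, and the votes are pooled into an empirical distribution whose peaks mark the TDOAs of the individual sources. The segment length, lag range, and synthetic two-source signal are illustrative assumptions.

```python
import numpy as np

def tdoa_histogram(x1, x2, fs, seg_len=512, max_lag=32):
    """Pool per-segment cross-correlation peaks into an empirical TDOA distribution."""
    lags = np.arange(-max_lag, max_lag + 1)
    counts = np.zeros(len(lags))
    for start in range(0, len(x1) - seg_len, seg_len):
        a = x1[start:start + seg_len]
        b = x2[start:start + seg_len]
        cc = np.correlate(b, a, mode="full")       # local cross correlation (lag = delay of x2 vs x1)
        mid = seg_len - 1                          # index of zero lag
        local = cc[mid - max_lag:mid + max_lag + 1]
        counts[np.argmax(local)] += 1              # one TDOA vote per segment
    return lags / fs, counts / counts.sum()

# Hypothetical two-source example: channel 2 delayed by +4 samples for source A and -9 for source B.
rng = np.random.default_rng(4)
fs = 16000
t = np.arange(fs) / fs
env_a = (np.sin(2 * np.pi * t) > 0).astype(float)   # the two sources alternate in dominance
sa = env_a * rng.standard_normal(fs)
sb = (1.0 - env_a) * rng.standard_normal(fs)
x1 = sa + sb
x2 = np.roll(sa, 4) + np.roll(sb, -9)
taus, pdf = tdoa_histogram(x1, x2, fs)               # pdf shows peaks near 4/fs and -9/fs
```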
This paper proposes a Bayesian multi-sensor object localization approach that keeps track of the observability of the sensors in order to maximize the accuracy of the final decision. This is accomplished by adaptively monitoring the mean-square error of the localization system's results. Knowledge of this error and of the distribution of the system's object localization estimates allows the result of each sensor to be scaled and combined in an optimal Bayesian sense. It is shown that, under conditions of normality, the Bayesian sensor fusion approach is directly equivalent to a single-layer neural network with a sigmoidal non-linearity. Furthermore, spatial and temporal feedback in the neural network can be used to compensate for practical difficulties such as the spatial dependencies of adjacent positions. Experimental results using 10 binary microphone arrays yield an order-of-magnitude improvement in localization error for the proposed approach when compared to previous techniques.
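A minimal sketch of the observability-weighted fusion step, under the assumption that each sensor's localization error is zero-mean Gaussian with a variance tracked from its recent mean-square error, in which case the Bayesian combination reduces to an inverse-variance weighted average; the tracking window and sensor count are illustrative assumptions.

```python
import numpy as np

class BayesianFusion:
    """Fuse per-sensor position estimates, weighting each by its tracked mean-square error."""
    def __init__(self, n_sensors, window=50):
        self.errors = [[] for _ in range(n_sensors)]
        self.window = window

    def update_error(self, sensor, error):
        self.errors[sensor].append(error ** 2)
        self.errors[sensor] = self.errors[sensor][-self.window:]   # keep a sliding window

    def fuse(self, estimates):
        # Inverse-variance weights: sensors with low tracked MSE dominate the decision.
        mse = np.array([np.mean(e) if e else 1.0 for e in self.errors])
        w = (1.0 / mse) / np.sum(1.0 / mse)
        return float(np.dot(w, estimates))

# Hypothetical usage: sensor 0 is noisier than sensor 1, so it receives less weight.
rng = np.random.default_rng(5)
fusion = BayesianFusion(n_sensors=2)
for _ in range(50):
    fusion.update_error(0, rng.normal(0, 2.0))   # observed localization error of sensor 0
    fusion.update_error(1, rng.normal(0, 0.5))   # observed localization error of sensor 1
print(fusion.fuse(np.array([3.2, 3.0])))         # close to sensor 1's estimate (3.0)
```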