Review Papers

Survey of contemporary trends in color image segmentation

Author Affiliations
Sreenath Rao Vantaram

Rochester Institute of Technology, Chester F. Carlson Center for Imaging Science, Rochester, New York 14623

Eli Saber

Rochester Institute of Technology, Chester F. Carlson Center for Imaging Science, Rochester, New York 14623

Rochester Institute of Technology, Department of Electrical and Microelectronic Engineering, Rochester, New York 14623

J. Electron. Imaging. 21(4), 040901 (Oct 04, 2012). doi:10.1117/1.JEI.21.4.040901
History: Received May 1, 2012; Revised August 20, 2012; Accepted August 29, 2012

Open Access

Abstract.  In recent years, the acquisition of image and video information for processing, analysis, understanding, and exploitation of the underlying content in various applications, ranging from remote sensing to biomedical imaging, has grown at an unprecedented rate. Analysis by human observers is quite laborious, tiresome, and time consuming, if not infeasible, given the large and continuously rising volume of data. Hence the need for systems capable of automatically and effectively analyzing the aforementioned imagery for a variety of uses that span the spectrum from homeland security to elderly care. In order to achieve the above, tools such as image segmentation provide the appropriate foundation for expediting and improving the effectiveness of subsequent high-level tasks by providing a condensed and pertinent representation of image information. We provide a comprehensive survey of color image segmentation strategies adopted over the last decade, though notable contributions in the gray scale domain will also be discussed. Our taxonomy of segmentation techniques is sampled from a wide spectrum of spatially blind (or feature-based) approaches such as clustering and histogram thresholding as well as spatially guided (or spatial domain-based) methods such as region growing/splitting/merging, energy-driven parametric/geometric active contours, supervised/unsupervised graph cuts, and watersheds, to name a few. In addition, qualitative and quantitative results of prominent algorithms on several images from the Berkeley segmentation dataset are shown in order to furnish a fair indication of the current quality of the state of the art. Finally, we provide a brief discussion on our current perspective of the field as well as its associated future trends.

Color image segmentation facilitates the separation of the spatial-spectral attributes contained in images into their individual constituents, a task that is accomplished quite comfortably by our visual system and cortical mechanisms. However, mimicking this capability of human observers in an artificial environment has been found to be an extremely challenging problem. Formally, color image segmentation is defined as the process of partitioning or segregating an image into regions (also called clusters or groups) manifesting homogeneous or nearly homogeneous attributes such as color, texture, and gradient, as well as spatial attributes pertaining to location. Fundamentally, a segmentation algorithm for an image is said to be “complete” when it provides a unique region or label assignment for every pixel, such that all pixels in a segmented region satisfy certain criteria while the same criteria are not universally satisfied by pixels from disjoint regions.

The cardinal motivation for image segmentation is twofold. It not only provides an end user with the flexibility to efficiently access and manipulate individual content, but also furnishes a compact representation of the data wherein all subsequent processing can be done at a region/segment level as opposed to the pixel level, resulting in potentially significant computational gains. To this effect, segmentation is predominantly employed as a preprocessing step to annotate, enhance, analyze, classify, categorize, and/or abstract information from images. In general, there are many applications for color image segmentation in the image processing, computer vision, and pattern recognition fields, including content-based image retrieval (CBIR), image rendering, region classification, segment-based compression, surveillance, perceptual ranking of regions, graphics, and multimedia, to name a few. Furthermore, many approaches have been developed in other imaging modalities, such as remote sensing (multi/hyperspectral data) and biomedical imaging [computed tomography (CT), positron emission tomography (PET), and magnetic resonance imaging (MRI) data], for sophisticated applications such as large area search, three-dimensional (3-D) modeling, visualization, and navigation. The exponential growth in the number of applications that employ segmentation in itself provides a strong motivation for continued research and development.

In the context of color imagery, segmentation is often viewed as an ill-defined problem with no perfect solution but multiple generally acceptable solutions due to its subjective nature. The subjectivity of segmentation has been extensively substantiated in experiments1 conducted at the University of California at Berkeley to develop an evaluation benchmark, where a database of manually generated segmentations of images with natural content was developed using multiple human observers. In Fig. 1(a), 10 images (arbitrarily named airplane, starfish, race cars, hills, boat, church, cheetah, dolphins, lake, and skydiver) from the aforementioned database are displayed. Additionally, several manually segmented ground truths with region boundaries superimposed (in green) on the original image are shown in Fig. 1(b) to 1(f). Analysis of the obtained ground truth results by Martin et al. divulged two imperative aspects: 1. an arbitrary image may have a unique suitable segmentation outcome while others possess multiple acceptable solutions, and 2. the variability in adequate solutions is primarily due to the differences in the level of attention (or granularity) and the degree of detail from one human observer to another, as seen in Fig. 1. Consequently, most present day algorithms for segmentation aim to provide generally acceptable outcomes rather than a “gold standard” solution.

Fig. 1: Berkeley segmentation benchmark:1 (a) original images, and (b) to (f) region boundaries of multiple manually generated segmentations overlaid on the images.

There are several excellent surveys of image segmentation strategies and practices. The studies done by Fu et al.2 and Pal et al.3 are amongst the earliest ones that have been widely popular. In their work, Fu et al.2 categorized segmentation approaches developed during the 1970s and early 1980s for gray scale images into three classes, namely clustering, edge detection, and region extraction. On the other hand, Pal et al.3 reviewed more complex segmentation techniques established in the late 1980s and early 1990s that involved fuzzy/nonfuzzy mechanisms, Markov random field (MRF) probabilistic models, color information, as well as neural networks, all of which were still in their early stages of development. The surveys done by Lucchese et al.4 and Cheng et al.5 were among the first that exclusively provided an in-depth overview of algorithms targeted at segmenting color images, instituted throughout the 1990s.

In this paper, we provide a comprehensive overview of the image segmentation realm with the goals to: 1. facilitate access to contemporary procedures and practices developed in the recent past (2001 to current), 2. establish current standards of segmentation outcomes from a qualitative and quantitative standpoint by displaying results acquired from state-of-the-art techniques, 3. discuss our view on the field as it stands today, and 4. outline avenues of future research. The remainder of this paper is organized as follows. Section 2 provides broad and specific categorizations of segmentation approaches based on their inherent methodology and illustrates experimental results derived from prominent color image segmentation algorithms. Furthermore, Sec. 3 provides a brief quantitative evaluation of the aforementioned algorithms. Finally, conclusions and future directives are presented in Sec. 4.

Segmentation procedures can be broadly categorized from a high level perspective as well as specifically grouped based on their technical grounding (low level classification). The following subsections describe each of the two taxonomies in detail.

High-Level Taxonomy

Image segmentation techniques can, in general, be broadly classified (see Fig. 2) based on: 1. the image type, 2. the extent of human interaction, 3. the manner in which the image is represented for processing, 4. the number and type of attributes used, and 5. the fundamental principle of operation.

Fig. 2: High-level taxonomy of image segmentation algorithms.

The first criterion segregates algorithms that have been developed for monochrome (or single band) images from the ones that handle color (or three band) images. The second criterion distinguishes methods that require human intervention (supervised processes) from the ones that operate without any manual interference (fully automatic or unsupervised processes). The third criterion separates segmentation procedures that directly operate on the original image (single scale configuration) from the ones that operate on multiple representations of the image (multiscale configuration), each manifesting a different amount of information. The fourth criterion differentiates algorithms based on the type of image information (e.g., gray/color intensity, gradient/edge, or textural features) utilized to perform the segmentation; most methods use the aforementioned image attributes either individually (single attribute methods) or in specific combinations (multiple attribute methods). Finally, the last criterion, based on the underlying principle of operation, discriminates segmentation mechanisms as being either spatially blind or spatially guided. Spatially blind approaches, as the name suggests, are techniques that are “blind” to spatial information, or, in other words, do not take into account the spatial arrangement of pixels in an image; instead, these methods rely heavily on grouping image information in a suitable attribute/feature space. On the other hand, spatially guided approaches exploit the spatial arrangement of pixels in an image during the segmentation process.

Low-Level Taxonomy

As mentioned previously, most segmentation methods can be viewed as being either spatially blind or spatially guided. It is this distinction that forms the basis of our low-level taxonomy, where we specifically group segmentation procedures based on their technical components, as depicted in Fig. 3.

Fig. 3: Low-level taxonomy of image segmentation algorithms.

Spatially blind approaches

Spatially blind approaches perform segmentation in certain attribute/feature spaces, predominantly related to intensity (gray or color). Popular segmentation techniques that fall within the notion of being spatially blind include clustering and histogram thresholding.

Clustering

In its simplest form, clustering is a spatially blind technique wherein the image data is viewed as a point cloud on a one-dimensional (1-D) gray scale axis or in a 3-D color space (see Fig. 4) depending on the image type.

Fig. 4: Sample color images with their corresponding 3-D point clouds.

Several different color spaces with different properties—to name a few, RGB, Commission Internationale de l’Eclairage (CIE) L*a*b* and L*u*v*, YCbCr, and HSI—have been extensively utilized for segmentation.6 The essence of a typical clustering protocol is to analyze this gray/color intensity point cloud and partition it using predefined metrics/objective functions to identify meaningful pixel groupings, also known as classes or clusters. Furthermore, the clustering process is performed such that, when complete, the pixel data within a specific class possess, in general, a high degree of similarity, while the data between classes have low affinity. The biggest advantage of clustering approaches over others is inherent in their simplicity and ease of implementation. However, since the point clouds generated are image dependent, selecting/initializing the number of clusters so as to obtain a semantic partitioning of the image being processed is a challenging task, especially in the case of color imagery. Furthermore, as the dimensionality of the feature space increases, acquiring definitive clusters becomes an exponentially more arduous task.

Although many clustering approaches have been developed for various applications, partitioning a feature space using a specific set of fixed points is the most widely adopted approach. Voronoi tessellation (VT) is a procedure in which a feature space is decomposed into various clusters (called Voronoi cells/regions) using a fixed set of points called sites, such that each observation in the feature space is assigned to the closest site predicated on a certain distance metric. More specifically, if $X$ is a feature space equipped with a distance function $d$, and $\{P_k\}_{k=1}^{K}$ is a set of $K$ sites in the space, then the Voronoi cell $V_k$ formed by the site $P_k$ is the set of all points $x \in X$ that satisfy:

$$V_k = \{\, x \in X \;|\; d(x, P_k) \le d(x, P_j) \ \forall\, j \ne k \,\}, \tag{1}$$

where $d(x, P_k)$ represents the distance from $x$ to $P_k$. Arbeláez et al.7 proposed a VT-based image segmentation technique utilizing color and lightness information derived from the image. The segmentation objective was achieved in a two-step process comprised of: 1. presegmentation and 2. hierarchical representation. The presegmentation step employed a VT process wherein the extreme components of the lightness (L*) channel were used as sites to form an extrema mosaic of Voronoi regions. The second step involved the development of a stratified hierarchy of partitions derived from the extrema mosaic using a pseudo-distance metric called an ultrametric, specifically defined for color images. Subsequently, a single real-valued soft boundary image called the ultrametric contour map (UCM) was constructed to arrive at the final segmentation.
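To make Eq. (1) concrete, the following Python/NumPy sketch assigns each feature-space observation to its nearest site under the Euclidean metric; the toy pixel data and sites below are hypothetical placeholders, not values from any surveyed work.

```python
import numpy as np

def voronoi_assign(features, sites):
    """Assign every observation x to its nearest site P_k, per Eq. (1).

    features: (N, D) array of observations in the feature space X.
    sites:    (K, D) array of sites P_k.
    Returns an (N,) array of Voronoi cell indices.
    """
    # Squared Euclidean distances d(x, P_k) between all observations and sites.
    d = ((features[:, None, :] - sites[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# Toy example: four RGB pixels partitioned by two hypothetical sites.
pixels = np.array([[255, 0, 0], [250, 5, 5], [0, 0, 255], [5, 5, 250]], dtype=float)
sites = np.array([[255, 0, 0], [0, 0, 255]], dtype=float)
print(voronoi_assign(pixels, sites))  # -> [0 0 1 1]
```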

Centroidal Voronoi tessellation (CVT) is a special category of VT wherein each site producing a Voronoi cell coincides with the cell’s center of mass. Wang et al.8 generalized the basic CVT by integrating an edge-related energy function with a classic clustering energy metric to propose an edge-weighted centroidal Voronoi tessellation (EWCVT) for effective segmentation of color images. CVTs form the core of many prominent clustering algorithms such as K-means. The K-means algorithm partitions a set of n pixels into K clusters by minimizing an objective function. From a color segmentation perspective, the aforementioned algorithm analyzes the image data in a 3-D space to eventually identify K sites (known as cluster centers or means) such that the mean squared distance from each data point to its nearest center is minimized. To this effect, in an arbitrary iteration (called a Voronoi iteration or Lloyd’s algorithm), the entire color space is separated into K partitions by assigning each observation to the cluster with the closest center (note that initialization in the first iteration may be done randomly). Following this, a new estimate of each cluster center is computed based on the current cluster assignments and is utilized as input to the next iteration of the algorithm. These steps are repeated until convergence is achieved. MacQueen9 was the first to employ the K-means algorithm to handle multivariate data. Among recent advances, Kanungo et al.10 proposed an efficient version of the algorithm called the “filtering algorithm,” which utilizes a k-dimensional (kd) tree representation of the image data. For each node of this tree, a set of candidate centers is determined and strategically filtered as the candidates are propagated to the node’s children. The biggest advantage of this approach is that, since the kd-tree is formed from the original data rather than from the computed centers, its structure does not need to be updated at every iteration, in contrast to the conventional K-means architecture. Chen et al.11 employed a generalization of the classical K-means algorithm, better known as the adaptive clustering algorithm (ACA), with spatially adaptive features pertaining to color and texture, to yield perceptually tuned segmentations. Consequently, the ACA method is an exception to the norm of spatially blind clustering. In his work, Mignotte12 proposed a novel color image segmentation procedure based on the fusion of multiple K-means clustering results, obtained by minimizing a Euclidean distance function in six different color spaces, namely RGB, HSV, YIQ, XYZ, LAB, and LUV. Once the label fields from each of these color spaces are obtained, a local histogram of the class labels across the aforementioned label fields is computed for each pixel, and the set of all histograms is employed as input feature vectors to a fusion mechanism that culminates in the final segmentation output. The fusion scheme comprises the K-means algorithm using the Bhattacharyya similarity coefficient, a histogram-based similarity metric. The algorithm in Mignotte12 was further enhanced in Mignotte13 by using a spatially constrained K-means labeling process in place of the fusion mechanism to arrive at the optimal result.
While the prior algorithm developed by Mignotte was aimed at exploring the possibility of integrating multiple segmentation maps from simple data partitioning models to obtain an accurate result, the latter algorithm was novel in the sense that implicit spatial associations in an image were taken into account within the K-means framework (similar to the work in 14) to uncover the best solution; consequently, the process was not completely spatially blind.
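To ground the Lloyd iteration described above, the following minimal Python/NumPy sketch clusters an image’s 3-D color point cloud; the random initialization, fixed iteration count, and parameter values are illustrative assumptions rather than the settings of any surveyed algorithm.

```python
import numpy as np

def kmeans_colors(pixels, K, iters=20, seed=0):
    """Lloyd's algorithm (K-means) on an (N, 3) color point cloud."""
    pixels = np.asarray(pixels, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initialization: pick K distinct pixels as the starting centers.
    centers = pixels[rng.choice(len(pixels), size=K, replace=False)].copy()
    for _ in range(iters):
        # Assignment step (a Voronoi iteration): each observation joins
        # the cluster with the closest center.
        dist = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update step: re-estimate every center as the mean of its members.
        for k in range(K):
            members = pixels[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels, centers

# Usage on a hypothetical H x W x 3 image array `img`:
# labels, centers = kmeans_colors(img.reshape(-1, 3), K=5)
```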

Mean shift clustering15 is another routine that has had pervasive use for gray/color image segmentation within the last decade. This generic nonparametric technique facilitates the analysis of multidimensional feature spaces with arbitrarily shaped clusters, based on the “mean shift” concept originally proposed by Fukunaga et al.16 The mean shift procedure is a kernel density estimation (or Parzen window-based) technique that scrutinizes a feature space as an empirical probability density function (pdf) and considers the set of pixel values from an arbitrary image as discrete samples of that function. The procedure exploits the fact that clusters/dense regions in a feature space typically manifest themselves as modes of the aforementioned pdf. In what follows, if $S$ is a finite point cloud in an $n$-dimensional Euclidean space $X$, and $K$ is a symmetric kernel function with specific characteristics, then the sample mean $m(x)$ at a pixel $x \in X$, computed utilizing a weighted combination of its nearby points as determined by $K$, is given as:17

$$m(x) = \frac{\sum_{s \in S} K(s - x)\, s}{\sum_{s \in S} K(s - x)}. \tag{2}$$

To this effect, at every pixel location $x$, a mean shift vector $m(x) - x$ is obtained with $K$ centered at $x$, such that the vector points in the direction of the maximum increase in density. Subsequently, the operation $x \leftarrow m(x)$ is performed, shifting $x$ toward the mean, followed by the re-estimation of $m(x)$. This process is repeated until convergence of $m(x)$ is achieved. At the end of the mean shift process, the closest peak in the pdf is identified for each pixel. Since the mean shift algorithm uses spatial knowledge in its framework, it also represents an exception to conventional spatially blind clustering. Mean shift clustering guided by edge information was first seen in the work by Christoudias et al.,18 who proposed the edge detection and image segmentation (EDISON) system, aimed at improving the sensitivity of extracting homogeneous regions while maintaining or ideally minimizing over-segmentation of an image. Figure 5 illustrates a few results of the EDISON system using default parameters (spatial bandwidth $h_s = 7$, color bandwidth $h_r = 6.5$, and minimum region size $M = 20$). Hong et al.19 proposed an improved version of the mean shift segmentation algorithm by incorporating: 1. an enhanced technique for mode detection, 2. an optimized process for the global analysis of the locally identified modes, and 3. the elimination of textured areas in order to achieve stable results in various background conditions. Ozden et al.20 pioneered an effective technique that combined low-level color, spatial, and texture features in the mean shift framework for color image segmentation.
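A minimal sketch of the mode-seeking iteration in Eq. (2) follows; the flat (uniform) kernel, tolerance, and iteration cap are illustrative assumptions, and the starting point is presumed to lie within the point cloud so the kernel window is never empty.

```python
import numpy as np

def mean_shift_point(x, S, h, tol=1e-3, max_iter=100):
    """Seek the density mode that x converges to under Eq. (2).

    S is the (N, D) feature point cloud; a flat kernel of bandwidth h
    is assumed (K = 1 inside the radius-h ball, 0 outside).
    """
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        # Points falling inside the kernel window centered at x.
        mask = ((S - x) ** 2).sum(axis=1) <= h ** 2
        m = S[mask].mean(axis=0)            # sample mean m(x) of Eq. (2)
        if np.linalg.norm(m - x) < tol:     # mean shift vector m(x) - x vanished
            return m
        x = m                               # the shift x <- m(x)
    return x
```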

Neural networks-based data clustering is a category that has originated exclusively from the field of artificial intelligence. Within this domain, methods involving self-organizing maps (SOMs) have received the most attention in the last decade. A self-organizing map or self-organizing feature map (SOFM), alternately known as a Kohonen map, is a specific kind of artificial neural network (ANN) that was first introduced by Kohonen21 as a tool for providing intelligent representations of high/multi-dimensional feature spaces in significantly lower (one or two) dimensions. A SOM (shown in Fig. 6) comprises an input layer of nodes/neurons organized in a vector whose size matches the dimensionality of the input feature space. Each node is connected in parallel to a two-dimensional (2-D) output layer of neurons in a rectangular or hexagonal arrangement, as well as to its corresponding neighboring neurons, utilizing an appropriate weighting scheme that signifies the strength of the various connections. A SOM operates in a “training” phase that gradually constructs a feature map using a subset of samples from the input feature space, followed by a mapping routine in which an arbitrary new input sample is automatically classified. At the culmination of the two modes of operation, a low-dimensional map is generated that reflects the topological relationships of samples in the feature space predicated on their similarity. In other words, samples that have similar characteristics in the input feature space form distinct clusters in this map.

Fig. 6: Self-organizing map (SOM) in a rectangular neural layout.
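The training phase described above reduces to a simple loop: find the best matching unit for each sample, then pull that node and its grid neighbors toward the sample with a decaying learning rate and neighborhood radius. The sketch below, with an arbitrary 8x8 rectangular grid and illustrative linear decay schedules, is one minimal rendering of that idea, not the configuration of any surveyed system.

```python
import numpy as np

def train_som(samples, grid=(8, 8), epochs=5, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small rectangular SOM on (N, D) feature samples."""
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, samples.shape[1]))    # connection weights
    coords = np.stack(np.mgrid[0:rows, 0:cols], axis=-1)    # node positions on the grid
    total_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(samples):
            # Best matching unit (BMU): node whose weights are closest to x.
            d = ((weights - x) ** 2).sum(axis=2)
            bmu = np.array(np.unravel_index(d.argmin(), d.shape))
            # Learning rate and neighborhood radius shrink as training proceeds.
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 1e-2
            # Gaussian neighborhood: nodes near the BMU on the grid move most.
            g = np.exp(-((coords - bmu) ** 2).sum(axis=2) / (2.0 * sigma ** 2))
            weights += lr * g[..., None] * (x - weights)
            step += 1
    return weights
```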

Huang et al.22 developed a color image segmentation methodology that employed a two-stage SOM-based ANN. The algorithm is initiated by an RGB to HVC (hue-value-chroma) color conversion of the input image, which is employed by a SOM to identify a large initial set of color classes. The resultant set of classes is further refined by first computing the normalized Euclidean distances among them; the obtained between-class distances are then furnished as inputs to a second SOM that identifies the final batch of segmented clusters. In a similar scheme, Ong et al.23 constructed a color image segmentation technique based on a hierarchical two-stage SOM in which the first stage identifies dominant colors in the input image presented in the L*u*v* color space, while the second stage integrates a variable-sized 1-D feature map and cluster merging/discarding operations to acquire the eventual segmentation result. Li et al.24 demonstrated an effective color image segmentation approach using a SOM and the fuzzy-C-means (FCM) clustering procedure. The algorithm is initiated by finding well-suited image-dependent features derived from five different color spaces using a SOM. Subsequently, the obtained features were employed in an FCM protocol to attain the output segmentation. Dong et al.25 instituted two alternate ANN-based strategies for color image segmentation. The first strategy was unsupervised: it involved distinguishing a set of color prototypes, using SOM-based learning, from the input image represented in the L*u*v* color space. These color prototypes were passed on to a simulated annealing-driven clustering routine to yield well-defined clusters. The second method built on the aforementioned algorithm, coupling it with hierarchical pixel learning (which generated color prototypes of different sizes in the scene) and classification protocols to provide more accurate segmentation outcomes in a supervised fashion. Partitioning of color imagery using SOM and adaptive resonance theory (ART) was first seen in the work of Yeo et al.,26 who proposed two compound ANN architectures called SOMART and SmART (SOM unified with a variation of ART) that yielded improved segmentations in comparison to traditional SOM-based techniques. On the other hand, Araújo et al.27 designed a fast and robust self-adaptive topology ANN model called local adaptive receptive field SOM (LARFSOM) that deduced compact clusters and inferred their appropriate number based on color distributions learned rapidly from the network’s training phase using a small percentage of pixels from the input image. The algorithm was tested on color images with varying segmentation complexities and was found to outperform several prior SOM-based techniques. Frisch28 introduced a novel approach, robust to illumination variations, that employed SOMs for the construction of fuzzy measures applicable to color image segmentation. This work was a unique attempt wherein efficient fuzzy measures to accomplish the segmentation task were derived using SOM-based processing. Ilea et al.29 devised a fully automatic image segmentation algorithm called CTex using color and texture descriptors. The CTex segmentation architecture first extracts dominant colors from the input image presented in the RGB and YIQ color spaces using a SOM classifier. In doing so, the appropriate number of clusters (K) in the scene is also identified.
Subsequently, a conventional K-means clustering algorithm is employed in a six-dimensional (6-D) multispace spanned by both the above stated color spaces to obtain a segmentation result purely based on color information. This is followed by the computation of textural features using a Gabor filter bank, which, along with the previously acquired segments, are provided as input to a novel adaptive spatial K-means (ASKM) clustering algorithm that delineates coherent regions of color and texture in the input image.

The clustering techniques discussed so far are typically categorized as hard clustering approaches, since every observation in the feature space has a unique and mandatory cluster assignment, yielding clusters with sharp boundaries. In contrast, significant work has been done on the advancement of fuzzy clustering methods that allow observations to bear a certain degree of belongingness or membership to more than one cluster, resulting in overlapping clusters and/or clusters with “soft” boundaries. The Fuzzy-C-Means (FCM) algorithm, originally developed by Dunn,30 is the most widely utilized fuzzy clustering methodology. Similar to the K-means technique, it partitions a set of $n$ pixels $X = \{x_1, \ldots, x_n\}$ into $C$ fuzzy clusters $\{c_1, \ldots, c_C\}$ by minimizing an objective function, represented as:

$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{C} u_{ij}^{m} \, \|x_i - c_j\|^2, \tag{3}$$

where

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\|x_i - c_j\|}{\|x_i - c_k\|} \right)^{2/(m-1)}}, \qquad c_j = \frac{\sum_{i=1}^{n} u_{ij}^{m} \, x_i}{\sum_{i=1}^{n} u_{ij}^{m}}. \tag{4}$$

From Eqs. (3) and (4) it can be inferred that the FCM objective function differs from that of K-means by incorporating membership values $u_{ij}$ for the various observations $x_i$ in the feature space, as well as a “fuzzifier” term $m \in [1, \infty)$ that directs the extent of cluster fuzziness.
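As one concrete reading of Eqs. (3) and (4), the sketch below performs a single FCM iteration in Python/NumPy: centers are updated as fuzzy-weighted means, and memberships are recomputed from relative distances. The array shapes and the small epsilon guard are illustrative assumptions.

```python
import numpy as np

def fcm_step(X, C, U, m=2.0):
    """One FCM iteration: update centers c_j, then memberships u_ij (Eq. 4).

    X: (n, D) observations; C: (c, D) cluster centers; U: (n, c) memberships.
    """
    Um = U ** m
    # Center update: fuzzy-membership-weighted mean of all observations.
    C = (Um.T @ X) / Um.sum(axis=0)[:, None]
    # Distances ||x_i - c_j||, guarded against division by zero.
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
    # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
    U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
    return C, U
```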

In their work, Yang et al.31 proposed two eigen-based fuzzy clustering routines, namely separate eigenspace FCM (SEFCM) and coupled eigen-based FCM (CEFCM), for segmentation of objects with desired attributes in color imagery. Given an arbitrary image with a predefined set of pixels, the color space in which the image is expressed is initially divided into two eigenspaces, called the principal and residual eigenspaces, using the principal component transformation. Following this, the SEFCM design obtains a segmentation output by integrating the results of independently applying the FCM algorithm to the aforementioned eigenspaces; the integration process involves a logical selection of common pixels from the two clustering results. On the other hand, the CEFCM arrangement obtains an output segmentation result by jointly considering the principal and residual eigenspaces. Both routines were found to outperform the traditional FCM clustering approach from a color object segmentation perspective. Liew et al.32 instituted an adaptive fuzzy clustering scheme that imposes local spatial continuity using contextual information. The method was targeted at exploiting the inter-pixel correlation existent in most conventional imagery within a fuzzy framework. Chen et al.14 proposed a computationally efficient version of the FCM algorithm using a two-phase scheme involving data reduction followed by clustering; this approach was found to produce results of similar quality to the conventional FCM. More recently, Hung et al.33 developed a weighted FCM (WFCM) clustering technique wherein the weights for various features were computed using a bootstrap method. Incorporating the bootstrap approach was found to provide satisfactory weights for individual features from a statistical variation viewpoint and to enhance the performance of the WFCM procedure. Tziakos et al.34 proposed an approach using the Laplacian eigenmap (LE) algorithm, a manifold learning technique, to boost the performance of FCM clustering. The methodology commences by extracting local image characteristics from overlapping regions in a high-dimensional feature space, from which a low-dimensional manifold is learned using spectral graph theory. Following this, the LE-based dimensionality reduction technique is used to compute a low-dimensional map that captures local image characteristic variations, which is eventually used to enhance the performance of FCM clustering. Krinidis et al.35 and Wang et al.36 developed alternate yet efficient versions of the FCM scheme that employ both local intensity and spatial information. Yu et al.37 introduced an ant colony-fuzzy C-means hybrid algorithm (AFHA) for color image segmentation that adaptively clusters image elements utilizing intelligent cluster center initializations as well as subsampling for computational efficiency. The results of the AFHA structure were found to have smaller distortions and more stable cluster centroids than the conventional FCM.

Besides the practices discussed so far in this section, several unique clustering-based methods for image segmentation have also been proposed. Veenman et al.38 developed an efficient and optimized model for clustering using a cellular co-evolutionary algorithm (CCA) that does not require any prior knowledge of the number of clusters. On the other hand, Allili et al.39 instituted a clustering model that combined a generalized Gaussian mixture model with a pertinent feature selection to alleviate problems of under/over segmentation. Jeon et al.40 introduced a sparse clustering method for color image data using the principle of positive tensor factorization (PTF). Aghbari et al.41 proposed a hill-manipulation process where the protocol of segmenting an arbitrary color image was treated in an equivalent fashion to that of finding hills in its corresponding 3-D intensity histogram. Ma et al.42 introduced the notion of clustering using lossy data coding and compression and demonstrated their work on natural scene color images. The algorithm in Ma et al.42 was employed by Yang et al.,43 who proposed a compression-based texture merging (CTM) routine that treated segmentation as a task of clustering textural features modeled as a mixture of Gaussian distributions, wherein the clustering methodology was acquired from a lossy data compression protocol. Sample segmentation outcomes of the CTM algorithm using default parameters (coding data length parameter γ=0.2) are exhibited in Fig. 7. Huang et al.44 advocated the concept of pure “clustering-then-labeling” for efficient segmentation of color images.

Histogram thresholding

Histogram thresholding [see 45 for a comprehensive survey] is a spatially blind technique primarily based on the principle that segments of an image can be identified by delineating peaks, valleys, and/or shapes in its corresponding intensity histogram. Similar to clustering, histogram thresholding protocols require minimal effort to realize in comparison with most other segmentation algorithms and function without the need for any a priori information about the image being partitioned. Owing to its simplicity, intensity histogram thresholding initially gained popularity for segmenting gray-scale images. However, during its course of development, it was found that thresholding intensity histograms did not work well for low-contrast images without obvious peaks and yielded ambiguous partitions in the presence of spurious peaks manifested by noise. Additionally, for color images, it was determined that thresholding in a multidimensional space is a difficult task. Figure 8 illustrates color histograms of the starfish and boat images in the RGB space, generated using an open-source ImageJ plugin called Color Inspector 3D.46 Each color bin in the 3-D histogram is represented as a sphere whose volume is proportional to the frequency of occurrence of that color. From the histograms, it can be observed that finding multiple thresholds to efficiently partition them presents a challenging problem.

Fig. 8: Sample color images with their corresponding 3-D color histograms.
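For the single-channel case described above, the classic Otsu criterion is one common way to pick a threshold between histogram peaks automatically; the sketch below is a minimal NumPy rendering of it, included purely as an illustration of histogram-based threshold selection rather than as one of the surveyed color techniques.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's threshold for an 8-bit gray-scale image (2-D uint8 array)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                       # normalized histogram
    omega = np.cumsum(p)                        # probability of the "dark" class
    mu = np.cumsum(p * np.arange(256))          # cumulative mean
    mu_total = mu[-1]
    # Between-class variance for every candidate threshold t = 0..255.
    sigma_b2 = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega) + 1e-12)
    return int(sigma_b2.argmax())               # pixels <= t form one segment
```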

Kurugollu et al.47 proposed an algorithm for color image segmentation that involved two major steps, namely multithresholding and fusion. The method is initiated by forming 2-D histograms using pair-wise band combinations (RG, GB, and BR), each of which is subjected to a peak finding protocol. Following this, based on the delineated peaks, a multithresholding scheme is used to form a segmentation result unique to each pair of channels. These three segmentation results are fused using a label concordance algorithm and refined using a spatial chromatic majority filter to derive the final segmentation result. In a similar framework, Cheng et al.48 designed a color image segmentation scheme based on the idea of thresholding a homogram, which simultaneously captures the occurrence of gray levels along with adjoining homogeneity values among pixels. The segmentation routine is initiated by forming a homogram individually for each color channel, the peaks of which guide a subsequent thresholding scheme to acquire an initial oversegmented set of clusters. Finally, the three sets of clustering results from the red, green, and blue planes, respectively, are united to achieve the final segmentation. Mushrif et al.49 exploited the concept of Histon thresholding based on rough set theory to devise an efficient algorithm for color image segmentation. A Histon is defined as a set of pixels that are all potentially part of a particular segment. Their three-step architecture involved computing a Histon, followed by thresholding, and culminated in a region merging process (discussed in Sec. 2.2.2.1). Additionally, they further enhanced the aforementioned methodology through the work in Mushrif et al.50 using an Atanassov’s intuitionistic fuzzy set (A-IFS) Histon representation of the input image. In their work, Manay et al.51 proposed an adaptive thresholding structure for fast segmentation using an anisotropic diffusion model, based on the principle that, for an arbitrary local area, an optimal threshold can be derived close to image edges using a smoothed version of the image.

Spatially guided approaches

In contrast to spatially blind methods, spatially guided approaches, as the name suggests, are guided by the spatial relationships of pixels during segmentation. Their primary objective is to form pixel groupings that are compact or homogeneous from a spatial standpoint, irrespective of their relationships in specific feature spaces. Despite the development of many spatially guided techniques, the use of region and edge information, explicitly or in an integrated framework, remains a widely accepted alternative for the formation of spatially compact regions.

Segmentation techniques that distinctly use region information typically employ protocols involving growing, splitting, and merging, individually or in suitable combinations. For the formation of an arbitrary region, growing is a process that starts from a single pixel or a small predefined labeled set of pixels called a seed and, based on a certain homogeneity criterion, iteratively accumulates pixels around it, as depicted in Fig. 9. The growth of a region stops when pixels satisfying the homogeneity criterion are no longer found. Most growing approaches create spatially connected and compact regions relative to other routines in the literature. Additionally, the established regions possess specific user-defined properties with high tolerance to noise. However, the sequential design (pixel-by-pixel agglomeration) of growing procedures often results in computationally intensive schemes with significant memory requirements. In addition, the presence of varying shades of color generally produces oversegmented outputs. Furthermore, the quality of the segmentation is heavily dependent on the order in which the seeds are processed.

Fig. 9: Seed pixels (left) and region formed after a few iterations of growing (right).
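A minimal sketch of seeded growing for a single gray-scale region follows; the homogeneity criterion (absolute difference from the running region mean, within a tolerance) and 4-connectivity are illustrative placeholders for the criteria used by the surveyed methods.

```python
import numpy as np
from collections import deque

def grow_region(image, seed, tol=10.0):
    """Grow a single region from a seed pixel by iterative agglomeration.

    image: 2-D gray-scale array; seed: (row, col) tuple.
    A 4-connected neighbor is accumulated when its intensity lies within
    `tol` of the running region mean, the homogeneity criterion assumed here.
    """
    h, w = image.shape
    member = np.zeros((h, w), dtype=bool)
    member[seed] = True
    total, count = float(image[seed]), 1
    frontier = deque([seed])
    while frontier:                       # growth stops when no pixel qualifies
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not member[nr, nc]:
                if abs(float(image[nr, nc]) - total / count) <= tol:
                    member[nr, nc] = True
                    total += float(image[nr, nc])
                    count += 1
                    frontier.append((nr, nc))
    return member
```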

In comparison with region growing, region splitting is a technique that is initiated with an inhomogeneous segmentation of an image, which is repetitively split until segments satisfying a particular homogeneity criterion are obtained. Splitting can be achieved via diverse methods, such as quadtree decomposition or watersheds, or implicitly via region growing when multiple seeds are placed in homogeneous areas; since these methods fall under different categories of our low-level taxonomy, we do not explicitly group them in our discussion. The aforementioned growing and splitting procedures generally yield good results for simple images with well-defined homogeneous regions. However, utilizing them purely on the basis of color homogeneity may, in general, pose difficulties due to varying shades of color, nonuniformity of color spaces, illumination, and texture disparities. Region merging is a process in which subregions, potentially part of a larger identifiable region, are fused together to yield a reduced set of segments that are spatially meaningful with respect to the input image content (see Fig. 10 for a simplified illustration). In general, for reasonably complex images, growing/splitting methods often result in an oversegmented region map. As a result, they are often integrated with some type of region-merging scheme to improve the final outcome.

Fig. 10: Initial regions (left), and updated region map after a few iterations of merging (right).
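The sketch below illustrates the merging idea on a labeled gray-scale region map: adjacent regions are repeatedly fused while the most similar neighboring pair has a mean-intensity difference below a tolerance. The greedy strategy, 4-adjacency, and mean-difference criterion are illustrative assumptions, not the criteria of any particular surveyed algorithm.

```python
import numpy as np

def merge_regions(labels, image, tol=20.0):
    """Greedily merge adjacent regions with similar mean intensities.

    labels: (H, W) integer region map (e.g., an oversegmented result).
    image:  (H, W) gray-scale image the regions were derived from.
    """
    labels = labels.copy()
    while True:
        means = {i: image[labels == i].mean() for i in np.unique(labels)}
        # Collect 4-adjacent label pairs from horizontal/vertical transitions.
        pairs = set()
        for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
            edge = a != b
            pairs.update(zip(a[edge].tolist(), b[edge].tolist()))
        # Find the most similar pair of neighboring regions.
        best = min(pairs, key=lambda p: abs(means[p[0]] - means[p[1]]), default=None)
        if best is None or abs(means[best[0]] - means[best[1]]) > tol:
            return labels
        labels[labels == best[1]] = best[0]   # fuse the pair and repeat
```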

Region-growing approaches

Fan et al.52 proposed an automatic image segmentation algorithm that begins with an edge detection scheme, wherein the centroids between the detected edges are chosen as the set of candidate seed points. Subsequently, a growth procedure is utilized to spatially integrate pixels, in a recursive fashion, to appropriately chosen seeds from the entire set until the final segmentation is achieved. Wan et al.53 were the first to introduce theoretical criteria for a specific category of region growing algorithms, called symmetric region growing, that are insensitive to the selection of the initial seed points. Fondón et al.54 introduced a multistep region growing procedure for color image segmentation in which the extent of growth can be controlled using a tolerance parameter dependent on the variance of the currently grown region. Although the method was successful in accurately delineating the spatial extent of regions, it necessitated manual selection of seed points and could only handle one region at a time. Qin et al.55 advocated the use of semantic information for effective region growing and proposed an MRF-based multivariate image segmentation algorithm.

Region-merging approaches

Similar to growing, a significant number of approaches have been proposed that explicitly use a merging protocol for region-based segmentation. Devaux et al.56 built a unique segmentation architecture that employed the Karhunen-Loeve transform (KLT) in combination with color and textural attributes for region-based segmentation of color aerial images. The algorithm separately exploited color and texture information to produce two initial segmentation maps that are subsequently fused together in a merging protocol. Chen et al.57 developed a segmentation technique based on color contrast. The technique began by converting the color input image from RGB to CIE L*a*b* and then utilized the latter values to estimate contrast information with four directional operators. The estimated contrast map was thresholded to identify an initial set of regions, which were appropriately merged using a connection and verification process. Nock et al.58 explored a statistical region merging structure for segmenting image data, based on the notion that perceptual grouping with region merging can effectively encapsulate the big picture of a scene using primarily local glimpses of it. Nock et al.59 further enhanced their work in 2005 by treating statistical region merging as a nonparametric mixture model estimation problem. In his work, Kim60 devised an approach for segmenting low-depth-of-field images using morphological filters and region merging. The procedure involved an initial conversion of a low-depth-of-field image to an alternate feature space representing higher order statistics (HOS). The resultant HOS map was simplified using morphological reconstruction followed by region merging to produce the output segmentation result. Kuan et al.61 presented a novel region merging strategy for segmenting salient regions in color images. The technique generated an initial set of regions by extracting dominant colors in the input image using a nonparametric density estimation methodology. Subsequently, a merging protocol based on “importance index” and “merging likelihood” criteria was used to refine the initial set. With a similar global objective to the work in Kuan et al.,61 Liu et al.62 proposed an unsupervised segmentation algorithm aimed at salient object extraction. The method was based on region merging in a binary partition tree (BPT) framework and utilized a novel dissimilarity measure that considered color difference, area, and adjacency degree criteria. A robust termination criterion for conventional region merging algorithms, based on a novel distinctness predicate of adjacent regions, was proposed in Tan et al.63 The effectiveness of the aforementioned criterion was demonstrated using two new merging criteria based on local and global image characteristics. Region merging techniques using statistical measures from the field of information theory were first seen in the work of Calderero et al.64 The proposed merging protocols were unique in that they did not make any assumptions of color or texture homogeneity of regions, but were instead characterized by nonparametric region models.

Hybrid growing-merging approaches

Integration of growing and merging is another popular region-based methodology in the segmentation realm. Deng et al.65 proposed the prominently known J-SEGmentation (JSEG) algorithm that integrated color quantization and spatial segmentation for extraction of color-texture regions in images and video (see Fig. 11). The JSEG method commences with a color quantization step used to obtain a “color class map,” which is subsequently employed to compute a J-image based on certain spatial constraints. These constraints were designed such that the resultant J-image corresponds to measurements of local homogeneity that take high values at region boundaries and low values within homogeneous color-texture regions. Subsequently, the J-image is utilized as a reference to identify suitable seed points for initiating a region growing process, and the obtained regions are eventually refined in a merging process using a color similarity criterion. Although the JSEG method was efficient in deriving spatially compact regions, its use of color quantization caused oversegmentation in regions of varying shades due to illumination or texture disparities, as seen in some of the results of Fig. 11 (see the cheetah, skydiver, and lake images). The segmentation outcomes displayed in Fig. 11 were obtained using default parametric settings, where the color quantization threshold (qthresh) and number of scales (Iscale) are, by default, automatically computed, while the region merge threshold (mthresh) was set to 0.4.

Fig. 11: Results of the JSEG algorithm.65

The aforementioned drawback of the JSEG technique was addressed to a certain extent by Wang et al.,66 who advocated the use of mean shift clustering instead of color quantization for improved results. In their work, Wang et al.67 uncovered another drawback of the JSEG procedure by demonstrating that ignoring color discontinuity in the computation of the J-measure causes oversegmented results. To overcome this deficiency, they proposed a novel hybrid measure of homogeneity. Amidst other hybrid approaches, Cheng68 postulated a segmentation procedure for color image data in a growing-merging framework integrated with 3-D clustering and relaxation labeling. Shih et al.69 developed a segmentation algorithm based on seeded region growing and merging, incorporating strategies to avoid pixel order dependencies. He et al.70 employed the concept of gradient vector flow (GVF) in a seeded region growing and region adjacency graph (RAG)-based merging architecture. Dynamic color gradient thresholding (DCGT) integrated with a growing-merging scheme was first seen in the work by Balasubramanian et al.71 The DCGT technique was used to guide a region growth procedure for the formation of an initial set of regions, which were further refined in a merging protocol; both steps were performed purely on the basis of color information. Consequently, the DCGT algorithm faced problems of oversegmentation due to the lack of a texture descriptor, and it was computationally expensive. Figure 12 portrays the segmentation outcomes achieved by the DCGT algorithm using the default parametric values described in Balasubramanian et al.71

Fig. 12: Results of the DCGT algorithm.71

More recently, Ugarriza et al.72 proposed a novel Gradient SEGmentation (GSEG) algorithm, which simultaneously lays emphasis on the homogeneous and heterogeneous characteristics of image data using color-texture-gradient attributes and does not exclusively depend on the initial assignment of clusters to arrive at the final segmentation (see Fig. 13). Region formation in the GSEG method is performed using a unique growing approach based on the principle of dynamic gradient thresholding, which iteratively thresholds gradient information derived from the image, commencing at pixel clusters with small gradient magnitudes (gradually varying intensity) and culminating at pixel groupings possessing large gradient magnitudes (abrupt intensity variations), with no dependency on the order in which they are processed. Another aspect of their growth approach that differs from conventional approaches is the dynamic seed addition component, which accommodates the simultaneous growth of several adjacent and/or nonadjacent image regions. The resulting regions are optimized using an efficient region merging approach based on statistical analysis of a multivariate space involving the aforementioned attributes. The work in Ugarriza et al.72 was enhanced by Vantaram et al.,73 who proposed a multiresolution extension of the GSEG methodology called MAPGSEG and demonstrated that the segmentation results of low-resolution images can be utilized to efficiently partition their corresponding high-resolution counterparts. Overall, the MAPGSEG framework was shown to achieve, in general, segmentation results comparable to the GSEG algorithm (as seen in Figs. 13 and 14) at half its computational expense. The parametric settings utilized to achieve the results depicted in Figs. 13 and 14 are detailed in Ugarriza et al.72 and Vantaram et al.73

Fig. 13: Results of the GSEG algorithm.72

Fig. 14: Results of the MAPGSEG algorithm.73

Krinidis et al.74 instituted an approach for color texture image segmentation in a growing-merging schema based on a 3-D physics-based deformable surface model derived from the intensity and spatial information of images. Color image segmentation using the dual tree complex wavelet transform (DT-CWT) integrated with a growing-merging strategy was seen in Celik et al.75 The partitioning process is initiated by the DT-CWT computation, which enables multiscale edge detection; the acquired edges are subjected to binary morphological operations to locate suitable seed points. These seed points are employed in a region-growing approach to delineate an initial set of regions, which are fine-tuned in a subsequent merging process. Recently, Panagiotakis et al.76 devised a scheme for natural image segmentation in a growing-merging structure based on tree equipartition and Bayesian flooding processes for feature extraction. Additionally, several other hybrid region-based approaches77–84 have also been proposed.

In contrast to the segmentation approaches discussed in the last three subsections, energy-based segmentation techniques aim to minimize explicit cost functions. They can, in general, be classified into ones that explicitly utilize edge/contour-based energy (e.g., active contours) and ones that employ region-based energy to delineate different regions (e.g., the Mumford-Shah formulation and Bayesian techniques such as MRFs).

Active contours

Within the notion of using edge/contour-based energy, curve evolution methods involving active contours, better known as “evolving fronts,” have gained tremendous popularity over the last decade. From a high-level viewpoint, active contours can be categorized based on their implementation as being either parametric active contours (PACs) or geometric active contours (GACs).

PACs are generally represented in a Lagrangian formulation where the evolving curves are called “snakes,” a concept first introduced by Kass et al.85 A snake is defined as a curve or deformable spline $v(s) = [x(s), y(s)]$ that constantly moves/evolves based on a specific energy model $E(v)$ until it attains a shape that best fits an object (or multiple objects) of interest in the scene. This energy functional typically comprises internal ($E_{int}[v(s)]$) and external ($E_{ext}[v(s)]$) energy terms, as shown in Eqs. (5) and (6), whose combined effect drives a snake toward the boundary of an object such that the overall energy is minimized:

$$E(v) = \int_0^1 \{\, E_{int}[v(s)] + E_{ext}[v(s)] \,\}\, ds \tag{5}$$

$$E(v) = \frac{1}{2} \int_0^1 \left[ \alpha(s) \left| \frac{\partial v(s)}{\partial s} \right|^2 + \beta(s) \left| \frac{\partial^2 v(s)}{\partial s^2} \right|^2 \right] ds + \int_0^1 E_{ext}[v(s)]\, ds. \tag{6}$$

In the aforementioned equations, $(x, y)$ symbolize the coordinates of a snake in the image domain, while $s$ is proportional to its arc length. Furthermore, $E_{int}[v(s)]$ is contour dependent; it controls the snake’s tension and rigidity via the parameters $\alpha(s)$ and $\beta(s)$, respectively, and is minimized when a snake possesses a shape that is in close proximity to the object of interest. On the other hand, $E_{ext}[v(s)]$ is explicitly calculated in the image domain and is minimized when the physical location of a snake is along the boundaries of the object of interest. Among PACs, there exists a class of snakes called region-based active contours, so named because they are designed to be attracted to boundaries that distinguish homogeneous regions. Since its inception, it has been found that the traditional snake model suffers from two major drawbacks that derail it from converging on the desired object of interest. The first occurs when the contour initialization is far from the true object boundary, and the second when the object of interest has concave cavities. To overcome the first shortcoming, multiresolution methods and pressure forces, as well as several enhanced models such as balloon/distance snake models, have been proposed. Methods involving GVF and directional snake models have been offered to account for the second deficiency. PACs have several merits over classical segmentation techniques: 1. they are self-accommodative in their pursuit of a global energy minimum, 2. they can be easily molded via the $E_{ext}[v(s)]$ term as needed, 3. they can be designed to be image scale dependent, and 4. they are not biased toward any particular object/region shape and consequently are effective for segmenting/tracking objects in spatio-temporal dimensions. Major potential demerits of PACs include: 1. getting trapped in localized energy minima, 2. ignoring minor image features in the pursuit of global energy minimization, 3. focusing on only a few regions at a time, and 4. relying on stringent convergence criteria for high accuracy. Dumitras et al.86 proposed a three-step algorithm using angular-map-driven snakes for shape description and extraction of objects in color imagery. The first step involved the computation of an angular map using all color pixel vectors and a reference vector that characterizes color changes in the input image. This map is utilized as input to an edge detection protocol in the second stage of processing. Finally, the resultant edge map is presented to a snake model to segment the object of interest. Dumitras et al. experimented with distance and GVF snake models in their work. The use of PAC evolution based on a cubic smoothing spline for real-time segmentation of images was first seen in the work of Precioso et al.,87 who moreover demonstrated that the choice of a smoothing spline approximation instead of spline interpolation makes a snake more robust to noise variations. More recently, Ozertem et al.88 introduced a nonparametric snake-energy formulation that exploited the underlying kernel density estimate of the image data. Lankton et al.89 propounded a method for region-based active contours driven by localized region energy calculations for improved segmentation accuracy.
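One way to see how Eq. (6) drives a snake in practice is a discretized gradient-descent step: internal forces come from finite differences along the contour, and the external force pushes points downhill on an external energy image. The sketch below assumes constant alpha and beta, a closed contour, and nearest-neighbor sampling of the energy image, all simplifications relative to the surveyed formulations.

```python
import numpy as np

def snake_step(v, E_ext, alpha=0.1, beta=0.05, tau=0.1):
    """One explicit gradient-descent step on the snake energy of Eq. (6).

    v:     (n, 2) array of closed-contour points [x(s), y(s)].
    E_ext: 2-D external energy image (e.g., negative gradient magnitude,
           so that object boundaries are energy valleys).
    """
    # Internal forces: second difference (tension, weighted by alpha) and
    # fourth difference (rigidity, weighted by beta), periodic for a closed curve.
    d2 = np.roll(v, -1, axis=0) - 2 * v + np.roll(v, 1, axis=0)
    d4 = np.roll(d2, -1, axis=0) - 2 * d2 + np.roll(d2, 1, axis=0)
    f_int = alpha * d2 - beta * d4
    # External force: descend E_ext, sampled at the nearest pixel to each point.
    gy, gx = np.gradient(E_ext)
    ix = np.clip(v[:, 0].round().astype(int), 0, E_ext.shape[1] - 1)
    iy = np.clip(v[:, 1].round().astype(int), 0, E_ext.shape[0] - 1)
    f_ext = -np.stack([gx[iy, ix], gy[iy, ix]], axis=1)
    return v + tau * (f_int + f_ext)
```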

In comparison to PACs, GACs are implicitly represented in an Eulerian formulation where evolving curves are evaluated as the level sets of a distance function in two dimensions, a theory first introduced for image segmentation by Malladi et al.90 based on the work originally done by Osher et al.91 The key idea of a level set-based segmentation method is to commence with a closed contour $\Gamma$ in two dimensions, which is eventually made to propagate in a direction orthogonal to itself at a specific speed $F$, driven by a higher dimensional scalar function defined over the input image. Thus the evolving front at any location $(x, y)$ is derived as the zero level set of the aforementioned scalar function at time instant $t$, mathematically represented as:

$$\Gamma(t) = \{\, (x, y) \;|\; \Phi(x, y, t) = 0 \,\}. \tag{7}$$

Employing the chain rule on Eq. (7) and performing specific algebraic simplifications, the evolution of $\Phi$ (given that $\Phi(x, y, t) = 0$ on the front) can be expressed as:

$$\frac{\partial \Phi}{\partial t} + F\, |\nabla \Phi| = 0. \tag{8}$$
Equation (8) is popularly referred to as the level set equation and serves as a useful tool to track the evolution of contours in images. Figure 15 shows segmentation results obtained using an open-source level set toolbox92 with default parametric settings. The primary virtue of GACs over alternate contour energy-based approaches is that their implicit boundary formulation can efficiently undergo topological changes such as splitting or merging. Consequently, GACs are better suited for shape-invariant multiregion/object segmentation. A secondary asset of GACs over conventional schemes is their nonparametric nature, which allows them to be generically used for disparate datasets.
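A single explicit time step of Eq. (8) can be sketched as follows; the central-difference gradient and fixed time step are illustrative simplifications, since practical level set solvers typically rely on upwind differencing and periodic reinitialization of $\Phi$ for numerical stability.

```python
import numpy as np

def level_set_step(phi, F, dt=0.1):
    """One explicit Euler step of the level set equation, Eq. (8).

    phi: 2-D scalar function whose zero level set is the front Gamma.
    F:   front speed (a scalar or a 2-D array of per-pixel speeds).
    """
    gy, gx = np.gradient(phi)                  # central differences
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)      # |grad(phi)|
    # d(phi)/dt = -F * |grad(phi)|  =>  forward Euler update:
    return phi - dt * F * grad_mag

# After each step, the evolved front is the zero crossing of phi:
# Gamma(t) = {(x, y) | phi(x, y, t) = 0}.
```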