Book Reviews

Data Mining: Multimedia, Soft Computing, and Bioinformatics

Jiebo Luo

Kodak Research Laboratories.

J. Electron. Imaging. 15(1), 019901 (March 03, 2006). doi:10.1117/1.2179076
History: Published March 03, 2006

Data mining is truly the destined child of the explosive demand in data analysis and the rapid advance in computing technology. Since the 1970s, statisticians have been using computers to prove or disprove hypotheses on collected data, for example, using linear regression in drug approval and credit approval. However, with the explosion in the amount and dimensionality of data, the hypothesize-and-test paradigm becomes intractable.

The increasing ability of computers to store vast amounts of data is a major driver in the development of data mining. It is now common to store and query terabytes and even petabytes of data in sophisticated database management systems. However, human analysis of such massive data, even with the help of sophisticated visualization mechanisms, is akin to searching for needles in a haystack.

A key enabler in data mining’s development is pattern recognition and machine learning. As opposed to statistical techniques that require the user to build a hypothesis first, modern artificial intelligence algorithms are capable of automatically analyzing data and identifying relationships among attributes and entities in the data. This allows domain experts, who are not necessarily statisticians, to understand the relationship between the attributes and the classes.

As a result of these developments, data mining flourished during the past decade. Retail companies eagerly applied complex analytical capabilities to their data to increase their customer bases. The financial community found trends and patterns to predict fluctuations in interest rates, stock prices, and economic demand. For researchers, engineers, and students in the field of signal and image processing, the two main draws of data mining, naturally, are multimedia and bioinformatics.

Owing to both its theoretical and its practical appeal, conferences and workshops dedicated to data mining also mushroomed. At the same time, a text on data mining rapidly became a necessity in computer science and electrical engineering curricula. Data mining actually encompasses several technologies, including data management, statistics, machine learning, pattern recognition, and visualization.

There have been a number of books dedicated to data mining. A great number of them are nontechnical in that they dwell on the hype without getting much into how data mining algorithms actually work. There are also a couple of technical textbooks on data mining that are in fact mistitled books on machine learning, even though the latter is a major part of data mining. This book is among the newer texts, covering a wide array of topics for a good overview while providing enough details on the key elements and aspects, albeit with certain personal biases. The book takes pride in being the “first title to ever present soft computing approaches and their application in data mining, along with the traditional hard-computing approaches,” addressing “the principles of multimedia data compression techniques (for image, video, text) and their role in data mining.”

Chapter 1 provides an excellent introduction to the basics and major applications of data mining. In a sense, this chapter is a condensed version of the entire book, making it perfectly suitable for readers who are new to the field to get a bird’s-eye view of the field.

Chapter 2 introduces soft computing and many of its tools, including the three main ones: fuzzy sets, artificial neural networks, and genetic algorithms. The authors also discuss the role of hybrid systems. One noticeable omission here is Bayesian belief networks, which are a powerful knowledge discovery tool that is different from traditional hard-computing approaches (belief networks are only glossed over in a subsection of Chap. 5).

One thing unique about this text is that Chap. 3 is devoted to multimedia data compression. While the authors make a valid point that data compression is an integral part of a data mining system, a separate chapter on this subject seems a bit out of place; one could argue for the inclusion of database management by the same token. One danger in overemphasizing data compression is that, while data reduction is an important preprocessing task in data mining, unseasoned readers may get the wrong idea that data compression always leads to good discrimination. Take principal component analysis (PCA) as an example: PCA is aimed at energy compaction and is therefore suitable for data compression. However, while PCA is widely used in pattern recognition problems for data reduction, it is not necessarily the right means of feature extraction for all data mining tasks. It is easy to find cases where energy compaction is irrelevant to the discrimination task; features that have low variances can be effective for classification but are easily overwhelmed in PCA.
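The PCA caveat can be made concrete with a minimal sketch on made-up two-dimensional data. Everything here is an illustrative assumption rather than material from the book: the toy data, the function names, and the closed-form angle used to find the leading axis of a 2x2 covariance matrix. The first feature has a large variance but is distributed identically in both classes; the second has a tiny variance yet separates the classes completely.

```python
import math

def first_principal_axis(points):
    """Unit vector along the direction of largest variance (2-D data only)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Orientation of the major axis of a symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

def class_gap(class_a, class_b, axis):
    """Distance between the two class means after projecting onto `axis`."""
    proj_a = [p[0] * axis[0] + p[1] * axis[1] for p in class_a]
    proj_b = [p[0] * axis[0] + p[1] * axis[1] for p in class_b]
    return abs(sum(proj_a) / len(proj_a) - sum(proj_b) / len(proj_b))

# Hypothetical toy data: feature 0 spans [-10, 10] in both classes,
# feature 1 is +0.1 for class A and -0.1 for class B.
class_a = [(float(x), 0.1) for x in range(-10, 11)]
class_b = [(float(x), -0.1) for x in range(-10, 11)]

pc1 = first_principal_axis(class_a + class_b)
pc2 = (-pc1[1], pc1[0])  # the orthogonal, low-variance direction

gap_pc1 = class_gap(class_a, class_b, pc1)
gap_pc2 = class_gap(class_a, class_b, pc2)
print("class separation along PC1:", gap_pc1)
print("class separation along PC2:", gap_pc2)
```

In this toy setup, keeping only the first principal component retains nearly all of the variance while discarding the one direction along which the classes differ.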

The principles of string matching are described in Chap. 4, along with a number of classic algorithms such as finite automata and the Boyer-Moore algorithm. String matching is expected to be quite useful in bioinformatics (e.g., DNA sequence analysis) and certain multimedia applications involving text and symbols.
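For readers unfamiliar with this family of algorithms, the following is a minimal sketch of the Horspool simplification of Boyer-Moore, which uses only a bad-character shift table. It is not the full Boyer-Moore algorithm and not the book's presentation; the function name and the DNA-like test string are assumptions chosen for illustration.

```python
def horspool_search(text, pattern):
    """Return the index of the first occurrence of `pattern` in `text`, or -1.

    Horspool variant: on a mismatch, slide the window by a distance that
    depends only on the text character aligned with the pattern's last position.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # For each character (except the pattern's final one), the distance from
    # its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Characters absent from the table allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return -1

print(horspool_search("GATTACAGATTACA", "TACA"))
```

The appeal of this approach for long sequences is that many text positions are never examined, since a mismatch can skip up to the full pattern length.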

The core of the book is in Chaps. 5 through 8, concentrating on classification, clustering, and rule mining. Many of these topics and techniques have been covered to various extents by numerous other books on pattern recognition and machine learning. Recognizing this, the authors strive to incorporate new algorithms and results based on soft computing and advanced signal processing techniques. For classification, Chap. 5 goes through decision trees, Bayesian classifiers, instance-based learners (e.g., the k-NN classifier), support vector machines, and fuzzy decision trees. For clustering, Chap. 6 discusses distance measures and scalable clustering algorithms before turning the readers’ attention to soft-computing approaches based on fuzzy sets, neural networks, rough sets, and evolutionary algorithms. This chapter also covers clustering with categorical attributes (STIRR, ROCK, and c-mode) and hierarchical symbolic clustering. Association rules are discussed in Chap. 7, including hypothesis generation and test methods, depth-first search methods, interestingness rules, as well as implementations such as multilevel rules, online rule generation, rule generalization, and scalable mining of rules. A number of other variants as well as fuzzy association rules are described toward the end. Finally, rule mining with soft computing, ranging from connectionist rule generation models to a number of modular hybrid systems, is singled out in a separate Chap. 8.
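As a small illustration of what an association rule measures, the sketch below computes the two standard quantities, support and confidence, for a rule over a handful of transactions. The market-basket data and the function names are hypothetical and are not drawn from the book.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so report zero rather than divide
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

# Hypothetical transactions, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print("support(bread, milk):", support(baskets, {"bread", "milk"}))
print("confidence(bread -> milk):", confidence(baskets, {"bread"}, {"milk"}))
```

Scalable rule mining methods differ mainly in how they avoid evaluating these counts for every possible itemset.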

The two main application arenas of data mining are discussed at good length in Chaps. 9 and 10. In particular, Chap. 9 deals with text mining, image mining, video mining, and web mining issues, all in the light of content-based information retrieval. The authors’ notion of data mining in the compressed domain is mostly relevant to multimedia. Bioinformatics, the emerging and booming hotbed of data mining, is introduced in the final chapter, which covers biological and information science aspects, microarray data clustering, and association rules for revealing connections between genes, gene expression levels, and disease. The role of soft computing is emphasized for the last time in predicting protein structures and classifying gene expression data.

This is not a “cookbook” that provides pseudocode enabling anyone to implement data mining algorithms. Instead, this book is a well-written guide for understanding the fundamentals. Those skilled in the art and practice of data mining may find the book limited in its level of detail, though the extensive references would allow them to fill in the gaps. The lack of exercises and problems, which are critical components of a textbook, is mitigated somewhat by examples integrated with the text.

The authors state that this book may be used as a part of a graduate-level course or as a reference book for professionals and researchers. I agree with this recommendation. It would not be appropriate to use the book as the main text for electrical engineering or computer science students, mainly because it does not contain exercises. It could suit nonengineering students because the material is well organized and can easily be broken into a number of lecture plans and study materials. Certainly, readers need to have an adequate background in college-level mathematics and basic statistics and probability theory. The instructor will need to devise problems and exercises, while perhaps providing a toolbox for the students to acquire some level of hands-on experience. I would recommend this book for students and others who wish to gain a good understanding of data mining. In short, this book is an excellent primer on the subject of data mining with an accessible introduction to fundamental and advanced data mining technologies. One criticism of this and almost all data mining texts is that data processing, data interpretation, and data visualization form a tripod for data mining, yet visualization certainly warrants more attention and coverage than it receives.

Jiebo Luo is a senior principal scientist with Kodak Research Laboratories. His research interests include image processing, pattern recognition, computer vision, and multimedia communication. He has authored over 100 technical papers and holds over 30 granted US patents. Dr. Luo is a senior member of IEEE and an active participant in professional activities, including the editorial boards of journals and the technical committees of conferences. He is also an adjunct faculty member at the Rochester Institute of Technology and the mentor of over a dozen graduate students from various universities.

, 424 pages. ISBN . John Wiley & Sons, Hoboken, New Jersey (2003), hardcover.

Citation

Sushmita Mitra; Tinku Acharya and Jiebo Luo
"Data Mining: Multimedia, Soft Computing, and Bioinformatics", J. Electron. Imaging. 15(1), 019901 (March 03, 2006). http://dx.doi.org/10.1117/1.2179076

