What Is Data Mining?

Data mining, as we use the term, is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules. We assume that the goal of data mining is to allow a corporation to improve its marketing, sales, and customer support operations through a better understanding of its customers.

Keep in mind, however, that the data mining techniques and tools described here are equally applicable in fields ranging from law enforcement to radio astronomy, medicine, and industrial process control. In fact, hardly any of the data mining algorithms were first invented with commercial applications in mind.

The commercial data miner employs a grab bag of techniques borrowed from statistics, computer science, and machine learning research. The choice of a particular combination of techniques to apply in a particular situation depends on the nature of the data mining task, the nature of the available data, and the skills and preferences of the data miner.

Data mining comes in two flavors: directed and undirected. Directed data mining attempts to explain or categorize some particular target field such as income or response. Undirected data mining attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes.
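To make the distinction concrete, here is a minimal sketch in Python. The records, field names, and the cutoff of 5 visits are all invented for illustration. The directed function is told which field is the target (response); the undirected function never looks at it and simply groups records that resemble one another, using a tiny two-group version of k-means as a stand-in for any clustering method.

```python
from collections import defaultdict

# Hypothetical customer records; field names and values are made up.
records = [
    {"age": 23, "visits": 2, "response": 0},
    {"age": 25, "visits": 3, "response": 0},
    {"age": 41, "visits": 9, "response": 1},
    {"age": 44, "visits": 8, "response": 1},
    {"age": 39, "visits": 10, "response": 1},
    {"age": 27, "visits": 1, "response": 0},
]


def directed_response_rate(rows, field, cutoff, target="response"):
    """Directed: relate one input field to a chosen target field."""
    totals = defaultdict(lambda: [0, 0])  # group -> [responders, count]
    for row in rows:
        group = "high" if row[field] >= cutoff else "low"
        totals[group][0] += row[target]
        totals[group][1] += 1
    return {group: hits / count for group, (hits, count) in totals.items()}


def undirected_groups(rows, fields, rounds=10):
    """Undirected: split records into two groups by similarity alone."""
    points = [tuple(row[f] for f in fields) for row in rows]
    centers = [min(points), max(points)]  # simple deterministic starting centers
    labels = [0] * len(points)
    for _ in range(rounds):
        # Assign each record to its nearest center (squared distance).
        labels = [
            min((0, 1), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of the records assigned to it.
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


rates = directed_response_rate(records, "visits", cutoff=5)
labels = undirected_groups(records, ["age", "visits"])
print(rates)   # response rate for low-visit and high-visit customers
print(labels)  # group number for each record, found without using "response"
```

In this toy data the two groups found without a target happen to line up with the responders and non-responders, but nothing in the undirected step guarantees that; deciding what the groups mean is left to the data miner.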

Data mining is largely concerned with building models. A model is simply an algorithm or set of rules that connects a collection of inputs (often in the form of fields in a corporate database) to a particular target or outcome. Regression, neural networks, decision trees, and most other data mining techniques are methods for creating models.
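As a rough illustration of a model in this sense, the sketch below fits a straight line by ordinary least squares, the simplest form of regression, and wraps the result in a function that maps an input to an estimated outcome. The input and target names (months_as_customer, monthly_spend) and their values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def make_model(slope, intercept):
    """The model itself: a rule connecting an input to a target value."""
    def model(x):
        return slope * x + intercept
    return model


# Hypothetical training data: one input field and one target field.
months_as_customer = [1, 2, 3, 4, 5]
monthly_spend = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months_as_customer, monthly_spend)
model = make_model(slope, intercept)
estimate = model(6)  # estimated spend for a customer in month 6
print(slope, intercept, estimate)
```

The fitting step and the resulting model are kept separate on purpose: the technique (here, regression) creates the model, and the model is what is later applied to new records.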

Under the right circumstances, a model can result in insight by providing an explanation of how outcomes of particular interest, such as placing an order or failing to pay a bill, are related to and predicted by the available facts. Models are also used to produce scores.

A score is a way of expressing the findings of a model in a single number. Scores can be used to sort a list of customers from most to least loyal, from most to least likely to respond, or from most to least likely to default on a loan.

The data mining process is sometimes referred to as knowledge discovery or KDD (knowledge discovery in databases). We prefer to think of it as knowledge creation.
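The use of scores to rank customers can be sketched in a few lines of Python. The scoring rule below is not the output of any real model; its weights, the field names, and the customer records are hypothetical, chosen only to show how a single number per customer makes a list sortable.

```python
# Hypothetical customers; field names and values are invented.
customers = [
    {"id": "A", "recent_orders": 5, "late_payments": 0},
    {"id": "B", "recent_orders": 1, "late_payments": 2},
    {"id": "C", "recent_orders": 3, "late_payments": 1},
]


def response_score(customer):
    """Collapse a customer's fields into one number between 0 and 1.

    The weights are hand-picked assumptions standing in for a fitted model.
    """
    raw = 0.15 * customer["recent_orders"] - 0.2 * customer["late_payments"]
    return max(0.0, min(1.0, raw))  # clamp to the range [0, 1]


# Sort from most to least likely to respond, according to the score.
ranked = sorted(customers, key=response_score, reverse=True)
ranked_ids = [c["id"] for c in ranked]
print(ranked_ids)
```

Because the score is a single number, the same sorted list can be cut at any depth, for example to select only the top portion of customers for a mailing.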

What Tasks Can Be Performed with Data Mining?

Many problems of intellectual, economic, and business interest can be phrased in terms of the following six tasks:

  • Classification
  • Estimation
  • Prediction
  • Affinity grouping
  • Clustering
  • Description and profiling

The first three are all examples of directed data mining, where the goal is to find the value of a particular target variable. Affinity grouping and clustering are undirected tasks where the goal is to uncover structure in data without respect to a particular target variable. Profiling is a descriptive task that may be either directed or undirected.