Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related, but also healthcare data) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).
The concept of Data Mining is becoming increasingly popular as a business information management tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business Data Mining (e.g., Classification Trees), but Data Mining is still based on the conceptual principles of statistics including the traditional Exploratory Data Analysis (EDA) and modeling and it shares with them both some components of its general approaches and specific techniques.
However, an important general difference in the focus and purpose between Data Mining and the traditional EDA is that Data Mining is more oriented towards applications than the basic nature of the underlying phenomena. In other words, Data Mining is relatively less concerned with identifying the specific relations between the involved variables. For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables are not the main goal of Data Mining. Instead, the focus is on producing a solution that can generate useful predictions. Therefore, Data Mining accepts among others a "black box" approach to data exploration or knowledge discovery and uses not only the traditional EDA techniques, but also such techniques as Neural Networks which can generate valid predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based.
In healthcare areas, data mining has been widely used in bio-informatics; signal detection for bio-terrorist events, adverse events and product complaints reporting; disease registries; health insurance databases; and government (Medicare, Medicaid, VA, NIH, FDA, etc) databases. For bio-informatics, huge amounts of data from macro-arrays have been explored to find active compounds for specific therapeutic areas. For signal detection, unusual patterns are trended and alarmed among millions of reported observations. Others have multiple purposes such as finding relationships, select better cares for certain rare diseases, and forming better policies.
Data Mining is often considered to be "a blend of statistics, AI [artificial intelligence], and data base research", which until very recently was not commonly recognized as a field of interest for statisticians. Due to its applied importance, however, the field is emerging as a rapidly growing and major area in statistics where important theoretical advances are being made. (cf. International Conferences on Knowledge Discovery and Data Mining, co-hosted by the American Statistical Association). Please contact us to see how Walker Downey & Associates, Inc. Statistical Services can help your company through Data Mining.Case Studies