Data Mining

Data mining has become respectable, useful, even essential.

Data mining is based on highly automated statistical algorithms that are able to identify reproducible patterns in “wide” data sets. Wide data sets have many columns (or variables), often more columns than rows.

If you have any of the following problems, then data mining can help:
• Your models fit well when you build them, but do poorly when implemented.
• You have so many variables, you donʼt know where to begin.
• Your predictions just are not accurate enough to beat the competition.

Rather than building a model that relies on a few carefully chosen measurements in an experiment, data mining commonly involves a search for patterns in a wide data set. These searches might consider 100,000 or more features, looking for the few that predict the response. Data mining is also useful if you donʼt have the worldʼs largest data set and just want to make better use of the information you do have.
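To see why a search across many features calls for care, here is a minimal Python sketch. It is an illustration only, not part of the course materials, which use JMP and R. It screens pure-noise features against a pure-noise response, so nothing in the data truly predicts anything. The function names, sample sizes, and seed are all assumptions chosen for the example. Any strong correlation found in the training rows is an artifact of the search itself, and there is no reason to expect it to reproduce on the holdout rows.

```python
import random
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen_noise(n_rows=50, n_features=2000, seed=1):
    """Pick the noise feature most correlated with a noise response.

    Hypothetical setup: a "wide" data set with far more columns
    (n_features) than rows (n_rows), all independent standard normals.
    Returns (training correlation, holdout correlation) for the single
    feature that looked best on the training rows.
    """
    rng = random.Random(seed)
    y_train = [rng.gauss(0, 1) for _ in range(n_rows)]
    y_test = [rng.gauss(0, 1) for _ in range(n_rows)]

    best_train_r, best_test_r = 0.0, 0.0
    for _ in range(n_features):
        x_train = [rng.gauss(0, 1) for _ in range(n_rows)]
        x_test = [rng.gauss(0, 1) for _ in range(n_rows)]
        r = pearson(x_train, y_train)
        # Keep the feature with the largest absolute training correlation
        if abs(r) > abs(best_train_r):
            best_train_r = r
            best_test_r = pearson(x_test, y_test)
    return best_train_r, best_test_r


if __name__ == "__main__":
    train_r, test_r = screen_noise()
    print(f"best training correlation: {train_r:+.2f}")
    print(f"same feature on holdout:   {test_r:+.2f}")
```

Raising n_features while holding n_rows fixed tends to make the best training correlation look stronger, even though every feature is noise. This is one way a model can fit well when built and do poorly when implemented.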

Learn data mining by focusing on applications.

This course takes you through cases that use data mining to solve real-world problems like these:
• Identify patients who are most at risk of a disease
• Pick out the best prospective job candidates
• Predict which credit applications are fraudulent

Success with data mining requires more than fitting a model. The analyses of these and other cases start by identifying the problem, continue by digging through the relevant data, and wrap up with a discussion of how to present the results to others. Better predictions wonʼt solve any problems unless you can present those results in a way that others can grasp, appreciate, and act on.

The course leaves you able to start using these methods right away. Data mining does not require exotic hardware or software. Once you understand how to use regression for data mining, youʼll be able to appreciate the strengths and weaknesses of newer methods. With that in mind, the class starts with a data minerʼs view of regression and then moves to neural networks, classification and regression trees, boosting, random projections, and support vector machines. Real-time demonstrations in class use JMP (the interactive software from SAS) and R.

WHO SHOULD ATTEND?

This course is designed for researchers and data analysts with a modest statistical background who want to use data mining methods on their own data sets. No previous background in data mining is necessary. But participants should have a good working knowledge of the basic principles of statistical inference (e.g., standard errors, hypothesis tests, confidence intervals), and should also have a good understanding of the basic theory and practice of linear regression. Some acquaintance with logistic regression is also helpful.

Speaker and Presenter Information

Robert Stine, Ph.D., is Professor of Statistics in the Wharton School of the University of Pennsylvania. His research spans a variety of areas with many practical applications, ranging from forecasting to spatial models to fundamentals of multiple testing.