Statistics 487:
Topics in Statistics:
Multivariate statistics through Data Mining
Spring 2007
Office Hours: MWF 8:30-11:30 AM (plus appointments)
Description: How can we find meaning in giant data sets? Multivariate Statistics and Data Mining are both ways of attacking them. Through the use of appropriate algorithms and matrix manipulation, we will find what we can.
Objectives: Students who successfully complete this course will:
- Gain an appreciation for how matrix algebra is used in statistics.
- Investigate the limitations of hypothesis testing for large and complex data sets.
- Have a basic understanding of a variety of multivariate statistical methods using modern algorithms, tools and applications.
- Achieve a moderate level of proficiency with the open source statistical software R.
- Synthesize knowledge learned in this class to examine real world data sets from life sciences, economics, psychology, and/or local Truman assessment data.
Prerequisite: A Statistics course numbered 290 or higher and one of Math 285 or 357. Computer programming skills are helpful but not required.
Textbook: Izenman, Alan (2007). Data Mining through Multivariate Statistics. This is a draft textbook, so you will get PDF files/photocopies of it. On the plus side, it's cheap. On the other hand, we get what we pay for, so be prepared for errors. Think of us as beta-testers.
Technology: This course will use various software packages, and using multiple software packages on a single data set is one of the skills we will develop. The course will use R (an open source statistics package) for day-to-day computation, matrix manipulation, and statistics. Mathematica, Java, C++, and/or SPSS are likely to be used occasionally in class, or for parts of some independent projects, while others may be coded in software unknown to the instructor. We might even use Excel sometimes.
Evaluation: Course grades are determined by combining all of the points for the semester. A mediocre student should expect a C, although I hope that none of you are mediocre students. I plan to use a straight percentage system, but I'll leave open the possibility of a small curve applied at the end.
Grades will be based on the following 1000-point system:
| Type of Grade | Number | Points | Total |
| Homework Assignments | 5 | 50 | 250 |
| Participation | | | 100 |
| Seminar | | | 100 |
| Midterm Exams | 2 + redo | 100 | 250 |
| Project / Presentation / Final | | | 300 |
Homework: Homework will be assigned periodically and is due at the beginning of class on the due date (usually a Wednesday). Homework turned in late will receive a penalty of 20% if it is turned in by the next class, 50% if it is up to a week late, and 10% more each week after that. It will also be graded more harshly, with less benefit of the doubt. Once assigned, homework will be posted on the U: Drive (and maybe on BlackBoard).
Students are invited to meet with each other to talk about their homework problems and other topics in the class. However, I would like you to write up your own solutions, without copying directly from another person's paper. Always explicitly list your homework partners on your homework. We'll try to have a couple of group homework meetings, probably Monday evenings.
Seminar: You and a partner will teach class for a day on a topic or problem. You will get a list of topics to choose from, but it is fairly open. You will be expected to have a handout, slides or good chalkboard skills, and your classmates will expect a relevant homework problem from your lecture.
Midterm Exams: This course will have two mid-semester exams. The first will be an in-class exam on Wednesday, February 21. The second, a take-home exam, will be due April 4. You will be able to revise and resubmit the take-home exam as a third exam grade.
Project/Presentation: This class has a project, tentatively due April 13. For this project, you (with one or two others) will take a giant data set (you can find your own, or I will have several), and you will find meaning in it. The final exam will consist of project presentations.
Participation: It is assumed that everyone will come to class all of the time, read ahead, think about things, and speak in class (sometimes about non-class things). If you are in class most of the time, almost never talk in class, and work on homework by yourself, expect a C+ in participation.
Honor: It is also assumed that everyone will behave in a trustworthy and honest manner throughout this course. I give a great benefit of the doubt as long as possible, but this class has no room for those who won't play fair. If you have a question about the "house rules," ask me. Any cheating, plagiarism, or other trust-breaking will result in failing the course. Period.
Tentative schedule: (subject to change)
| Week | Date | Topics | Text |
| 1 | January 8, 10 and 12 | Introduction and Databases | 1-2 |
| 2* | January 17 and 19 | Review of Matrices | 3 |
| 3 | January 22, 24 and 26 | Regression made easy | 5-6 |
| 4 | Jan 29, 31, and Feb 2 | Principal Components | 7 |
| 5* | February 5 and 9 | Disc. Analysis | 8 |
| 6 | February 12, 14, and 16 | More DA + Factor Analysis | |
| 7 | February 19, 21, and 23 | Q-Sorts + Exam 1 | |
| 8 | Feb. 26, 28, and Mar 2 | Decision Trees | 9 |
| | Spring Break | Spring Break | |
| 9 | March 12, 14, and 16 | Artificial Neural Networks | 10 |
| 10 | March 19, 21, and 23 | Cluster Analysis | 12 |
| 11 | March 26, 28, and 30 | more Cluster Analysis | |
| 12 | April 2, 4, and 6 | Additional Topics | |
| 13* | April 11 and 13 | Seminar Presentations | |
| 14 | April 16, 18, and 20 | Seminar Presentations | |
| 15 | April 23, 25, and 27 | Ghost Week/Conclusion | |
| | Monday, April 30 | Final Exam / Project Presentations | 3:30-5:30 |
Ghost Week: Dr. Alberts and his wife (actually, mostly her) are planning to have a baby sometime around April 15. He will be taking a week off around then (although the schedule marks it at the end). Be prepared to work on your project for a week with minimal supervision, although Dr. Beck will probably be able to help. Your seminar dates may also have to slide.