Informationsutvinning (Data mining) - 1DL105, 1DL111
Fall 2007
Contents
- News
- Literature
- Assignments
- Teachers
- Schedule
- Goal, content and prerequisites
- Organization and examination
- OH slides and compendium
- Reading instructions
- Miscellaneous information
- F.A.Q.
News
- 2008-01-14 - The results from the final exam can now be found here: examresults-071214.txt /KO
- 2007-12-12 - NOTE! For the final exam you are allowed to bring 1 A4 single-sided sheet of paper with your own hand-written notes. This notes sheet should be handed in together with your exam. /KO
- 2007-12-11 - Updated slides and additional exams are now available. /KO
- 2007-11-25 - NOTE!!! I will hold a lecture tomorrow, Friday, at 15.15-17.00 and move the tutorial to next week together with some rescheduled lectures. We will discuss next week's schedule in class tomorrow. /KO
- 2007-11-25 - NOTE!!! Tomorrow's lecture at 10.15-12.00 is cancelled since I still have not recovered from my stomach flu. I will get back with rescheduled lectures as soon as possible. /KO
- 2007-11-23 - NOTE!!! Sorry, I have to cancel today's lecture at 15.15-17.00 as well. I will book new lectures when we meet next week. /KO
- 2007-11-22 - NOTE!!! Today's lecture at 15.15-17.00 is cancelled due to sickness. Hope to be back tomorrow. /KO
- 2007-10-18 - Slides for chapter 1 and 2 updated / KO
- 2007-10-18 - The first day of class is 2007-10-25 at 10.15 in room 1211. Welcome! / KO
- 2007-10-18 - This page made accessible /KO
Literature
- Pang-Ning Tan, Michael Steinbach, and Vipin Kumar: Introduction to Data Mining (or the int. ed.), 1st Edition, Addison-Wesley, 2006 (available e.g. at Akademibokhandeln).
Additional Required Reading Material:
- Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226-231, 1996.
- Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: An Efficient Clustering Algorithm for Large Databases. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pages 73-84, June 1998.
- Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining Associations between Sets of Items in Large Databases. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pages 207-216, May 1993.
- Rakesh Agrawal, Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Databases, pages 487-499, September 1994.
- Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. (Also available in PDF.) In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pages 175-186, May 1995.
- Rakesh Agrawal and Ramakrishnan Srikant. Mining Sequential Patterns. In Proceedings of the International Conference on Data Engineering (ICDE), pages 3-14, March 1995.
- Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report 1999-66, Stanford University, 1998.
- David Gibson, Jon Kleinberg and Prabhakar Raghavan. Inferring Web Communities from Link Topologies. In Proceedings of ACM Hypertext'98: Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space - Structure in Hypermedia Systems, pages 225-234, June 1998.
Additional Recommended Reading Material:
- Tomasz Imielinski and Heikki Mannila. A Database Perspective on Knowledge Discovery. In Communications of the ACM, 39(11):58-64, November 1996.
- Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Web Mining: Information and Pattern Discovery on the World Wide Web. In Proceedings of the 9th International Conference on Tools with Artificial Intelligence, pages 558-567, November 1997.
- Rakesh Agrawal, Ramakrishnan Srikant. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the Fifth International Conference on Extending Database Technology (EDBT), pages 3-17, March 1996.
- Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the Seventh International World Wide Web Conference (WWW7). Also in the Journal of Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.
Privacy-related material:
- Data Mining: Staking a Claim on Your Privacy. Office of the Information and Privacy Commissioner, Ontario, 1998.
- David Banisar. Privacy and Human Rights 2000: An international survey of privacy laws and developments. Electronic Privacy Information Center, 2000.
- Pretty Poor Privacy: An Assessment of P3P and Internet Privacy. Electronic Privacy Information Center, June 2000.
Assignments
- Instructions and material for the database assignments are found here.
Teachers
- Kjell Orsborn, examiner, lecturer, email: kjell.orsborn at it dot uu dot se, phone: 471 1154, room 1321
- Per Gustafsson, course assistant, email: per.gustavsson at it dot uu dot se, phone: 471 3155, room 1310
- Tobias Lindahl, course assistant, email: tobias.lindahl at it dot uu dot se, phone: 471 3168, room 1310
Schedule
| No: | Subject: | Ch: | Tchr: |
| L1 | Introduction to data mining | | KO |
| L2 | Overview of data in data mining | | KO |
| T1 | Intro to MATLAB / Tutorial on assignment 1 | | PG, TL |
| L3 | Overview of data in data mining continued ... | | KO |
| L4 | Classification | | KO |
| L5 | Classification continued ... | | KO |
| L6 | Classification continued ... | | KO |
| T2 | Tutorial on assignment 2 | | PG, TL |
| L7 | Clustering: partition methods | | KO |
| L8 | Clustering continued: hierarchical methods | | KO |
| T3 | Tutorial on assignment 3 | | PG, TL |
| L9 | Association rules - frequent item sets | | KO |
| L10 | Association rules - fast algorithms and rule generation | | KO |
| L11 | Association rules - continued ... | | KO |
| T4 | Tutorial on assignment 4 | | PG, TL |
| L12 | Mining sequential patterns | | KO |
| L13 | Web content mining | | KO |
| L14 | Search engines | | KO |
| L15 | Data mining and privacy | | KO |
| Exam | Final exam in the "Skrivsalen" hall at Polacksbacken | | |
Goal, content and prerequisites
Goal: (to be completed)
Content: Data mining, or knowledge discovery from data repositories, has during the last few years emerged as one of the most exciting fields in computer science. Data mining aims at finding useful regularities in large data sets. Interest in the field is motivated by the growth of computerized data collections which are routinely kept by many organizations and commercial enterprises, and by the high potential value of patterns discovered in those collections. For instance, bar code readers at supermarkets produce extensive amounts of data about purchases. An analysis of this data can reveal previously unknown, yet useful information about the shopping behavior of the customers.
Data mining refers to a set of techniques that have been designed to efficiently find interesting pieces of information or knowledge in large amounts of data. Association rules, for instance, are a class of patterns that tell which products tend to be purchased together. There is currently a large commercial interest in the area, both for the development of data mining software and for the offering of consulting services on data mining.
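As a rough illustration of what an association rule expresses, the sketch below computes the support and confidence of a rule such as {bread} -> {butter} over a handful of shopping baskets. The basket data and the function names (`support`, `confidence`, `frequent_pairs`) are invented for this example and are not taken from the course material; it is a brute-force sketch, not one of the fast algorithms covered in the lectures.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets, invented for illustration only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, baskets) / count(antecedent, baskets)


def frequent_pairs(baskets, min_support):
    """Brute-force search for item pairs whose support is at least min_support."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


if __name__ == "__main__":
    print(support({"bread", "butter"}, transactions))
    print(confidence({"bread"}, {"butter"}, transactions))
    print(frequent_pairs(transactions, min_support=0.5))
```

In this toy data, bread and butter appear together in three of the five baskets, and three of the four baskets containing bread also contain butter. Enumerating every pair is only feasible for tiny inputs, which is why efficient algorithms matter for real data sets.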
In this course we explore how this interdisciplinary field brings together techniques from databases, statistics, machine learning, and information retrieval. We will discuss the main data mining methods currently used, including data cleaning, clustering and classification techniques, algorithms for association rule mining, text indexing and searching algorithms, how search engines rank pages, and recent techniques for web mining and for privacy-preserving data mining. Designing algorithms for these tasks is difficult because the input data sets are typically very large, and the tasks may be very complex. One of the main focuses in the field is the integration of these algorithms with relational databases and the mining of information from semi-structured data. We will examine the additional complications that come up in this case.
Topics covered:
- Introduction to Data Mining
- Classification Techniques
- Clustering Techniques
- Association Rules
- Web Mining
- Search Engines
- Data Mining and Privacy
Prerequisites: (to be completed)
Organization and examination
This course is organised as a series of lectures and tutorials, accompanied by a series of mandatory assignments (labs) to be solved with the help of a computer. Students carry out the practical assignments on their own, with support from the course assistants.
Most of the content of the course will be covered in the lectures and in the assignments, but it is nevertheless necessary to use your own time to read the course literature and to work with the course material and the assignments.
The course will have a total of four assignments: one on classification, one on clustering, one on association rules, and one on web mining. Students taking the course for 5 rather than 4 points will need to do an extra sub-assignment for the third assignment. You may work in pairs on all assignments. Assignment deadlines are strict but, if you really need it, you are allowed to be late on one (but only one) assignment. Besides the assignments, there will also be a written final examination, scheduled as given below.
To pass the course you must pass both the mandatory assignments and the written exam. The assignment part of the course is graded pass or fail for the complete set of assignments taken together (individual assignments that are incomplete or failed are normally returned to the student for completion). The grade on the written exam becomes the final grade for the course: Failed (U), Passed (G) or Passed with honour (VG) on the Swedish scale, or Failed (U), 3, 4 or 5 for the Civilingenjörsprogram. Exchange and master's students can also get ECTS grades.
The written exam (tentamen) is given on Friday, 2007-12-14, from ??.00 to ??.00 in the "Skrivsalen", Polacksbacken.
The time and place for the next exam are not available at the moment and will be announced later.
Examples of old exams and suggested solutions are available here: (to be completed)
Final exam 2005-12-19: dut-tenta051219.pdf
Final exam 2006-08-17: dut-tenta060817.pdf
Final exam 2006-12-15: Exam-DUT-061215.pdf
Some suggested solutions for the 2006-12-15 exam are found here.
Final exam 2007-04-16: Exam-DUT-070416-ans.pdf
Final exam 2007-08-16: Exam-DUT-070816-ans.pdf
OH slides
Lecture Slides
Reading instructions
- Reading instructions for the book "Introduction to Data Mining" by Tan, Steinbach, and Kumar (1st Edition, Addison-Wesley, 2006).
- Note that supplementary material, such as overhead slides and articles, is also part of the course material.
| Chapter: | Note: |
| 1 | All |
| 2 | 2.1, 2.4 |
| 3 | |
| 4 | 4.1, 4.2, 4.3, 4.5 |
| 5 | 5.1, 5.2 |
| 6 | 6.1, 6.2, 6.3, 6.4, 6.7 |
| 7 | 7.3, 7.4 |
| 8 | 8.1, 8.2, 8.3, 8.4 |
| 9 | 9.5 |
| Miscellaneous | paper 1, 2, 3, 4, 6, 7, 8 |
Miscellaneous information
F.A.Q.
Q: Is this section used for answering frequently asked questions?
A: Yes!
