# STATS 784 : Statistical Data Mining

## Science

### Course Prescription

Data cleaning, missing values, data warehouses, security, fraud detection, meta-analysis, and statistical techniques for data mining such as regression and decision trees, modern and semiparametric regression, neural networks, statistical approaches to the classification problem.

### Course Overview

This course was the first in the department on data mining and was (and still is) intended to be both practical and theoretical. Anybody wanting to use R for regression or classification on big data sets should benefit, as should research students. We will look at some statistical theory and practical aspects of data mining, which provides an opportunity to encounter some trendy methods such as random forests. There will be a significant coursework component, most of it computer work. I might try to let you work with at least one 'large' data set. Students need to do some basic R programming and are required to have a good background in statistics, both theoretical and computational (R). The early diagnostic quiz will help assess your background: students MUST have a reasonably strong STATISTICS background, especially those coming from computer science. Note that the following topics from the Course Prescription will not be covered: data warehouses, security and meta-analysis.

### Course Requirements

Prerequisite: 15 points from STATS 210, 225, and 15 points from STATS 330, 762

### Capabilities Developed in this Course

• Capability 1: Disciplinary Knowledge and Practice
• Capability 2: Critical Thinking
• Capability 3: Solution Seeking
• Capability 4: Communication and Engagement

### Learning Outcomes

By the end of this course, students will be able to:
1. Master fundamental material such as binary prefixes and big-Oh notation, and appreciate the role of data science in society. (Capability 1 and 4)
2. Critically evaluate and explain fundamental statistical concepts such as under- and over-fitting, parametric and nonparametric methods, and the curse of dimensionality, within the context of big data. (Capability 2, 3 and 4)
3. Competently fit 2 or 3 methods, such as decision trees and generalized additive models, including smoothing as a very useful tool. (Capability 2, 3 and 4)
4. Use R efficiently to solve big data problems, including traditional regression using S formulas, and graphics. (Capability 1, 3 and 4)
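
As a rough illustration of outcomes 1 and 4, the following minimal R sketch contrasts binary and decimal prefixes when sizing a data set in memory, and fits a regression with an S formula. The data set dimensions are hypothetical, and the built-in `mtcars` data and the choice of predictors are assumptions made purely for illustration; they are not taken from the course materials.

```r
# Binary prefixes: 1 KiB = 2^10 bytes, 1 MiB = 2^20 bytes, 1 GiB = 2^30 bytes.
# Decimal (SI) prefixes: 1 kB = 10^3 bytes, 1 MB = 10^6, 1 GB = 10^9.
n_rows <- 1e6                    # hypothetical number of rows
n_cols <- 50                     # hypothetical number of numeric columns
bytes  <- n_rows * n_cols * 8    # each double-precision value takes 8 bytes
gib    <- bytes / 2^30           # size in gibibytes (binary prefix)
gb     <- bytes / 1e9            # size in gigabytes (decimal prefix)
cat(sprintf("%.3f GiB versus %.3f GB\n", gib, gb))

# Traditional regression with an S formula: response ~ predictors.
# mtcars ships with R; mpg, wt and hp are columns of that data frame.
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(coef(fit))                 # intercept plus one slope per predictor
```

The same `response ~ predictors` formula syntax is accepted by many R modelling functions, not only `lm()`.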

### Assessments

| Assessment Type | Percentage | Classification |
|---|---|---|
| Assignments | 28% | Individual Coursework |
| Test | 30% | Individual Test |
| Final Exam | 40% | Individual Examination |
| Quizzes | 2% | Individual Coursework |

A 2% online quiz will be given during Week 1 or 2 as a revision of the statistical material needed to pass the course, that is, the statistical skills and background knowledge that students need in order to cope with the content. Students with the prerequisites should not find this difficult. Those doing poorly are strongly urged to do the course in the following year and/or another course during the current year.

### Tuākana

Tuākana Science is a multi-faceted programme for Māori and Pacific students providing topic specific tutorials, one-on-one sessions, test and exam preparation and more. Explore your options at
https://www.auckland.ac.nz/en/science/study-with-us/pacific-in-our-faculty.html
https://www.auckland.ac.nz/en/science/study-with-us/maori-in-our-faculty.html

### Key Topics

• What is data mining? A wide-ranging overview including unit-prefixes, big-Oh, missing values and multiple testing.
• Handling large data sets in R and Linux. The focus is on efficient computing and R programming.
• Data visualization (graphics) for big data sets.
• Decision trees. Many important statistical concepts are illustrated using regression and classification trees. We will also cover random forests.
• Vector generalized linear and additive models (VGLMs/VGAMs), especially for regression, estimation and prediction. VGLMs are a superclass of GLMs (GLMs are taught in STATS 20x and 330).
• Generally-altered, -inflated, -truncated and -deflated (GAITD) regression, especially for heaped and seeped data.
• The classification problem (time allowing).
• The following semi-new topics will be interleaved with the above: variable selection via the lasso, dimension-reduction, gradient boosting.
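
The tree-fitting and over-fitting ideas in the list above can be sketched on a small scale as follows. This is a minimal sketch only: it assumes the `rpart` package (a recommended package included with standard R installations) and the built-in `iris` data, neither of which is confirmed as what the course uses, and the seed and the 100/50 split are arbitrary choices.

```r
library(rpart)   # assumed to be installed; ships with standard R distributions

set.seed(784)                        # arbitrary seed, for reproducibility only
idx   <- sample(nrow(iris), 100)     # 100 rows for training, 50 held out
train <- iris[idx, ]
test  <- iris[-idx, ]

# Classification tree: Species modelled by all other columns (S formula).
tree <- rpart(Species ~ ., data = train, method = "class")

# Accuracy on the training rows and on the held-out rows.
# A large gap between the two would suggest over-fitting.
pred_train <- predict(tree, newdata = train, type = "class")
pred_test  <- predict(tree, newdata = test,  type = "class")
acc_train  <- mean(pred_train == train$Species)
acc_test   <- mean(pred_test  == test$Species)
cat("training accuracy:", acc_train, " held-out accuracy:", acc_test, "\n")
```

Evaluating on rows that were not used for fitting is what makes the comparison meaningful; the training accuracy alone tends to be optimistic.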

### Special Requirements

There is no plussage. Students should complete the assignments, test and exam as best they can. Attendance at lectures is expected for those in Auckland (including sitting the term test on campus). Lecture recordings will be placed on Canvas. There will be weekly tutorials, and these should be attended to obtain problem-solving practice.

This course is a standard 15 point course and students are expected to spend 150 hours per semester involved in each 15 point course that they are enrolled in. For this course you can expect 3 hours of lectures, a 1-hour tutorial, 2 hours of reading and thinking about the content and 5 hours of work on assignments and/or test preparation each week.

### Delivery Mode

#### Campus Experience

This course is primarily aimed at those in Auckland. Lectures will be recorded and the recordings placed on Canvas, so overseas students will need fast internet. For those in Auckland, attendance is expected at lectures but no credit is given for this. Other learning activities such as the weekly tutorials/labs will probably not be available as recordings. Attendance on campus is required for the test and/or exam for those in Auckland, and the (invigilated) test is scheduled to be held during a class time around Week 6 or 7. The activities for the course are scheduled as a standard weekly timetable.

### Learning Resources

Course materials are made available in a learning and collaboration tool called Canvas which also includes reading lists and lecture recordings (where available).

Please remember that the recording of any class on a personal device requires the permission of the instructor.

A coursebook will be provided on Canvas, chapter by chapter, as the course progresses; only this material will be examinable. It will be in PDF format. Suggested reading lists will be given at the end of each chapter, but these are entirely optional. A study guide at lecture 1 will list some optional overall background reading. Past exams and tests will be available, but note that the course has evolved over time and the material covered has changed significantly.

### Student Feedback

During the course Class Representatives in each class can take feedback to the staff responsible for the course and staff-student consultative committees.

At the end of the course students will be invited to give feedback on the course and teaching through a tool called SET or Qualtrics. The lecturers and course co-ordinators will consider all feedback.

Your feedback helps to improve the course and its delivery for all students.

Errors found in the previous year's notes are corrected for this year's notes. Improvements and updates to the notes are also made.

### Other Information

As mentioned elsewhere, depending on the proportion of the class who are overseas, we will try to have at least one invigilated test and exam on campus. Consequently there might be small differences in how students in NZ and those overseas are assessed, but I'll try to minimize any such differences.

The University of Auckland will not tolerate cheating, or assisting others to cheat, and views cheating in coursework as a serious academic offence. The work that a student submits for grading must be the student's own work, reflecting their learning. Where work from other sources is used, it must be properly acknowledged and referenced. This requirement also applies to sources on the internet. A student's assessed work may be reviewed against online source material using computerised detection mechanisms.

### Copyright

The content and delivery of content in this course are protected by copyright. Material belonging to others may have been used in this course and copied by and solely for the educational purposes of the University under license.

You may copy the course content for the purposes of private study or research, but you may not upload onto any third party site, make a further copy or sell, alter or further reproduce or distribute any part of the course content to another person.

### Inclusive Learning

All students are asked to discuss any impairment related requirements privately, face to face and/or in written form with the course coordinator, lecturer or tutor.

Student Disability Services also provides support for students with a wide range of impairments, both visible and invisible, to succeed and excel at the University. For more information and contact details, please visit the Student Disability Services’ website.

### Special Circumstances

If your ability to complete assessed coursework is affected by illness or other personal circumstances outside of your control, contact a member of teaching staff as soon as possible before the assessment is due.

If your personal circumstances significantly affect your performance, or preparation, for an exam or eligible written test, refer to the University’s aegrotat or compassionate consideration page.

This should be done as soon as possible and no later than seven days after the affected test or exam date.

### Learning Continuity

In the event of an unexpected disruption, we undertake to maintain the continuity and standard of teaching and learning in all your courses throughout the year. If there are unexpected disruptions the University has contingency plans to ensure that access to your course continues and course assessment continues to meet the principles of the University’s assessment policy. Some adjustments may need to be made in emergencies. You will be kept fully informed by your course co-ordinator/director, and if disruption occurs you should refer to the university website for information about how to proceed.

The delivery mode may change depending on COVID restrictions. Any changes will be communicated through Canvas.

### Student Charter and Responsibilities

The Student Charter assumes and acknowledges that students are active participants in the learning process and that they have responsibilities to the institution and the international community of scholars. The University expects that students will act at all times in a way that demonstrates respect for the rights of other students and staff so that the learning environment is both safe and productive. For further information visit Student Charter.

### Disclaimer

Elements of this outline may be subject to change. The latest information about the course will be available for enrolled students in Canvas.

In this course students may be asked to submit coursework assessments digitally. The University reserves the right to conduct scheduled tests and examinations for this course online or through the use of computers or other electronic devices. Where tests or examinations are conducted online remote invigilation arrangements may be used. In exceptional circumstances changes to elements of this course may be necessary at short notice. Students enrolled in this course will be informed of any such changes and the reasons for them, as soon as possible, through Canvas.

Published on 01/11/2022 09:38 a.m.