Data Science and Machine Learning

2019/2020
Academic Year
ENG
Instruction in English
3
ECTS credits
Delivered at:
Department of Applied Mathematics and Informatics (Faculty of Informatics, Mathematics, and Computer Science (HSE Nizhny Novgorod))
Course type:
Elective course
When:
3rd year, modules 3 and 4

Instructor


Durandin, Oleg

Course Syllabus

Abstract

The course introduces students to the basic approaches and principles of data mining, the main machine learning methods and their limitations, and the main methods of quality evaluation.
Learning Objectives

  • The purpose of the course is to familiarize students with the basic principles and methods of data analysis.
Expected Learning Outcomes

  • Has an understanding of the spectrum of machine learning tasks
  • Understands the most important principles of EDA, is able to prepare data for machine learning algorithms
  • Knows how to train linear regression, understands its quality metrics
  • Is able to train polynomial regression and understands its quality metrics; can identify overfitting and underfitting and estimate quality with cross-validation
  • Is able to train logistic regression and kNN, understands their quality metrics
  • Understands classification based on decision trees and ensemble models, knows how to train them
  • Understands classification based on SVM and the main parameters of the model, is able to train the model
  • Understands the "curse of dimensionality", is able to reduce dimensionality with various methods
  • Has an understanding of the clustering problem and various algorithms, is able to train clustering models, understands clustering evaluation
Course Contents

  • Introduction. Examples of practical tasks.
    Substantive problem statements in data mining and machine learning. Links to other fields and to practice. The basic terminology of machine learning tasks. The concepts of supervised and unsupervised learning. Overview of study materials and online resources for the course.
  • Exploratory Data Analysis (EDA).
    Exploratory data analysis as the first phase of data mining. The basic properties of the data. Measures of central tendency. Probability distribution. Concepts of outliers and anomalies in data. Data preprocessing. The problem of missing values in the data.
  • Linear regression.
    Supervised machine learning problems. The concept of a regression problem. Linear regression. Methods for solving regression problems: the least squares method and gradient descent. The concepts of a loss function and of quality metrics: mean absolute error and mean squared error.
  • Polynomial regression. The concept of overfitting and regularization.
    The concepts of overfitting and underfitting. The bias-variance trade-off as a major machine learning problem. The concept of training and test sets. Cross-validation. Examples of overfitting and underfitting. Introducing nonlinear features into the regression problem. Polynomial regression. "Symptoms" of overfitting. Methods for dealing with overfitting. The concept of regularization. Lasso and ridge regularized regression.
  • Classification problem. Logistic regression. The kNN algorithm.
    Classification problem. Quality metrics in classification problems. The problem of imbalanced data. The concepts of accuracy, precision and recall. Multiclass classification. Logistic regression. The concept of the decision boundary. The kNN (k-nearest neighbors) algorithm.
  • Classification algorithms: decision trees and ensembles.
    Decision trees. The process of building a decision tree. The concept of node splitting and splitting criteria. Problems with using decision trees as a classification algorithm. The concept of an ensemble and ensemble algorithms. Bagging and pasting. Bootstrap. Random forest. The concept of stacking. Boosting: gradient tree boosting and the AdaBoost algorithm.
  • Support vector machine.
    Support Vector Machine (SVM) algorithm. The concept of support vectors. The concept of a margin. Regularization in SVM. The kernel method. RBF kernel. Polynomial kernels. Special types of kernels.
  • Unsupervised machine learning tasks. Dimensionality reduction.
    The concept of the "curse of dimensionality". The need for dimensionality reduction. The concept of principal components. Principal component analysis (PCA) and its relation to singular value decomposition (SVD). The LLE (Locally Linear Embedding) and t-SNE algorithms. The principle of vector representations.
  • Unsupervised machine learning tasks. The task of clustering.
    The problem of cluster analysis. Ambiguity of cluster analysis. Methods of cluster analysis. The k-Means algorithm. Distance metrics: Euclidean, Manhattan, Minkowski. Application of the cosine distance. Hierarchical clustering algorithms. Agglomerative and divisive hierarchical clustering. Methods for determining the optimal number of clusters. Clustering evaluation methods.
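
As an illustration of the linear regression topic listed above (least squares, gradient descent, mean absolute error and mean squared error), here is a minimal sketch in pure Python. It is not taken from the course materials: the function names, the toy data, the learning rate and the number of steps are all assumptions chosen for demonstration.

```python
# Illustrative sketch only: one-feature linear regression y = w * x + b.
# All names, data and hyperparameters are hypothetical, not from the course.

def fit_least_squares(xs, ys):
    """Closed-form ordinary least squares for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

def fit_gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Minimise the mean squared error with batch gradient descent."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def mean_absolute_error(ys, preds):
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)

# Made-up data lying roughly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

w_ols, b_ols = fit_least_squares(xs, ys)
w_gd, b_gd = fit_gradient_descent(xs, ys)

preds = [w_ols * x + b_ols for x in xs]
print("least squares:   ", w_ols, b_ols)
print("gradient descent:", w_gd, b_gd)
print("MSE:", mean_squared_error(ys, preds))
print("MAE:", mean_absolute_error(ys, preds))
```

On this toy data both fitting methods should arrive at essentially the same line, since gradient descent on the mean squared error converges to the least squares solution when the learning rate is small enough.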
Assessment Elements

  • non-blocking Laboratory work
  • non-blocking Control work
  • non-blocking Exam
    "The exam is held orally (questions on the course materials). The exam is conducted on the MS Teams platform (https://teams.microsoft.com). Students must join the exam according to the answer schedule sent by the instructor to their corporate email addresses the day before the exam. The student's computer must meet the following requirements: a working camera and microphone, and support for MS Teams. To take part in the exam, the student must: set their own photo as the avatar, join the exam exactly as scheduled, and turn on the camera and microphone when answering. During the exam, students are prohibited from turning off the camera and from using notes or prompts. A short-term connection failure during the exam is a loss of connection lasting less than 5 minutes. A long-term connection failure during the exam is a loss of connection lasting 5 minutes or more. In the event of a long-term connection failure, the student cannot continue the exam. The retake procedure is the same as the exam procedure."
Interim Assessment

  • Interim assessment (4 module)
    0.3 * Control work + 0.4 * Exam + 0.3 * Laboratory work
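
The weighting above can be checked with a short calculation. The sketch below is only an illustration: the function name, the argument names and the example scores are hypothetical, and the syllabus does not state the grading scale or any rounding rule, so none is applied.

```python
# Illustrative sketch: the interim assessment as a weighted sum.
# The weights come from the formula above; everything else is assumed.

def interim_grade(control_work, exam, laboratory_work):
    """Return 0.3 * Control work + 0.4 * Exam + 0.3 * Laboratory work."""
    return 0.3 * control_work + 0.4 * exam + 0.3 * laboratory_work

# Made-up component scores for demonstration.
example = interim_grade(control_work=8, exam=6, laboratory_work=10)
print(example)
```

With these made-up scores the components contribute 2.4, 2.4 and 3.0, for a total of about 7.8 before any rounding.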
Bibliography

Recommended Core Bibliography

  • Muller, A. C., & Guido, S. (2017). Introduction to machine learning with Python: a guide for data scientists. O’Reilly Media. (HSE access: http://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=4698164)

Recommended Additional Bibliography

  • Raschka, S. (2017). Python and machine learning: an essential guide to the latest predictive analytics, required for a deeper understanding of machine learning methodology (Russian edition). DMK Press. 418 p. ISBN: 978-5-97060-409-0. (Electronic text, EBS LAN: https://e.lanbook.com/book/100905)