Course Syllabus
Introduction: What is Data Science? Big Data and Data Science – Datafication – Current landscape of perspectives – Skill sets needed; Matrices – Matrices to represent relations between data, and necessary linear algebraic operations on matrices – Approximately representing matrices by decompositions (SVD and PCA); Statistics: Descriptive Statistics: distributions and probability – Statistical Inference: Populations and samples – Statistical modeling – probability distributions – fitting a model – Hypothesis Testing – Intro to R/Python.
Data preprocessing: Data cleaning – Data integration – Data Reduction – Data Transformation and Data Discretization. Evaluation of classification methods – Confusion matrix, Student's t-tests and ROC curves – Exploratory Data Analysis – Basic tools (plots, graphs and summary statistics) of EDA, Philosophy of EDA – The Data Science Process.
Basic Machine Learning Algorithms: Association Rule mining – Linear Regression – Logistic Regression – Classifiers – k-Nearest Neighbors (k-NN), k-means – Decision tree – Naive Bayes – Ensemble Methods – Random Forest. Feature Generation and Feature Selection – Feature Selection algorithms – Filters; Wrappers; Decision Trees; Random Forests.
Clustering: Choosing distance metrics – Different clustering approaches – hierarchical agglomerative clustering, k-means (Lloyd’s algorithm), DBSCAN – Relative merits of each method – clustering tendency and quality.
Data Visualization: Basic principles, ideas and tools for data visualization.