## Top Machine Learning Questions for Data Scientists to Know the Answer to

## Questions every data scientist should know how to answer

Machine learning questions make up a large portion of data science interviews. Positions like data scientist and machine learning engineer require candidates to have a comprehensive understanding of machine learning models and to be familiar with conducting analysis using them. While discussing your projects with the interviewer demonstrates your understanding of certain models, you should also expect fundamental machine learning questions about model selection, feature selection, feature engineering, model evaluation, etc.

**What are supervised machine learning problems, and what are unsupervised machine learning problems?**

You can distinguish them by checking whether there are target values, or labels, to predict. Supervised machine learning maps data to target values, so the model uses features extracted from the data to predict those targets. For example, using linear regression to predict housing prices, or using logistic regression to predict whether a person will default on their debts. Unsupervised machine learning problems have no target values to predict; instead, the model learns to uncover general patterns in the data. Examples include clustering data based on its patterns and dimension reduction based on feature variances.
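The contrast can be sketched with scikit-learn on a small synthetic dataset (the data and model choices here are purely illustrative):

```python
# Supervised vs. unsupervised learning on toy data (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Supervised: features X are paired with known targets y.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y)
print(reg.coef_)  # coefficients recovered from the (X, y) pairs

# Unsupervised: no targets; the model looks for structure in X alone.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])  # cluster assignment for each point
```

The regression needs both `X` and `y` to fit; the clustering sees only `X`.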

**What is a classification problem, and what is a regression problem?**

Classification problems and regression problems are both supervised machine learning problems, so both have target values. Classification problems have discrete target values that stand for classes; in a binary classification problem there are only a positive class and a negative class. Regression problems have continuous target values to predict, like housing prices, waiting times, etc.

**What are the parameters and hyper-parameters for a machine learning model?**

Parameters are learned during the fitting process of the model, while hyper-parameters are set beforehand, either by default or by searching over candidate values (for example with GridSearchCV). Take ridge regression as an example: the parameters are the coefficients for all the features, while the hyper-parameter is α, which specifies the level of regularization in the model.
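The ridge example can be made concrete with a short sketch (the data and the candidate α values are made up for illustration):

```python
# Ridge regression: alpha is a hyper-parameter chosen before fitting
# (here via grid search); the coefficients are parameters learned
# during fitting. Dataset is synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=200)

# Hyper-parameter search: try several regularization strengths.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)

print(search.best_params_)           # chosen hyper-parameter
print(search.best_estimator_.coef_)  # fitted parameters, one per feature
```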

**What is the cost function of logistic regression?**

Logistic regression uses the cross-entropy (log loss) cost function, which simultaneously penalizes uncertainty and incorrect predictions:

J = −(1/m) Σ_j [y_j log f(X_j) + (1 − y_j) log(1 − f(X_j))]

Incorrect predictions made with high confidence contribute the largest penalties to the sum. For example, when y_j = 0 and your model predicts f(X_j) = 0.9, the contribution is −log(1 − 0.9) = −log(0.1) ≈ 2.3, and it grows toward infinity as the prediction approaches 1.
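A minimal implementation of that cost, computed directly from the definition above:

```python
# Binary cross-entropy (log loss), the logistic-regression cost function.
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Average log loss for true labels y and predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident wrong prediction is penalized far more heavily
# than a mildly wrong or correct one.
print(cross_entropy(np.array([0.0]), np.array([0.9])))  # -log(0.1) ≈ 2.303
print(cross_entropy(np.array([0.0]), np.array([0.1])))  # -log(0.9) ≈ 0.105
```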

**What is SVM, and what is the support vector?**

Support Vector Machine (SVM) is a supervised machine learning algorithm that is usually used for binary classification problems, though it can also be applied to multi-class classification and regression problems. The support vectors are the data points that lie closest to the separating hyperplane. They are the most difficult data points to classify. Moreover, support vectors are the elements of the training set that would change the position of the dividing hyperplane if removed.
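Scikit-learn exposes the support vectors of a fitted SVM directly, which makes the idea easy to inspect on a toy problem (the six points below are made up for illustration):

```python
# Fit a linear SVM on a tiny 2-D problem and inspect its support vectors.
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0],
              [3.0, 3.0], [3.2, 2.9], [4.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The support vectors are the points closest to the separating hyperplane;
# removing any of them could move the boundary.
print(clf.support_vectors_)
print(clf.support_)  # indices of the support vectors in the training set
```

Points deep inside each cluster do not appear in `support_vectors_`; only the boundary points do.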

**What are Gradient Descent and Stochastic Gradient Descent?**

Each machine learning model has a cost function J(θ_0, θ_1, …, θ_n), where the θs are the parameters. To find the optimal parameters during fitting, we solve an optimization problem:

min J(θ_0, θ_1, …, θ_n)

w.r.t. θ_0, θ_1, …, θ_n

Gradient Descent solves this problem with first-order iterations: it starts with random values of the θs and keeps updating them based on the first-order partial derivatives. When a partial derivative is positive, we decrease that θ, and vice versa. When the partial derivatives reach zero, or close enough to zero, the iteration stops at a local/global minimum. The learning rate η controls the step size: when it is small, convergence takes longer; when it is too large, the cost function may not decrease at every iteration and may even diverge in some cases.

Stochastic Gradient Descent is an optimization method that considers each training observation individually, instead of all at once (as ordinary gradient descent would). Instead of calculating the exact gradient of the cost function, it uses each observation to estimate the gradient and then takes a step in that direction. While each individual observation provides a poor estimate of the true gradient, given enough randomness the parameters converge to a good global estimate. Because it only needs to consider a single observation at a time, stochastic gradient descent can handle data sets too large to fit in memory.
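The two update rules can be sketched side by side for least-squares linear regression, where the cost is the mean squared error (data, learning rates, and iteration counts are illustrative):

```python
# Batch gradient descent vs. stochastic gradient descent for
# least-squares linear regression on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_theta = np.array([2.0, -1.0])
y = X @ true_theta + rng.normal(scale=0.1, size=500)

def batch_gd(X, y, lr=0.1, n_iters=200):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ theta - y) / len(y)  # exact gradient
        theta -= lr * grad
    return theta

def sgd(X, y, lr=0.01, n_epochs=5):
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):  # one observation at a time
            grad = 2 * X[i] * (X[i] @ theta - y[i])  # noisy estimate
            theta -= lr * grad
    return theta

theta_gd = batch_gd(X, y)
theta_sgd = sgd(X, y)
print(theta_gd)   # both approach the true parameters [2, -1]
print(theta_sgd)
```

Batch GD touches all 500 rows per step; SGD touches one row per step, which is what lets it stream over data that does not fit in memory.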

**How to choose K for K-means?**

We choose the number of clusters for the K-means algorithm beforehand, and the K value is determined both technically and practically. First, we plot the Elbow curve, which measures distortion (the average of the squared distances from the cluster centers) or inertia (the sum of squared distances of samples to their closest cluster center) as a function of K. Note that distortion and inertia always decrease as K increases, and if K equals the number of data points, both are zero. We use the Elbow curve to check how fast these values decrease and choose the K at the “Elbow point”, where the decrease becomes substantially slower. Practically speaking, we also want a K that is easy to interpret and practically actionable.
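The inertia values behind the Elbow curve can be computed directly (here on synthetic blobs with a known true K of 3, purely for illustration):

```python
# Compute K-means inertia for a range of K and look for the "elbow"
# where the decrease slows down. Data are three synthetic blobs.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)

inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # sum of squared distances to closest center

for k, v in inertias.items():
    print(k, round(v, 1))
# Inertia always decreases as K grows; the drop is steep up to K = 3
# (the true number of blobs) and much flatter afterwards.
```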

**What is online learning?**

Online learning means updating a fitted model with new data, rather than re-fitting the whole model from scratch. It is usually applied in two scenarios. One is when your data arrives sequentially and you want to adjust your model incrementally to accommodate the new data. The other is when your data is too large to train on all at once; you can then either use Stochastic Gradient Descent or specify batch sizes, depending on the model you are using.
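In scikit-learn, estimators that support online learning expose a `partial_fit` method. A minimal sketch, with batches simulated from a made-up linear decision boundary:

```python
# Online learning with partial_fit: the model is updated batch by batch
# instead of being refit on all data at once. Data are synthetic.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared up front

for _ in range(20):  # data arriving in sequential batches
    X_batch = rng.normal(size=(50, 2))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(100, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
print(clf.score(X_test, y_test))  # accuracy after streaming 20 batches
```

Each call to `partial_fit` sees only the current batch, so memory usage stays constant no matter how much data streams in.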

**What is the difference between under-fitting and over-fitting?**

Under-fitting is when your model is not complex enough to learn the patterns in the data, and over-fitting is when your model is too complicated and picks up the noise rather than the patterns. When under-fitting, your model performs poorly on both the training set and the test set, and you need to include more features or use a more complex model. When over-fitting, the model performs very well on the training set but does not generalize to new data, which means it performs badly on the test set. You then need to use a simpler model, remove some features, or reduce complexity through regularization, bagging, or dropout.
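Both failure modes can be demonstrated by varying model complexity on a known signal (the quadratic data and the polynomial degrees below are chosen purely for illustration):

```python
# Polynomial degree controls model complexity: degree 1 underfits a
# quadratic signal; a very high degree starts fitting the noise, which
# shows up as a gap between train and test scores. Data are synthetic.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=40)
X_test = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y_test = X_test[:, 0] ** 2 + rng.normal(scale=1.0, size=40)

scores = {}
for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    scores[degree] = (model.score(X, y), model.score(X_test, y_test))
    print(degree, scores[degree])
# As the degree grows, the training score keeps improving, but the test
# score stops improving once the model starts fitting noise.
```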

**What is the trade-off between bias and variance?**

Bias measures how poorly your model captures the underlying patterns, so it is a measure of under-fitting. Variance measures how much your model has fit the noise in the data, so it is a measure of over-fitting. The trade-off: as model complexity increases, bias decreases but variance increases, so we look for the level of complexity that minimizes the total error.