December 21, 2019

Top 20 Machine Learning Interview Questions and Answers

Ques: 1. What is the difference between supervised and unsupervised machine learning?


Supervised learning requires labelled training data. For example, in order to do classification (a supervised learning task), you’ll first need to label the data you’ll use to train the model to classify observations into your labelled groups. Unsupervised learning, in contrast, does not require explicitly labelled data.


Ques: 2. What is Overfitting? And how do you ensure you’re not overfitting with a model?


Overfitting occurs when a model learns the training data to such an extent that it hurts the model’s performance on new data. This means that noise in the training data is picked up and learned as concepts by the model. The problem is that these concepts do not apply to the testing data, which degrades the model’s ability to classify new data and reduces accuracy on the testing data.

There are a few ways to guard against overfitting:

  • Collect more data so that the model can be trained with varied samples.
  • Use ensemble methods, such as Random Forest. It is based on the idea of bagging, which reduces the variance of the predictions by combining the results of multiple Decision Trees built on different samples of the data set.
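
A minimal sketch of why training accuracy alone cannot detect overfitting, using toy data and plain Python: a 1-nearest-neighbour classifier memorizes its training set, so it always scores perfectly on it.

```python
# 1-NN memorizes the training set: training accuracy is always perfect,
# which tells us nothing about performance on new data.
train = [([0.0, 0.0], 0), ([0.1, -0.1], 0), ([1.0, 1.0], 1), ([0.9, 1.1], 1)]

def predict(x, data):
    # Return the label of the closest training point (squared distance).
    nearest = min(data, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]

train_acc = sum(predict(x, train) == y for x, y in train) / len(train)
print(train_acc)  # 1.0 by construction; test accuracy could be far lower
```

This is exactly why performance must be measured on held-out data rather than on the data the model was trained on.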


Ques: 3. What do you understand by precision and recall?


Recall is also known as the true positive rate: the number of positives your model correctly identifies compared to the actual number of positives in the data. Precision is also known as the positive predictive value: the number of accurate positives your model claims compared to the total number of positives it claims. It can be easier to think of recall and precision in the context of a basket of 10 apples, where you’ve predicted there were 10 apples and 5 oranges. You’d have perfect recall (there are actually 10 apples, and you predicted all 10) but only 66.7% precision, because out of the 15 items you predicted, only 10 (the apples) are correct.
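
The apples-and-oranges example above boils down to three counts; a small plain-Python sketch:

```python
# Toy counts from the apples-and-oranges example.
tp = 10  # apples predicted that really are apples
fp = 5   # oranges predicted, but the basket holds none
fn = 0   # apples that were missed

precision = tp / (tp + fp)  # 10 / 15 ≈ 0.667
recall = tp / (tp + fn)     # 10 / 10 = 1.0
print(round(precision, 3), recall)  # 0.667 1.0
```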


Ques: 4. What are collinearity and multicollinearity?


Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression are correlated.

Multicollinearity occurs when more than two predictor variables (e.g., x1, x2, and x3) are inter-correlated.
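
A quick way to spot collinearity is the correlation coefficient between predictors. A minimal sketch with synthetic data, assuming NumPy is available:

```python
import numpy as np

x1 = np.arange(100, dtype=float)
x2 = 2.0 * x1 + 3.0          # x2 is an exact linear function of x1

r = np.corrcoef(x1, x2)[0, 1]
print(r)  # ≈ 1.0: the two predictors are perfectly collinear
```

A correlation close to ±1 between predictors is a warning sign that their individual regression coefficients will be unstable.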


Ques: 5. What’s the difference between Type I and Type II error?


Don’t think that this is a trick question! Many machine learning interview questions are an attempt to lob basic questions at you just to make sure you’re on top of your game and have covered all of your bases.

Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn’t, while Type II error means that you claim nothing is happening when in fact something is.

A clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn’t carrying a baby.


Ques: 6. What is A/B Testing?


A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. It is used to compare two models that use different predictor variables in order to check which one fits a given sample of data best.

Consider a scenario where you’ve created two models (using different predictor variables) that can be used to recommend products for an e-commerce platform.

A/B Testing can be used to compare these two models to check which one best recommends products to a customer.
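
The comparison usually comes down to a significance test. A minimal sketch of a two-proportion z-test with hypothetical conversion counts (all numbers below are made up for illustration):

```python
import math

# Hypothetical conversion counts for variants A and B.
conv_a, n_a = 200, 1000
conv_b, n_b = 260, 1000

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled conversion rate
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
print(round(z, 2))  # |z| > 1.96 means significant at the 5% level
```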


Ques: 7. What is deep learning, and how does it contrast with other machine learning algorithms?


Deep learning is a subset of machine learning that is concerned with neural networks: how to use backpropagation and certain principles from neuroscience to more accurately model large sets of unlabelled or semi-structured data. Unlike many classical algorithms, which rely on hand-crafted features, deep learning learns its own representations of the data through the use of neural nets, and it can be applied in supervised, unsupervised and semi-supervised settings.


Ques: 8. Name a few libraries in Python used for Data Analysis and Scientific Computations.


Here is a list of Python libraries mainly used for Data Analysis: 

  • NumPy
  • SciPy
  • Pandas 
  • Scikit-learn
  • Matplotlib 
  • Seaborn 
  • Bokeh


Ques: 9. Which is more important to you– model accuracy, or model performance?


This question tests your grasp of the nuances of machine learning model performance! Machine learning interview questions often look towards the details. There are models with higher accuracy that can perform worse in predictive power — how does that make sense?

Well, it has everything to do with how model accuracy is only a subset of model performance, and at that, a sometimes misleading one. For example, if you wanted to detect fraud in a massive data set with a sample of millions, a more accurate model would most likely predict no fraud at all if only a vast minority of cases were fraud. However, this would be useless for a predictive model — a model designed to find fraud that asserted there was no fraud at all! Questions like this help you demonstrate that you understand model accuracy isn’t the be-all and end-all of model performance.
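
The fraud example above is easy to make concrete in plain Python: a model that never predicts fraud still posts a high accuracy.

```python
# 1,000 transactions, only 10 of them fraudulent; a model that always
# predicts "not fraud" still reaches 99% accuracy but catches zero fraud.
labels = [1] * 10 + [0] * 990          # 1 = fraud
preds = [0] * len(labels)              # "no fraud, ever"

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
fraud_caught = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
print(accuracy, fraud_caught)  # 0.99 0
```

Metrics like recall or precision on the fraud class expose the uselessness of this model immediately.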


Ques: 10. How are NumPy and SciPy related?


SciPy builds on NumPy. NumPy defines the n-dimensional array along with basic operations on it, such as indexing, sorting and reshaping.

SciPy uses NumPy’s arrays to implement higher-level computations such as numerical integration, optimization and machine learning routines.


Ques: 11. How would you handle an imbalanced dataset?


An imbalanced dataset is one where, for example, in a classification task 90% of the data belongs to one class. That leads to problems: an accuracy of 90% can be misleading if you have no predictive power on the other class! Here are a few tactics to get over the hump:

  1. Collect more data to even the imbalances in the dataset.
  2. Re-sample the dataset to correct for imbalances (over-sample the minority class or under-sample the majority class).
  3. Try a different algorithm altogether on your dataset.

What’s important here is that you have a keen sense for the damage an imbalanced dataset can cause, and how to correct it.
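
Tactic 2 can be sketched in plain Python with simple random over-sampling of the minority class (toy data, 90/10 split):

```python
import random

random.seed(0)
majority = [("sample", 0)] * 90   # class 0: 90 examples
minority = [("sample", 1)] * 10   # class 1: 10 examples

# Randomly over-sample the minority class until the classes are even.
oversampled = minority + [random.choice(minority)
                          for _ in range(len(majority) - len(minority))]
balanced = majority + oversampled

counts = {0: sum(y == 0 for _, y in balanced),
          1: sum(y == 1 for _, y in balanced)}
print(counts)  # {0: 90, 1: 90}
```

In practice you would over-sample only the training split, never the test split, so that evaluation still reflects the real class distribution.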


Ques: 12: Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?


Yes, rotation (orthogonal) is necessary because it maximizes the difference between the variance captured by the components. This makes the components easier to interpret. Don’t forget that this is the motive for doing PCA, where we aim to select fewer components (than features) that can explain the maximum variance in the data set. Rotation doesn’t change the relative locations of the points; it only changes their actual coordinates.

If we don’t rotate the components, the effect of PCA will diminish and we’ll have to select a larger number of components to explain the variance in the data set.
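
The variance-concentrating property can be seen directly from the eigenvalues of the covariance matrix (which is what PCA computes). A minimal sketch on synthetic, highly correlated data, assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = np.linspace(0.0, 1.0, 200)
x2 = x1 + rng.normal(0.0, 0.01, 200)   # second feature nearly duplicates the first
X = np.column_stack([x1, x2])

# PCA via eigendecomposition of the covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
explained = eigvals / eigvals.sum()
print(explained)  # the first component captures almost all the variance
```

With two nearly collinear features, a single component explains almost everything, so one component suffices instead of two.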


Ques: 13. What’s the “kernel trick” and how is it useful?


The kernel trick involves kernel functions that enable learning in higher-dimensional spaces without explicitly calculating the coordinates of points in that space: instead, kernel functions compute the inner products between the images of all pairs of data in a feature space. This gives them the very useful property of implicitly working with higher-dimensional coordinates while being computationally cheaper than calculating those coordinates explicitly. Many algorithms can be expressed entirely in terms of inner products, so using the kernel trick lets us effectively run them in a high-dimensional space with lower-dimensional data.
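
A minimal sketch of the trick for the degree-2 polynomial kernel: the kernel value (x · y)² equals the inner product of the explicit feature maps, yet the kernel never builds those features.

```python
import math

x = [1.0, 2.0]
y = [3.0, 0.5]

# Degree-2 polynomial kernel: k(x, y) = (x . y)^2
dot = x[0] * y[0] + x[1] * y[1]
k = dot ** 2

# Explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
def phi(v):
    return [v[0] ** 2, math.sqrt(2) * v[0] * v[1], v[1] ** 2]

explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
print(k, explicit)  # identical values; phi was never needed for k
```

For higher degrees and more features the explicit map grows combinatorially, while the kernel stays a single dot product and a power.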


Ques: 14. Explain prior probability, likelihood and marginal likelihood in the context of the Naive Bayes algorithm.


Prior probability is simply the proportion of the dependent (binary) variable in the data set. It is the closest guess you can make about a class without any further information. For example: in a data set, the dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and of 0 (not spam) is 30%. Hence, we can estimate that there is a 70% chance that any new email will be classified as spam.

Likelihood is the probability of classifying a given observation as 1 in the presence of some other variable. For example: the probability that the word ‘FREE’ appears in previous spam messages is a likelihood. Marginal likelihood is the probability that the word ‘FREE’ is used in any message.
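
These three quantities are just ratios of counts, and Bayes’ rule ties them together. A sketch in plain Python; the word counts for ‘FREE’ are hypothetical numbers invented for illustration:

```python
# Hypothetical email counts matching the 70/30 example above.
n_spam, n_ham = 70, 30
free_in_spam, free_in_ham = 35, 3   # messages containing the word 'FREE' (made up)

prior_spam = n_spam / (n_spam + n_ham)                      # P(spam) = 0.7
likelihood = free_in_spam / n_spam                          # P('FREE' | spam) = 0.5
marginal = (free_in_spam + free_in_ham) / (n_spam + n_ham)  # P('FREE') = 0.38

# Bayes' rule: P(spam | 'FREE') = P('FREE' | spam) * P(spam) / P('FREE')
posterior = likelihood * prior_spam / marginal
print(round(posterior, 3))  # ≈ 0.921
```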


Ques: 15. Do you have experience with Spark or big data tools for machine learning?


You’ll want to get familiar with the meaning of big data for different companies and the different tools they’ll want. Spark is the big data tool most in demand now, able to handle immense datasets with speed. Be honest if you don’t have experience with the tools demanded, but also take a look at job descriptions and see what tools pop up: you’ll want to invest in familiarizing yourself with them.


Ques: 16: You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?


Low bias occurs when the model’s predicted values are close to the actual values. In other words, the model has become flexible enough to mimic the training data distribution. While that sounds like a great achievement, don’t forget that a flexible model has no generalization capability: when such a model is tested on unseen data, it gives disappointing results.

In such situations, we can use a bagging algorithm (like Random Forest) to tackle the high-variance problem. Bagging algorithms divide a data set into subsets made with repeated randomized sampling (with replacement). These samples are then used to generate a set of models using a single learning algorithm, and the models’ predictions are combined by voting (for classification) or averaging (for regression).

Also, to combat high variance, we can:

  • Use regularization techniques, where larger model coefficients get penalized, hence lowering model complexity.
  • Use the top n features from a variable importance chart. Perhaps, with all the variables in the data set, the algorithm is having difficulty finding the meaningful signal.
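
The statistical idea behind bagging, namely that averaging several noisy estimates shrinks the variance, can be sketched with the standard library alone (synthetic noise standing in for model predictions):

```python
import random
import statistics

random.seed(0)
draws = [random.gauss(0.0, 1.0) for _ in range(10000)]

# A single noisy prediction vs. the average of 10 "bagged" predictions.
singles = draws[:1000]
bagged = [statistics.mean(draws[i * 10:(i + 1) * 10]) for i in range(1000)]

print(statistics.variance(singles), statistics.variance(bagged))
# Averaging 10 independent estimates cuts the variance roughly 10-fold.
```

Real bagged trees are correlated rather than independent, so the reduction is smaller in practice, but the direction of the effect is the same.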


Ques 17. Which data visualization libraries do you use? What are your thoughts on the best data visualization tools?


What’s important here is to define your views on how to properly visualize data and your personal preferences when it comes to tools. Popular tools include R’s ggplot2, Python’s seaborn and matplotlib, and tools such as Tableau.


Ques: 18. How is kNN different from kmeans clustering?


Don’t get misled by the ‘k’ in their names. The fundamental differences between these two algorithms are:

  • kmeans is unsupervised in nature and kNN is supervised in nature.
  • kmeans is a clustering algorithm. kNN is a classification (or regression) algorithm.
  • kmeans algorithm partitions a data set into clusters such that a cluster formed is homogeneous and the points in each cluster are close to each other.

The algorithm tries to maintain enough separability between these clusters. Because it is unsupervised, the clusters carry no labels.

The kNN algorithm tries to classify an unlabelled observation based on its k (which can be any number) surrounding neighbours. It is also known as a lazy learner because it involves minimal training of the model: it doesn’t learn an explicit model from the training data ahead of time, but instead memorizes the data and defers the computation to prediction time.
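
The unsupervised side of the contrast is easy to demonstrate: a toy k-means run (Lloyd’s algorithm, k=2) on two well-separated 1-D clusters recovers the group structure without ever seeing a label.

```python
# Toy k-means (k=2) on two well-separated 1-D clusters, plain Python.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
centroids = [0.0, 10.0]  # deliberately reasonable initial guesses

for _ in range(5):  # a few Lloyd iterations: assign, then recompute means
    clusters = {0: [], 1: []}
    for p in points:
        nearest = min((0, 1), key=lambda c: abs(p - centroids[c]))
        clusters[nearest].append(p)
    centroids = [sum(c) / len(c) for c in clusters.values()]

print(centroids)  # ≈ [0.1, 10.1]; no labels were ever needed
```

kNN, by contrast, could not even run here without a label attached to each training point.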


Ques: 19. Is it better to have too many false positives or too many false negatives? Explain.


It depends on the question as well as on the domain in which we are trying to solve the problem. If you’re using machine learning in the domain of medical testing, then a false negative is very risky, since the report will show no health problem when the person is actually unwell. Similarly, if machine learning is used for spam detection, then a false positive is very risky, because the algorithm may classify an important email as spam.


Ques: 20. What is the difference between Gini Impurity and Entropy in a Decision Tree?


Gini Impurity and Entropy are the metrics used for deciding how to split a Decision Tree.

Gini Impurity is the probability of a random sample being classified incorrectly if you randomly pick a label according to the distribution in the branch.

Entropy is a measurement of the lack of information (uncertainty). You calculate the Information Gain (the decrease in entropy) produced by a split; choosing splits with high gain reduces the uncertainty about the output label.
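
Both metrics are one-line formulas over the class proportions in a branch; a plain-Python sketch:

```python
import math

def gini(probs):
    # Probability of mislabelling a random sample drawn from this branch.
    return 1.0 - sum(p ** 2 for p in probs)

def entropy(probs):
    # Expected information (in bits) needed to identify the label.
    return sum(-p * math.log2(p) for p in probs if p > 0)

pure, mixed = [1.0, 0.0], [0.5, 0.5]
print(gini(pure), entropy(pure))    # 0.0 0.0  (a pure node)
print(gini(mixed), entropy(mixed))  # 0.5 1.0  (maximum impurity for 2 classes)
```

Both measures are zero for a pure node and maximal for a 50/50 split, which is why decision trees can use either to rank candidate splits.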



