Ques: 1. What is the difference between supervised and unsupervised machine learning?
Answer:
Supervised learning requires labelled training data. For example, to do
classification (a supervised learning task), you first need to label the data
you’ll use to train the model, so that it learns to assign new data to your
labelled groups. Unsupervised learning, in contrast, does not require
explicitly labelled data.
Ques: 2. What is overfitting? And how do you ensure you’re not overfitting with a model?
Answer:
Overfitting occurs when a model learns the training data so closely that its
performance on new data suffers. The noise in the training data is picked up
and learned as if it were a real concept. These concepts do not apply to the
testing data, so they hurt the model’s ability to classify new data and reduce
its accuracy on the testing data.
To reduce overfitting:
- Collect more data so that the model can be trained on more varied samples.
- Use ensemble methods, such as Random Forest. It is based on the idea of
  bagging, which reduces the variance of the predictions by combining the
  results of multiple decision trees trained on different samples of the data
  set.
Ques: 3. What do you understand by precision and recall?
Answer:
Recall is also known as the true positive rate: the number of positives your
model correctly identifies compared to the actual number of positives in the
data. Precision is also known as the positive predictive value: the number of
correct positives your model claims compared to the total number of positives
it claims. It can be easier to think of recall and precision in the context of
a case where you’ve predicted that there were 10 apples and 5 oranges in a case
of 10 apples. You’d have perfect recall (there are actually 10 apples, and you
predicted there would be 10) but 66.7% precision, because out of the 15 items
you predicted, only 10 (the apples) are correct.
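The apple example can be checked with a few lines of Python. This is only a minimal sketch: the function name `precision_recall` and the argument names are illustrative choices, not taken from any library.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw prediction counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# 10 apples found correctly, 5 oranges wrongly claimed, no apples missed.
precision, recall = precision_recall(
    true_positives=10, false_positives=5, false_negatives=0
)
print(f"precision={precision:.1%}, recall={recall:.1%}")
```

Here the 5 wrongly predicted oranges count as false positives, which is what pulls precision down while recall stays perfect.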
Ques: 4. What are collinearity and multicollinearity?
Answer:
Collinearity occurs when two predictor variables (e.g., x1 and x2) in a
multiple regression are correlated with each other.
Multicollinearity occurs when more than two predictor variables (e.g., x1, x2,
and x3) are inter-correlated.
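One simple way to spot collinear predictors is to look at their pairwise correlation. The sketch below computes the Pearson correlation by hand; the values of `x1`, `x2` and `x3` are made up for illustration, and a pairwise check like this is only a first look, since it cannot detect every case of multicollinearity.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical predictors: x2 is roughly 2 * x1, x3 is only weakly related.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
x3 = [5.0, 1.0, 4.0, 2.0, 3.0]

r_12 = pearson(x1, x2)  # close to 1: x1 and x2 are collinear
r_13 = pearson(x1, x3)  # much weaker relationship
print(round(r_12, 3), round(r_13, 3))
```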
Ques: 5. What’s the difference between Type I and Type II error?
Answer:
Don’t think that this is a trick question! Many machine learning interview
questions are an attempt to lob basic questions at you just to make sure you’re
on top of your game and have covered all of your bases.
Type I error is a false positive, while Type II error is a false negative.
Briefly stated, Type I error means claiming something has happened when it
hasn’t, while Type II error means claiming nothing is happening when in fact
something is.
A clever way to remember this is to think of Type I error as telling a man he
is pregnant, while Type II error is telling a pregnant woman she isn’t carrying
a baby.
Ques: 6. What is A/B Testing?
Answer:
A/B testing is statistical hypothesis testing for a randomized experiment with
two variants, A and B. In machine learning it can be used to compare two models
that use different predictor variables, in order to check which one performs
best for a given sample of data.
Consider a scenario where you’ve created two models (using different predictor
variables) that can be used to recommend products for an e-commerce platform.
A/B testing can be used to compare these two models and check which one best
recommends products to a customer.
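As a rough illustration of the hypothesis test behind such a comparison, here is a sketch of a two-proportion z-test. It assumes each customer either buys a recommended product or doesn’t; the function name and the counts are hypothetical, and a real experiment may call for a different test.

```python
import math

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical results: model A converts 100 of 1000 visitors, model B 130 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z={z:.2f}, p={p_value:.3f}")
```

A small p-value suggests the difference between the two models is unlikely to be due to chance alone.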
Ques: 7. What is deep learning, and how does it contrast with other machine learning algorithms?
Answer:
Deep learning is a subset of machine learning that is concerned with neural
networks: how to use backpropagation and certain principles from neuroscience
to more accurately model large sets of unlabelled or semi-structured data. In
contrast with algorithms that rely on hand-engineered features, deep learning
learns representations of the data through the layers of a neural net. It can
be applied in supervised, semi-supervised and unsupervised settings.
Ques: 8. Name a few libraries in Python used for Data Analysis and Scientific Computations.
Answer:
Here is a list of Python libraries mainly used for Data Analysis:
- NumPy
- SciPy
- Pandas
- scikit-learn
- Matplotlib
- Seaborn
- Bokeh
Ques: 9. Which is more important to you: model accuracy, or model performance?
Answer:
This question tests your grasp of the nuances of machine learning model
performance! Machine learning interview questions often look towards the
details. There are models with higher accuracy that have worse predictive
power. How does that make sense?
Well, it has everything to do with how model accuracy is only a subset of model
performance, and at that, a sometimes misleading one. For example, if you
wanted to detect fraud in a massive data set with millions of samples, the most
accurate model would most likely predict no fraud at all if only a tiny
minority of cases were fraud. However, this would be useless for a predictive
model: a model designed to find fraud that asserts there is no fraud at all!
Questions like this help you demonstrate that you understand model accuracy
isn’t the be-all and end-all of model performance.
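The fraud example is easy to reproduce. In this sketch the data set size and fraud rate are made-up numbers, and the "model" is simply a rule that always predicts no fraud.

```python
# Hypothetical data: 10,000 transactions, of which 10 are fraudulent (label 1).
n_total = 10_000
n_fraud = 10
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A "model" that always predicts no fraud.
y_pred = [0] * n_total

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = fraud_caught / n_fraud

print(f"accuracy={accuracy:.1%}, fraud recall={fraud_recall:.1%}")
```

The accuracy looks excellent, yet not a single fraudulent transaction is caught, which is why accuracy alone is a poor measure here.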
Ques: 10. How are NumPy and SciPy related?
Answer:
NumPy is part of the SciPy ecosystem, and SciPy is built on top of it.
NumPy defines arrays along with basic numerical operations such as indexing,
sorting and reshaping.
SciPy implements higher-level computations such as numerical integration,
optimization, linear algebra and statistics using NumPy’s functionality.
Ques: 11. How would you handle an imbalanced dataset?
Answer:
An imbalanced dataset is when you have, for example, a classification task and 90% of the data is in one class. That leads to problems: an accuracy of 90% can be misleading if you have no predictive power on the other category of data! Here are a few tactics to get over the hump:
- Collect more data to even the imbalances in the dataset.
- Re-sample the dataset to correct for imbalances.
- Try a different algorithm altogether on your dataset.
What’s important here is that you have a keen sense of what damage an imbalanced dataset can cause, and how to balance it.
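Re-sampling can be as simple as duplicating random rows of the smaller class until the classes are the same size. The sketch below shows this random oversampling; the helper name `random_oversample` and the toy rows are hypothetical, and in practice you would re-sample only the training split.

```python
import random
from collections import Counter

def random_oversample(rows, label_of, seed=0):
    """Duplicate random rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    rows_by_class = {}
    for row in rows:
        rows_by_class.setdefault(label_of(row), []).append(row)
    target_size = max(len(group) for group in rows_by_class.values())
    balanced = []
    for group in rows_by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (no-op for the largest class).
        balanced.extend(rng.choices(group, k=target_size - len(group)))
    return balanced

# Hypothetical 90/10 split between two classes.
rows = [(i, "majority") for i in range(90)] + [(i, "minority") for i in range(10)]
balanced = random_oversample(rows, label_of=lambda row: row[1])
class_counts = Counter(label for _, label in balanced)
print(class_counts)
```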
Ques: 12. Is rotation necessary in PCA? If yes, why? What will happen if you don’t rotate the components?
Answer:
Yes, rotation (orthogonal) is necessary because it maximizes the difference
between the variance captured by the components. This makes the components
easier to interpret. That is, after all, the motive for doing PCA: we aim to
select fewer components (than features) that can explain the maximum variance
in the data set. Rotation doesn’t change the relative location of the
components; it only changes the actual coordinates of the points.
If we don’t rotate the components, the effect of PCA will diminish and we’ll
have to select more components to explain the variance in the data set.
Ques: 13. What’s the “kernel trick” and how is it useful?
Answer:
The kernel trick involves kernel functions that let us operate in
higher-dimensional spaces without explicitly calculating the coordinates of
points in that space: instead, kernel functions compute the inner products
between the images of all pairs of data in a feature space. This gives the
benefit of working in the higher-dimensional space while being computationally
cheaper than explicitly calculating those coordinates. Many algorithms can be
expressed in terms of inner products. Using the kernel trick enables us to
effectively run such algorithms in a high-dimensional space with
lower-dimensional data.
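A small numeric check makes this concrete. The sketch below uses a degree-2 polynomial kernel on 2-D points, chosen purely as an illustration: the kernel value computed in the original space matches the inner product of the explicitly mapped 3-D points. The function names are illustrative.

```python
import math

def explicit_feature_map(point):
    """Map a 2-D point into the 3-D feature space of the degree-2 polynomial kernel."""
    x1, x2 = point
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return dot(u, v) ** 2

a = (1.0, 2.0)
b = (3.0, 4.0)

via_kernel = polynomial_kernel(a, b)
via_mapping = dot(explicit_feature_map(a), explicit_feature_map(b))
print(via_kernel, via_mapping)  # the two values agree up to rounding
```

The kernel needs one 2-D dot product and a square, whereas the explicit route has to build the mapped points first; the gap grows quickly with higher degrees and dimensions.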
Ques: 14. Explain prior probability, likelihood and marginal likelihood in the context of the Naive Bayes algorithm.
Answer:
Prior probability is nothing but the proportion of each class of the dependent
(binary) variable in the data set. It is the closest guess you can make about a
class without any further information. For example: in a data set, the
dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and 0
(not spam) is 30%. Hence, we can estimate that there is a 70% chance that any
new email would be classified as spam.
Likelihood is the probability of observing a given feature value when the
observation belongs to a given class. For example: the probability that the
word ‘FREE’ is used in a spam message is a likelihood. Marginal likelihood is
the probability that the word ‘FREE’ is used in any message.
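These three quantities combine through Bayes’ rule. In the sketch below the 70%/30% priors come from the example above, while the two likelihood values are invented for illustration.

```python
# Priors from the example: 70% of messages are spam.
prior_spam = 0.70
prior_not_spam = 0.30

# Hypothetical likelihoods of seeing the word 'FREE' in each class.
p_free_given_spam = 0.60
p_free_given_not_spam = 0.10

# Marginal likelihood: probability of 'FREE' appearing in any message.
marginal_free = (p_free_given_spam * prior_spam
                 + p_free_given_not_spam * prior_not_spam)

# Posterior: probability that a message containing 'FREE' is spam.
posterior_spam = p_free_given_spam * prior_spam / marginal_free

print(f"P(FREE)={marginal_free:.2f}, P(spam | FREE)={posterior_spam:.3f}")
```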
Ques: 15. Do you have experience with Spark or big data tools for machine learning?
Answer:
You’ll want to get familiar with the meaning of big data for different
companies and the different tools they’ll want. Spark is one of the big data
tools most in demand, able to handle immense datasets with speed. Be honest if
you don’t have experience with the tools demanded, but also take a look at job
descriptions and see what tools pop up: you’ll want to invest in familiarizing
yourself with them.
Ques: 16. You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?
Answer:
Low bias occurs when the model’s predicted values are close to the actual
values. In other words, the model is flexible enough to mimic the training data
distribution. While that sounds like a great achievement, a model that is too
flexible generalizes poorly. That means that when the model is tested on unseen
data, it gives disappointing results.
In such situations, we can use a bagging algorithm (like random forest) to
tackle the high variance problem. Bagging algorithms draw multiple samples from
a data set using repeated randomized sampling (with replacement). Then, these
samples are used to generate a set of models using a single learning algorithm.
Later, the model predictions are combined using voting (classification) or
averaging (regression).
Also, to combat high variance, we can:
- Use a regularization technique, where higher model coefficients get
  penalized, hence lowering model complexity.
- Use the top n features from the variable importance chart. It may be that,
  with all the variables in the data set, the algorithm is having difficulty
  finding the meaningful signal.
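The bagging procedure described above can be sketched in plain Python. This is a toy version under stated assumptions: the base learner is a one-dimensional threshold classifier (a "stump") rather than a full decision tree, and the data points and function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the threshold (predict 1 when x > threshold) with the fewest errors."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum(int(x > threshold) != y for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(points, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        bootstrap_sample = [rng.choice(points) for _ in points]
        thresholds.append(fit_stump(bootstrap_sample))
    return thresholds

def bagging_predict(thresholds, x):
    """Combine the stumps by majority vote (classification)."""
    votes = Counter(int(x > threshold) for threshold in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical (feature, label) pairs.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
ensemble = bagging_fit(data)
print(bagging_predict(ensemble, 0.2), bagging_predict(ensemble, 5.0))
```

For regression, the voting step would be replaced by averaging the individual predictions.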
Ques: 17. Which data visualization libraries do you use? What are your thoughts on the best data visualization tools?
Answer:
What’s important here is to define your views on how to properly visualize data
and your personal preferences when it comes to tools. Popular tools include R’s
ggplot2, Python’s seaborn and matplotlib, and tools such as Plotly and Tableau.
Ques: 18. How is kNN different from k-means clustering?
Answer:
Don’t get misled by the ‘k’ in their names. The fundamental differences between these two algorithms are:
- k-means is unsupervised in nature and kNN is supervised in nature.
- k-means is a clustering algorithm. kNN is a classification (or regression) algorithm.
- The k-means algorithm partitions a data set into clusters such that each cluster is homogeneous and the points in it are close to each other. The algorithm tries to maintain enough separability between these clusters. Due to its unsupervised nature, the clusters have no labels.
- The kNN algorithm tries to classify an unlabelled observation based on its k (can be any number) nearest neighbors. It is also known as a lazy learner because it involves minimal training of the model: it simply stores the training data and defers the work to prediction time, rather than building a generalized model in advance.
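The contrast shows up clearly in code: kNN needs labels to vote with, while k-means only ever sees the raw values. Both functions below are simplified one-dimensional sketches with hypothetical names and data, not production implementations.

```python
from collections import Counter

def knn_predict(labelled_points, x, k=3):
    """Supervised: majority vote among the k labelled points closest to x."""
    nearest = sorted(labelled_points, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def kmeans_1d(values, centroids, n_iter=10):
    """Unsupervised: group unlabelled values around centroids that move each round."""
    centroids = list(centroids)
    clusters = []
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(cluster) / len(cluster) if cluster else centroids[i]
                     for i, cluster in enumerate(clusters)]
    return centroids, clusters

labelled = [(1.0, "small"), (1.2, "small"), (0.8, "small"),
            (5.0, "large"), (5.2, "large"), (4.8, "large")]
unlabelled = [x for x, _ in labelled]  # same values, labels thrown away

predicted_label = knn_predict(labelled, 1.1)
final_centroids, final_clusters = kmeans_1d(unlabelled, centroids=[0.0, 6.0])
print(predicted_label, final_clusters)
```

Note that k-means returns groups without names; it is up to you to interpret what each cluster means.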
Ques: 19. Is it better to have too many false positives or too many false negatives? Explain.
Answer:
It depends on the question as well as on the domain in which we are trying to
solve the problem. If you’re using machine learning in the domain of medical
testing, then a false negative is very risky, since the report will not show
any health problem when a person is actually unwell. Conversely, if machine
learning is used in spam detection, then a false positive is very risky because
the algorithm may classify an important email as spam.
Ques: 20. What is the difference between Gini Impurity and Entropy in a Decision Tree?
Answer:
Gini Impurity and Entropy are the metrics used for deciding how to split a
Decision Tree.
Gini impurity is the probability of a random sample being classified
incorrectly if you randomly pick a label according to the distribution in the
branch.
Entropy is a measure of uncertainty (lack of information) in a branch. You calculate the Information Gain (the difference in entropies) obtained by making a split. This measure helps to reduce the uncertainty about the output label.
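Both metrics are short formulas over the class proportions in a branch. The following sketch uses base-2 logarithms for entropy and a made-up four-sample branch; the function names are illustrative.

```python
import math
from collections import Counter

def class_proportions(labels):
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]

def gini_impurity(labels):
    """Probability of mislabelling a random sample drawn from this branch."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Uncertainty of the label distribution, in bits."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child branches."""
    weighted = sum(len(child) / len(parent) * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical branch with two classes, and a split that separates them perfectly.
parent = ["spam", "spam", "ham", "ham"]
children = [["spam", "spam"], ["ham", "ham"]]

print(gini_impurity(parent), entropy(parent), information_gain(parent, children))
```

A perfectly mixed two-class branch has the highest impurity under both measures, and a split that yields pure children recovers all of that uncertainty as information gain.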