**Ques: 1. What is the difference between supervised and unsupervised machine learning?**

Ans:
Supervised learning requires labelled training data. For example, to perform
classification (a supervised learning task), you first need to label the data
you'll use to train the model to classify new data into your labelled groups.
Unsupervised learning, in contrast, does not require explicitly labelled data.
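The distinction can be sketched in a few lines of Python (the data, labels, and threshold rule below are purely illustrative):

```python
# Supervised: labelled examples of (feature value, class label).
labeled = [(1.0, "small"), (1.2, "small"), (9.8, "large"), (10.1, "large")]

# "Training" here just means learning a decision threshold from the labels.
means = {}
for cls in ("small", "large"):
    vals = [x for x, y in labeled if y == cls]
    means[cls] = sum(vals) / len(vals)
threshold = (means["small"] + means["large"]) / 2

def classify(x):
    """Predict a label for a new, unseen value."""
    return "small" if x < threshold else "large"

print(classify(2.0))

# Unsupervised: the same numbers with no labels. We can only group them,
# e.g. by proximity to two centroids seeded from the data (a single
# k-means-style assignment step); the groups come out anonymous, not named.
unlabeled = [1.0, 1.2, 9.8, 10.1]
c0, c1 = min(unlabeled), max(unlabeled)
groups = [0 if abs(x - c0) < abs(x - c1) else 1 for x in unlabeled]
print(groups)
```

The supervised model can name its prediction ("small"); the unsupervised grouping can only say which points belong together.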

**Ques: 2. What is Overfitting? And how do you ensure you’re not overfitting with a model?**

Ans:
Overfitting occurs when a model learns the training data so closely that it
hurts the model's performance on new data: noise in the training data is
recorded and learned as concepts by the model. The problem is that these
"concepts" do not apply to the test data, so they degrade the model's ability
to classify new data and reduce its accuracy on the test set.

To avoid overfitting, collect more data so that the model can be trained on
varied samples, or use ensembling methods such as Random Forest. Random Forest
is based on bagging, which reduces the variance of the predictions by combining
the results of multiple decision trees trained on different samples of the
data set.

**Ques: 3. What do you understand by precision and recall?**

Ans:
Recall is also known as the true positive rate: out of all the actual
positives in the data, how many did your model find? Precision is also known
as the positive predictive value: out of all the positives your model claimed,
how many were actually correct? It can be easier to think of recall and
precision with an example: suppose you predicted there were 10 apples and 5
oranges in a basket that actually contains 10 apples. You'd have perfect
recall (there really are 10 apples, and you predicted all 10) but only 66.7%
precision, because out of the 15 items you predicted, only 10 (the apples) are
correct.
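The apple example maps directly onto the standard formulas (a small sketch; the counts come from the example above):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# 10 real apples, and you predicted 15 fruits in total (10 apples +
# 5 oranges). Every real apple was found, so FN = 0; the 5 oranges
# are false positives.
precision, recall = precision_recall(tp=10, fp=5, fn=0)
print(round(precision, 3), recall)  # 0.667 1.0
```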

**Ques: 4. What are collinearity and multicollinearity?**

Ans:
Collinearity occurs when two predictor variables (e.g., x1 and x2) in a
multiple regression are correlated with each other.

Multicollinearity occurs when more than two predictor variables (e.g., x1, x2,
and x3) are inter-correlated.
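Collinearity between two predictors can be checked with their Pearson correlation coefficient. A small pure-Python sketch (the data is invented so that x2 is roughly a linear function of x1):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# x2 is (almost) a linear function of x1, so the predictors are collinear.
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly 2 * x1
print(round(pearson(x1, x2), 3))  # close to 1.0
```

In practice, a correlation near ±1 between predictors (or a high variance inflation factor) signals that their individual regression coefficients will be unstable.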

**Ques: 5. What’s the difference between Type I and Type II error?**

Ans:
Don’t think that this is a trick question! Many machine learning interview
questions are an attempt to lob basic questions at you just to make sure
you’re on top of your game and have covered all of your bases.

Type I error is a false positive, while Type II error is a false negative.
Briefly stated, a Type I error means claiming something has happened when it
hasn’t, while a Type II error means claiming nothing is happening when in fact
something is.

A clever way to remember this: a Type I error is telling a man he is pregnant,
while a Type II error is telling a pregnant woman she isn’t carrying a baby.

**Ques: 6. What is A/B Testing?**

Ans:
A/B testing is statistical hypothesis testing for a randomized experiment with
two variants, A and B. It is used to compare two models that use different
predictor variables in order to check which one best fits a given sample of
data.

Consider a scenario where you’ve created two models (using different predictor
variables) that can be used to recommend products for an e-commerce platform.
A/B testing can be used to compare these two models and check which one
recommends products to a customer best.
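A common way to analyse an A/B test on conversion rates is a two-proportion z-test. A sketch using only the standard library (the conversion counts are invented for illustration):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)              # pooled conversion rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Variant A: 200 of 2000 users converted; variant B: 260 of 2000.
z = two_proportion_z(200, 2000, 260, 2000)

# Two-sided p-value from the normal CDF (via math.erf).
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(z, 2), round(p_value, 4))
```

A small p-value suggests the difference between the variants is unlikely to be chance; in practice you would fix the sample size and significance level before running the test.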

**Ques: 7. What is deep learning, and how does it contrast with other machine learning algorithms?**

Ans:
Deep learning is a subset of machine learning concerned with neural networks:
how to use backpropagation and certain principles from neuroscience to more
accurately model large sets of unlabelled or semi-structured data. In that
sense, deep learning can act as an unsupervised learning algorithm that learns
representations of data through the use of neural nets.

**Ques: 8. Name a few libraries in Python used for Data Analysis and Scientific Computations.**

Ans:
Here is a list of Python libraries mainly used for Data Analysis:

- NumPy
- SciPy
- Pandas
- Scikit-learn
- Matplotlib
- Seaborn
- Bokeh

**Ques: 9. Which is more important to you– model accuracy, or model performance?**

Ans:
This question tests your grasp of the nuances of machine learning model
performance! Machine learning interview questions often look towards the
details. There are models with higher accuracy that can perform worse in
predictive power — how does that make sense?

Well, it has everything to do with the fact that model accuracy is only one
facet of model performance, and sometimes a misleading one. For example, if
you wanted to detect fraud in a massive data set with millions of samples, and
only a tiny minority of cases were fraud, the most accurate model would most
likely predict no fraud at all. However, this would be useless as a predictive
model: a model designed to find fraud that asserts there is no fraud at all!
Questions like this help you demonstrate that you understand model accuracy
isn’t the be-all and end-all of model performance.
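The fraud example is easy to make concrete: a classifier that always predicts "no fraud" gets high accuracy but catches nothing (toy numbers below):

```python
# Toy fraud data: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000            # a "model" that always predicts no fraud

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = caught / sum(y_true)  # fraction of real fraud cases caught

print(accuracy, recall)  # high accuracy, zero recall
```

99% accuracy, 0% recall: exactly the useless "accurate" model described above.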

**Ques: 10. How are NumPy and SciPy related?**

Ans:
SciPy builds on NumPy. NumPy defines arrays along with basic operations on
them, such as indexing, sorting, and reshaping.

SciPy uses NumPy's functionality to implement computations such as numerical
integration, optimization, and machine learning routines.

**Ques: 11. How would you handle an imbalanced dataset?**

Ans:
An imbalanced dataset is one where, for example, in a classification task, 90%
of the data belongs to one class. That leads to problems: an accuracy of 90%
can be misleading if you have no predictive power on the other class of data!
Here are a few tactics to get over the hump:

1. Collect more data to even out the imbalances in the dataset.
2. Re-sample the dataset to correct for the imbalances.
3. Try a different algorithm altogether on your dataset.

What's important here is that you have a keen sense of the damage an
imbalanced dataset can cause, and of how to balance it.
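Re-sampling can be as simple as random oversampling of the minority class; a pure-Python sketch with invented data:

```python
import random

random.seed(0)
# Invented imbalanced data: 95 negatives (label 0) and 5 positives (label 1).
data = [(i, 0) for i in range(95)] + [(i, 1) for i in range(5)]

majority = [d for d in data if d[1] == 0]
minority = [d for d in data if d[1] == 1]

# Random oversampling: duplicate minority examples until the classes balance.
oversampled = majority + [random.choice(minority) for _ in range(len(majority))]

counts = {0: 0, 1: 0}
for _, label in oversampled:
    counts[label] += 1
print(counts)  # both classes now equally represented
```

Undersampling the majority class, or synthetic methods such as SMOTE, are the usual alternatives when duplicating minority rows risks overfitting them.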

**Ques: 12: Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?**

Ans:
Yes, rotation (orthogonal) is necessary because it maximizes the difference
between the variance captured by the components. This makes the components
easier to interpret. Don't forget, that's the motive for doing PCA: we aim to
select fewer components (than features) that can explain the maximum variance
in the data set. Rotation doesn't change the relative locations of the
components; it only changes the actual coordinates of the points.

If we don't rotate the components, the effect of PCA diminishes and we have to
select more components to explain the variance in the data set.
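The point about variance capture can be seen numerically: for a 2-D data set, the eigenvalues of the covariance matrix are exactly the variances captured by the two principal components. A sketch with made-up, strongly correlated data:

```python
import math

# Made-up, strongly correlated 2-D data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) ** 2 for x in xs) / n                     # var(x)
c = sum((y - my) ** 2 for y in ys) / n                     # var(y)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n   # cov(x, y)

# Eigenvalues of the 2x2 covariance matrix [[a, b], [b, c]]:
# each one is the variance captured by one principal component.
mid, off = (a + c) / 2, math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = mid + off, mid - off

print(round(lam1 / (lam1 + lam2), 3))  # share of variance on the first PC
```

With one component explaining almost all the variance, the remaining one can be dropped, which is exactly the selection the answer above describes.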

**Ques: 13. What’s the “kernel trick” and how is it useful?**

Ans:
The kernel trick involves kernel functions that enable algorithms to operate
in higher-dimensional spaces without explicitly calculating the coordinates of
points in those spaces: instead, kernel functions compute the inner products
between the images of all pairs of data in the feature space. This gives them
the very useful property of working with higher-dimensional coordinates while
being computationally cheaper than the explicit calculation of those
coordinates. Many algorithms can be expressed purely in terms of inner
products, so the kernel trick lets us effectively run them in a
high-dimensional space using only lower-dimensional data.
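The trick is easy to verify for a degree-2 polynomial kernel on 2-D vectors, where k(u, v) = (u . v)^2 equals the inner product of explicit 3-D feature maps (a small illustrative check):

```python
import math

def explicit_map(v):
    """Explicit degree-2 feature map for a 2-D vector (3-D output)."""
    x1, x2 = v
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def kernel(u, v):
    """Polynomial kernel k(u, v) = (u . v)^2: never builds the 3-D space."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot * dot

u, v = (1.0, 2.0), (3.0, 4.0)
phi_dot = sum(a * b for a, b in zip(explicit_map(u), explicit_map(v)))

# The kernel computes the same inner product without the explicit mapping.
print(kernel(u, v), phi_dot)
```

For an RBF kernel the implicit feature space is infinite-dimensional, so the kernel evaluation is the only practical option.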

**Ques: 14. Explain prior probability, likelihood and marginal likelihood in context of naiveBayes algorithm?**

Ans:
Prior probability is simply the proportion of the dependent (binary) variable
in the data set. It is the closest guess you can make about a class without
any further information. For example: in a data set, the dependent variable is
binary (1 and 0); the proportion of 1s (spam) is 70% and of 0s (not spam) is
30%. Hence, we can estimate a 70% chance that any new email will be classified
as spam.

Likelihood is the probability of classifying a given observation as 1 in the
presence of some other variable. For example: the probability that the word
'FREE' appears in a message known to be spam is a likelihood. Marginal
likelihood is the probability that the word 'FREE' appears in any message.
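These three quantities can be read directly off a toy data set (the counts below are invented so that 70% of the emails are spam, matching the example above):

```python
# Invented toy corpus: (contains_FREE, is_spam) for 10 emails.
emails = [(1, 1), (1, 1), (1, 1), (0, 1), (1, 1), (0, 1), (1, 1),
          (1, 0), (0, 0), (0, 0)]

n = len(emails)
prior = sum(s for _, s in emails) / n                     # P(spam) = 0.7
likelihood = (sum(f for f, s in emails if s == 1)
              / sum(s for _, s in emails))                # P(FREE | spam)
marginal = sum(f for f, _ in emails) / n                  # P(FREE)

# Bayes' rule ties them together: P(spam | FREE).
posterior = likelihood * prior / marginal
print(prior, marginal, round(posterior, 3))
```

Naive Bayes combines one such likelihood per word (assuming the words are conditionally independent) with the prior to score each class.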

**Ques: 15. Do you have experience with Spark or big data tools for machine learning?**

Ans: You’ll want to get familiar with what big data means at different
companies and the different tools they’ll expect. Spark is the big data tool
most in demand now, able to handle immense datasets with speed. Be honest if
you don’t have experience with the tools demanded, but also take a look at job
descriptions and see what tools pop up: you’ll want to invest in familiarizing
yourself with them.

**Ques: 16: You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?**

Ans:
Low bias occurs when the model's predicted values are close to the actual
values; in other words, the model is flexible enough to mimic the training
data distribution. While that sounds like a great achievement, don't forget
that a flexible model has poor generalization capability: when tested on
unseen data, it gives disappointing results.

In such situations, we can use a bagging algorithm (like Random Forest) to
tackle the high-variance problem. Bagging algorithms divide a data set into
subsets made with repeated randomized sampling. These samples are then used to
generate a set of models with a single learning algorithm, and the models'
predictions are combined using voting (classification) or averaging
(regression).

Also, to combat high variance, we can:

- Use a regularization technique, where larger model coefficients are
penalized, lowering model complexity.
- Use only the top n features from a variable-importance chart. Maybe, with
all the variables in the data set, the algorithm is having difficulty finding
the meaningful signal.
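Bagging's variance reduction can be sketched by hand: bootstrap-resample the data, fit a high-variance learner (here a 1-nearest-neighbour regressor) on each resample, and average the predictions. Everything below is an illustrative toy, not Random Forest itself:

```python
import random

random.seed(1)
# Noisy samples of y = 2x. A 1-nearest-neighbour regressor has low bias
# but high variance: it chases every noisy point.
train = [(x / 10, 2 * x / 10 + random.gauss(0, 0.5)) for x in range(30)]

def nn_predict(data, x):
    """1-NN regressor: return the y of the closest training point."""
    return min(data, key=lambda p: abs(p[0] - x))[1]

def bagged_predict(data, x, n_models=25):
    """Bagging: average 1-NN fits over bootstrap resamples of the data."""
    preds = []
    for _ in range(n_models):
        boot = [random.choice(data) for _ in range(len(data))]
        preds.append(nn_predict(boot, x))
    return sum(preds) / len(preds)

# Averaging many high-variance models smooths out their noise.
print(nn_predict(train, 1.5), bagged_predict(train, 1.5))
```

Random Forest adds one more variance-reducing idea on top of this: each tree also sees only a random subset of the features at each split.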

**Ques 17. Which data visualization libraries do you use? What are your thoughts on the best data visualization tools?**

Ans:
What’s important here is to define your views on how to properly visualize data
and your personal preferences when it comes to tools. Popular tools include R’s
ggplot, Python’s seaborn and matplotlib, and tools such as Plot.ly and Tableau.

**Ques: 18. How is kNN different from k-means clustering?**

Ans: Don’t be misled by the ‘k’ in their names. The fundamental difference
between these two algorithms is that k-means is unsupervised in nature while
kNN is supervised: k-means is a clustering algorithm, kNN a classification (or
regression) algorithm.

The k-means algorithm partitions a data set into clusters such that each
cluster is homogeneous: the points in each cluster are close to each other,
and the algorithm tries to maintain enough separation between the clusters.
Due to its unsupervised nature, the clusters have no labels.

The kNN algorithm classifies an unlabelled observation based on its k (any
number) nearest neighbours. It is also known as a lazy learner because it
involves minimal model training: rather than learning a general model from the
training data in advance, it defers the work until a new observation needs to
be classified.
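A compact side-by-side sketch (toy points; the two-centroid assignment is a single simplified k-means step):

```python
from collections import Counter

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9)]

def sqdist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

# k-means side (unsupervised): one assignment step against two centroids
# seeded from the data; the resulting groups carry no class names.
c0, c1 = points[0], points[3]
clusters = [0 if sqdist(p, c0) < sqdist(p, c1) else 1 for p in points]
print(clusters)

# kNN side (supervised): labels are given, and a new point is classified
# by a majority vote of its k nearest labelled neighbours.
labels = ["a", "a", "a", "b", "b"]
def knn(query, k=3):
    nearest = sorted(zip(points, labels), key=lambda pl: sqdist(pl[0], query))
    return Counter(lbl for _, lbl in nearest[:k]).most_common(1)[0][0]

print(knn((1.1, 1.0)))
```

Note that kNN does no work until `knn()` is called with a query (the "lazy" part), while k-means does all its work up front and never sees a label.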

**Ques: 19. Is it better to have too many false positives or too many false negatives? Explain.**

Ans: It depends on the question as well as on the domain in which we are
trying to solve the problem. If you’re using machine learning for medical
testing, then a false negative is very risky, since the report would show no
health problem when the person is actually unwell. Similarly, if machine
learning is used for spam detection, then a false positive is very risky,
because the algorithm may classify an important email as spam.

**Ques: 20. What is the difference between Gini Impurity and Entropy in a Decision Tree?**

Ans:
Gini impurity and entropy are the metrics used for deciding how to split a
decision tree.

The Gini measurement is the probability of a random sample being classified
incorrectly if you randomly pick a label according to the distribution in the
branch.

Entropy is a measurement of the lack of information. You calculate the
information gain (the difference in entropies) produced by making a split;
this measure helps reduce the uncertainty about the output label.
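Both measures follow directly from the class distribution in a branch; a small sketch with the standard formulas:

```python
import math

def gini(counts):
    """Gini impurity of a branch: 1 - sum(p_i ** 2)."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy of a branch in bits: -sum(p_i * log2(p_i))."""
    n = sum(counts)
    s = sum((c / n) * math.log2(c / n) for c in counts if c)
    return -s if s else 0.0

# A pure branch has zero impurity under both measures...
print(gini([10, 0]), entropy([10, 0]))  # 0.0 0.0
# ...and a 50/50 branch is maximally impure.
print(gini([5, 5]), entropy([5, 5]))    # 0.5 1.0
```

Information gain for a candidate split is the parent's entropy minus the weighted entropy of the children; the split with the largest gain is chosen.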
