
**Ques: 1. What is the difference between Data Science and Data Analytics?**

**Answer:**

Data scientists slice and analyse data to extract valuable insights that a data analyst can then apply to real-world business scenarios. The main difference between the two is that data scientists have more technical knowledge than data analysts, and they are less involved in the business understanding required for data visualization.

**Ques: 2. How can you collect and analyse social media data to predict the weather condition?**

**Answer:**

You can collect social media data using the Facebook, Twitter, or Instagram APIs. For example, from Twitter you can construct features from each tweet, such as the tweet date, number of retweets, list of followers, etc. Then you can use a multivariate time series model to predict the weather condition.
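As a rough sketch, assuming each tweet arrives as a dictionary (the field names below are illustrative, not the real API schema), you might aggregate per-day features before fitting a time series model:

```python
from collections import defaultdict
from datetime import date

# Hypothetical tweets as returned by a social media API
# (field names are made up for illustration).
tweets = [
    {"date": date(2023, 6, 1), "text": "so humid today", "retweets": 3},
    {"date": date(2023, 6, 1), "text": "rain incoming",  "retweets": 10},
    {"date": date(2023, 6, 2), "text": "clear skies",    "retweets": 1},
]

# Aggregate simple per-day features: tweet count and total retweets.
features = defaultdict(lambda: {"tweet_count": 0, "total_retweets": 0})
for t in tweets:
    day = features[t["date"]]
    day["tweet_count"] += 1
    day["total_retweets"] += t["retweets"]

for day, feats in sorted(features.items()):
    print(day, feats)
```

These daily feature vectors would then be aligned with observed weather readings and fed to a multivariate time series model.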

**Ques: 3. What is Cross-Validation?**

**Answer:**

It is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a portion of the data to test the model during the training phase (the validation data set), in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.
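A minimal plain-Python sketch of the k-fold splitting that underlies cross-validation (any model-fitting and scoring logic could be plugged into the loop):

```python
def kfold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k consecutive folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k=5):
    """Return (train, validation) index pairs: each fold is held out once."""
    folds = kfold_indices(len(data), k)
    splits = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train_idx, val_idx))
    return splits

for train, val in cross_validate(list(range(6)), k=3):
    print("train:", train, "validation:", val)
```

In practice you would train the model on each training split and average its score across the held-out folds; libraries such as scikit-learn provide this directly.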

**Ques: 4. What
are the Steps in Making a “Decision Tree”?**

**Answer:**

The steps to make a
“Decision Tree” are as follows:

- Take the entire data set as input.
- Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
- Apply the split to the input data (divide step).
- Re-apply steps 1 and 2 to the divided data.
- Stop when you meet a stopping criterion.
- Clean up the tree if you went too far doing splits. This step is called pruning.
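The split-search step can be sketched with a Gini-impurity scan over candidate thresholds, a toy version of what decision tree libraries do internally:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Find the threshold on one feature minimizing weighted Gini impurity."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Perfectly separable toy data: class 0 at low values, class 1 at high values.
xs = [1, 2, 3, 7, 8, 9]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # threshold 3 separates the classes with impurity 0.0
```

Recursively applying `best_split` to each resulting subset, until a stopping criterion is met, yields the tree described above.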

**Ques: 5. Can
you explain Star Schema?**

**Answer:**

It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be joined to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to retrieve information faster.
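As a toy illustration with plain Python dictionaries standing in for tables (the table and column names are made up): the fact table stores only compact IDs and measures, and the lookup tables resolve the IDs to descriptions.

```python
# Central fact table: only compact IDs and measures are stored.
fact_sales = [
    {"product_id": 1, "store_id": 10, "units": 5},
    {"product_id": 2, "store_id": 10, "units": 2},
]

# Satellite lookup (dimension) tables: ID -> descriptive attributes.
products = {1: "Laptop", 2: "Phone"}
stores = {10: "Downtown"}

# "Join" the fact table to its lookup tables via the ID fields.
report = [
    {"product": products[row["product_id"]],
     "store": stores[row["store_id"]],
     "units": row["units"]}
    for row in fact_sales
]
print(report)
```

Storing the ID once in each fact row, instead of repeating the full description, is what saves memory at scale.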

**Ques: 6. What
are the various steps for a Data analytics project?**

**Answer:**

The following are the important steps involved in an analytics project:

- Understand the business problem.
- Explore the data and study it carefully.
- Prepare the data for modelling by finding missing values and transforming variables.
- Run the model and analyse the results.
- Validate the model with a new data set.
- Implement the model and track the results to analyse the performance of the model over a specific period.
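The data preparation step, finding missing values and transforming variables, might look like this in plain Python (pandas would normally be used, but the logic is the same; the data is made up):

```python
import math

# Toy raw data with a missing value (None) in the 'income' column.
rows = [
    {"age": 25, "income": 30000},
    {"age": 40, "income": None},
    {"age": 35, "income": 50000},
]

# 1. Find the rows with missing values.
missing = [i for i, r in enumerate(rows) if r["income"] is None]
print("rows with missing income:", missing)

# 2. Impute them with the mean of the observed values.
observed = [r["income"] for r in rows if r["income"] is not None]
mean_income = sum(observed) / len(observed)
for i in missing:
    rows[i]["income"] = mean_income

# 3. Transform a skewed variable, e.g. a log transform of income.
for r in rows:
    r["log_income"] = math.log(r["income"])
```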

**Ques: 7. Why is data cleansing essential, and which methods do you use to maintain clean data? Explain.**

**Answer:**

Dirty data often leads to incorrect insights, which can damage the prospects of any organization. For example, suppose you want to run a targeted marketing campaign, but your data incorrectly tells you that a specific product will be in demand with your target audience; the campaign will fail.

**Ques: 8. What
is reinforcement learning?**

**Answer:**

Reinforcement Learning is a learning mechanism for mapping situations to actions so as to maximize a numerical reward signal. In this method, the learner is not told which action to take but instead must discover which actions yield the maximum reward; the method is based on a reward/penalty mechanism.
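A minimal tabular Q-learning sketch on a toy corridor (reward only at the right end) illustrates the reward/penalty mechanism; the environment and all hyperparameters are made up for illustration:

```python
import random

random.seed(0)
N_STATES = 4                 # corridor cells 0..3; reaching 3 ends the episode
ACTIONS = ["left", "right"]
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment: move along the corridor; reward 1 only at the goal."""
    nxt = max(state - 1, 0) if action == "left" else min(state + 1, N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

for _ in range(300):                      # episodes of trial and error
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        target = reward + (0.0 if done else GAMMA * max(Q[(nxt, a)] for a in ACTIONS))
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = nxt

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)  # the learner discovers that moving right maximizes reward
```

Nothing ever tells the agent "go right"; the preference emerges purely from the reward signal accumulated through trial and error.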

**Ques: 9. While
working on a data set, how can you select important variables? Explain.**

**Answer:**

The following methods of variable selection can be used:

- Remove correlated variables before selecting the important ones.
- Fit a linear regression and select variables based on their p-values.
- Use backward elimination, forward selection, or stepwise selection.
- Use XGBoost or Random Forest and plot a variable importance chart.
- Measure the information gain for the given set of features and select the top n features accordingly.
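The information gain approach in the last bullet can be sketched in plain Python (toy binary features; real pipelines would use a library implementation):

```python
import math

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(feature_values, labels):
    """Entropy reduction from splitting the labels on a feature."""
    total = entropy(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        cond += len(subset) / len(labels) * entropy(subset)
    return total - cond

labels = [1, 1, 0, 0]
informative = [1, 1, 0, 0]   # perfectly predicts the label
noise = [1, 0, 1, 0]         # unrelated to the label
print(information_gain(informative, labels))  # 1.0 bit
print(information_gain(noise, labels))        # 0.0 bits
```

Ranking all features by this score and keeping the top n implements the last bullet directly.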

**Ques: 10. What
cross-validation technique would you use on a time series dataset?**

**Answer:**

Instead of using k-fold cross-validation, you should be aware of the fact that a time series is not randomly distributed data; it is inherently ordered chronologically.

In the case of time series data, you should use techniques like forward chaining, where you build the model on past data and then test on forward-facing data:

fold 1: training [1], test [2]

fold 2: training [1 2], test [3]

fold 3: training [1 2 3], test [4]

fold 4: training [1 2 3 4], test [5]
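The folds above can be generated programmatically; this plain-Python sketch mirrors what scikit-learn's TimeSeriesSplit provides:

```python
def forward_chaining_folds(n_periods):
    """Expanding-window splits: train on all past periods, test on the next."""
    folds = []
    for i in range(1, n_periods):
        train = list(range(1, i + 1))   # periods 1..i
        test = [i + 1]                  # the following period
        folds.append((train, test))
    return folds

for k, (train, test) in enumerate(forward_chaining_folds(5), start=1):
    print(f"fold {k}: training {train}, test {test}")
```

Because each test period lies strictly after its training window, the evaluation never leaks future information into the model.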

**Ques: 11. What
is deep learning?**

**Answer:**

Deep learning is a subfield of machine learning inspired by the structure and function of the brain, namely artificial neural networks. Machine learning includes many algorithms, such as linear regression, SVMs, neural networks, etc., and deep learning is an extension of neural networks. In ordinary neural networks we use a small number of hidden layers, but in deep learning we use a large number of hidden layers to better model the input–output relationship.
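A toy forward pass through a stack of hidden layers shows the idea; the weights below are arbitrary hand-picked constants, not trained values:

```python
def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in v]

def dense(inputs, weights, biases):
    """Fully connected layer; weights[j] holds the weights into output unit j."""
    return [sum(x * w for x, w in zip(inputs, wj)) + bj
            for wj, bj in zip(weights, biases)]

# A small "deep" stack: two hidden layers, then a linear output layer.
hidden_layers = [
    ([[0.5, 0.5], [0.5, -0.5]], [0.0, 0.0]),
    ([[1.0, 1.0], [0.5, 0.5]], [0.0, 0.0]),
]
output_layer = ([[1.0, 1.0]], [0.0])

x = [1.0, 2.0]
for weights, biases in hidden_layers:
    x = relu(dense(x, weights, biases))
output = dense(x, *output_layer)
print(output)  # [2.25]
```

Deep learning frameworks stack many such layers and learn the weights by backpropagation instead of hand-picking them.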

**Ques: 12. What
is the difference between machine learning and deep learning?**

**Answer:**

Machine learning:

Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. Machine learning can be divided into the following three categories:

- Supervised machine learning,
- Unsupervised machine learning,
- Reinforcement learning

Deep learning:

Deep Learning is a subfield
of machine learning concerned with algorithms inspired by the structure and
function of the brain called artificial neural networks.

**Ques: 13. What
is selection bias?**

**Answer:**

Selection bias is the bias
introduced by the selection of individuals, groups or data for analysis in such
a way that proper randomization is not achieved, thereby ensuring that the sample
obtained is not representative of the population intended to be analysed. It is
sometimes referred to as the selection effect. The phrase “selection bias” most
often refers to the distortion of a statistical analysis, resulting from the
method of collecting samples. If the selection bias is not considered, then
some conclusions of the study may not be accurate.

**Ques: 14. What
is TF/IDF vectorization?**

**Answer:**

TF–IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The TF–IDF value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps adjust for the fact that some words appear more frequently in general.
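A from-scratch sketch of the statistic (using one common smoothed IDF variant; library implementations such as scikit-learn's TfidfVectorizer differ in details like normalization):

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    """Term frequency: share of the document's tokens that are this term."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency, log-scaled and smoothed."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + containing)) + 1

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# "sat" and "cat" appear once each in doc 0, but "cat" is rarer in the corpus,
# so it receives the higher weight.
print(tf_idf("sat", tokenized[0], tokenized))
print(tf_idf("cat", tokenized[0], tokenized))
```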

**Ques: 15. What
is the difference between Regression and classification ML techniques?**

**Answer:**

Both regression and classification machine learning techniques come under supervised machine learning algorithms. In supervised machine learning, we must train the model using a labelled dataset; during training we explicitly provide the correct labels, and the algorithm tries to learn the mapping from input to output. If the labels are discrete values, it is a classification problem, e.g. A, B, etc.; if the labels are continuous values, it is a regression problem, e.g. 1.23, 1.333, etc.

**Ques: 16. What
is p-value?**

**Answer:**

When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. The p-value is a number between 0 and 1 whose value indicates the strength of the evidence. The claim that is on trial is called the null hypothesis.

A low p-value (≤ 0.05) indicates evidence against the null hypothesis, which means we can reject the null hypothesis. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it. A p-value near 0.05 is marginal, and the decision could go either way. To put it another way:

High p-values: your data are likely under a true null.

Low p-values: your data are unlikely under a true null.
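A p-value can be computed from first principles with an exact permutation test; the group values below are made up for illustration:

```python
from itertools import combinations

group_a = [10.0, 11.0, 12.0, 13.0]   # e.g. treatment measurements
group_b = [0.0, 1.0, 2.0, 3.0]       # e.g. control measurements
pooled = group_a + group_b
observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

# Null hypothesis: the group labels are exchangeable. Enumerate every
# relabelling and count how often the mean difference is at least as
# extreme as the one observed.
count = total = 0
for idx in combinations(range(len(pooled)), len(group_a)):
    a = [pooled[i] for i in idx]
    b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    diff = sum(a) / len(a) - sum(b) / len(b)
    if diff >= observed:
        count += 1
    total += 1

p_value = count / total
print(p_value)  # 1/70 ≈ 0.014: strong evidence against the null
```

Only 1 of the 70 possible relabellings reproduces a difference this large, so under a true null these data would be very unlikely.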

**Ques: 17. What
are the differences between overfitting and underfitting?**

**Answer:**

In order to make reliable
predictions on general untrained data in machine learning and statistics, it is
required to fit a (machine learning) model to a set of training data.
Overfitting and underfitting are two of the most common modelling errors that
occur while doing so.

Following are the differences between overfitting and underfitting:

Definition - A statistical model suffering from overfitting describes some random error or noise in place of the underlying relationship. When underfitting occurs, a statistical model or machine learning algorithm fails in capturing the underlying trend of the data.

Occurrence – When a statistical model or machine learning algorithm is excessively complex, it can result in overfitting. Example of a complex model is one having too many parameters when compared to the total number of observations. Underfitting occurs when trying to fit a linear model to non-linear data.

Poor Predictive Performance – Although both overfitting and underfitting yield poor predictive performance, the way in which each one of them does so is different. While the overfitted model overreacts to minor fluctuations in the training data, the underfit model under-reacts to even bigger fluctuations.
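As a rough diagnostic heuristic (the thresholds are illustrative, not standard values), the pattern of training versus test error distinguishes the two:

```python
def diagnose(train_error, test_error, tolerance=0.1):
    """Rough fit diagnosis from train/test error; thresholds are illustrative."""
    if train_error > tolerance:              # cannot even fit the training data
        return "underfitting"
    if test_error > train_error + tolerance:  # fits training, fails to generalize
        return "overfitting"
    return "good fit"

print(diagnose(0.02, 0.45))  # overfitting
print(diagnose(0.40, 0.42))  # underfitting
print(diagnose(0.05, 0.08))  # good fit
```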

**Ques: 18. Could
you explain the role of data cleaning in data analysis?**

**Answer:**

Data cleaning can be a daunting task: as the number of data sources increases, the time required for cleaning the data increases at an exponential rate, owing to the vast volume of data those additional sources generate. Data cleaning alone can take up to 80% of the total time required for carrying out a data analysis task.

Nevertheless, there are
several reasons for using data cleaning in data analysis. Two of the most
important ones are:

- Cleaning data from different
sources helps in transforming the data into a format that is easy to work
with.
- Data cleaning increases the
accuracy of a machine learning model.
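A small sketch of the first point: records from different hypothetical sources arrive in inconsistent formats, and cleaning normalizes them into one shape that is easy to work with (the field names are made up):

```python
# Raw records from two hypothetical sources with inconsistent formats.
source_a = [{"name": "  Alice ", "signup": "2023-06-01"}]
source_b = [{"NAME": "BOB", "signup": "01/06/2023"}]

def clean_a(rec):
    """Source A: trim whitespace, normalize capitalization."""
    return {"name": rec["name"].strip().title(), "signup": rec["signup"]}

def clean_b(rec):
    """Source B: different key casing and day/month/year dates."""
    day, month, year = rec["signup"].split("/")
    return {"name": rec["NAME"].strip().title(),
            "signup": f"{year}-{month}-{day}"}   # normalize to ISO dates

cleaned = [clean_a(r) for r in source_a] + [clean_b(r) for r in source_b]
print(cleaned)
```

After cleaning, both sources share one schema and one date format, so downstream analysis or model training can treat them uniformly.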

**Ques: 19. Can
you explain Recommender Systems along with an application?**

**Answer:**

Recommender systems are a subclass of information filtering systems, meant for predicting the preferences or ratings a user would award to a product.

An application of a
recommender system is the product recommendations section in Amazon. This
section contains items based on the user’s search history and past orders.
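A minimal user-based collaborative filtering sketch (toy ratings, cosine similarity in plain Python) shows the kind of logic behind such a recommendations section:

```python
import math

# Toy user -> {item: rating} data, made up for illustration.
ratings = {
    "alice": {"laptop": 5, "phone": 3, "tablet": 4},
    "bob":   {"laptop": 5, "phone": 3, "camera": 5},
    "carol": {"laptop": 1, "phone": 5, "camera": 2},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(user):
    """Suggest the unseen item rated highest by the most similar user."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    unseen = {i: r for i, r in ratings[nearest].items()
              if i not in ratings[user]}
    return max(unseen, key=unseen.get) if unseen else None

print(recommend("alice"))  # bob rates like alice and also bought a camera
```

Production systems replace this nearest-neighbour lookup with matrix factorization or learned embeddings, but the "users like you also liked" idea is the same.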

**Ques: 20. What are exploding gradients?**

**Answer:**

Exploding gradients are a problem where large error gradients accumulate and result in very large updates to the neural network model weights during training. At an extreme, the values of the weights can become so large as to overflow and result in NaN values.

This has the effect of your
model being unstable and unable to learn from your training data.

Gradient: Gradient is the direction and magnitude calculated during training of a neural network that is used to update the network weights in the right direction and by the right amount.
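A common remedy is gradient clipping: rescale the gradient vector whenever its norm exceeds a threshold, keeping the update direction but bounding its magnitude. A plain-Python sketch (the threshold is illustrative):

```python
import math

def clip_by_norm(gradient, max_norm=1.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return gradient
    return [g * max_norm / norm for g in gradient]

exploded = [300.0, -400.0]   # norm 500: would cause a huge weight update
clipped = clip_by_norm(exploded)
print(clipped)               # [0.6, -0.8]: same direction, bounded magnitude
```

Deep learning frameworks offer this as a built-in option, applied to the gradients of all weights before each update step.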