Ques: 1. What is the difference between Data Science and Data Analytics?
Ans: Data scientists slice and explore data to extract valuable insights that a data analyst can then apply to real-world business scenarios. The main difference between the two is that data scientists usually have deeper technical knowledge than data analysts, while data analysts typically need a stronger understanding of the business in order to turn data into visualizations and decisions.
Ques: 2. What is the method to collect and analyse social media data to predict the weather?
Ans: You can collect social media data using the Facebook, Twitter, or Instagram APIs. For example, from Twitter you can construct features from each tweet, such as the tweet date, the number of retweets, and the author's follower count. You can then use a multivariate time series model on these features to predict the weather condition.
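As a rough sketch, turning raw tweets into a flat feature table might look like the following. The field names and values here are hypothetical, not a real Twitter API schema:

```python
from datetime import datetime

def tweet_features(tweet):
    """Turn one raw tweet (a dict with an assumed schema) into a feature row."""
    return {
        "date": datetime.fromisoformat(tweet["created_at"]).date(),
        "retweets": tweet["retweet_count"],
        "followers": tweet["user_followers"],
        "mentions_rain": int("rain" in tweet["text"].lower()),
    }

tweets = [
    {"created_at": "2024-06-01T09:30:00", "retweet_count": 12,
     "user_followers": 340, "text": "Heavy rain in Pune this morning"},
    {"created_at": "2024-06-01T11:00:00", "retweet_count": 3,
     "user_followers": 55, "text": "Lovely sunny day"},
]

rows = [tweet_features(t) for t in tweets]
# After aggregating such rows per day, they can feed a multivariate
# time series model (e.g. a vector autoregression).
```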
Ques: 3. What is Cross-Validation?
Ans: It is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside part of the data to test the model during the training phase (i.e., a validation data set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.
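The rotation behind k-fold cross-validation can be sketched in plain Python: hold out one fold for validation while training on the rest, cycling through all k folds. This is a minimal illustration, not a production implementation:

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k (train, validation) pairs."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield sorted(train), sorted(val)

# Every sample appears in exactly one validation set:
for train_idx, val_idx in k_fold_indices(6, 3):
    print(train_idx, val_idx)
```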
Ques: 4. What are the Steps in Making a “Decision Tree”?
Ans: The steps to
make a “Decision Tree” are as follows:
- Take the entire data set as input.
- Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
- Apply the split to the input data (divide step).
- Re-apply steps 1 to 2 to the divided data.
- Stop when you meet some stopping criteria.
- Clean up the tree if you went too far doing splits; this step is called pruning.
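The split-search step above can be illustrated with a toy Gini-impurity criterion on a single numeric feature (a minimal sketch; real implementations search over many features and handle weighting):

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure set)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try every threshold on one feature; return the split that minimises
    the weighted impurity of the two resulting sets."""
    best = None
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# The feature perfectly separates the classes at <= 2:
print(best_split([1, 2, 3, 4], ["A", "A", "B", "B"]))
```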
Ques: 5. Can you explain Star Schema?
Ans: It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to retrieve information faster.
Ques: 6. What are the
various steps for a Data analytics project?
Ans:
The following are important steps involved in an analytics project:
- Understand the business problem.
- Explore the data and study it carefully.
- Prepare the data for modelling by finding missing values and transforming variables.
- Run the model and analyse the results.
- Validate the model with a new data set.
- Implement the model and track the results to analyse its performance over a specific period.
Ques: 7. Why is data cleansing essential, and which method do you use to maintain clean data? Explain.
Ans: Dirty data often leads to incorrect insights, which can damage the prospects of any organization. For example, suppose you want to run a targeted marketing campaign, but your data incorrectly tells you that a specific product will be in demand with your target audience; the campaign will fail.
Ques:
8. What is reinforcement learning?
Ans: Reinforcement learning is a learning mechanism for mapping situations to actions so as to maximize a numerical reward signal. The learner is not told which action to take but must instead discover which actions yield the maximum reward by trying them; the method is based on a reward/penalty mechanism.
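A minimal sketch of this reward-driven discovery is an epsilon-greedy multi-armed bandit. The arms and their payoffs below are invented for illustration, and the rewards are deterministic to keep the example simple:

```python
import random

def epsilon_greedy_bandit(rewards, steps=500, epsilon=0.1, seed=0):
    """Learn which arm pays best purely from the reward signal.
    rewards[a] is the (deterministic, toy) payoff of pulling arm a."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)
    counts = [0] * len(rewards)
    for _ in range(steps):
        if rng.random() < epsilon:                      # explore a random arm
            arm = rng.randrange(len(rewards))
        else:                                           # exploit current best guess
            arm = max(range(len(rewards)), key=lambda a: estimates[a])
        r = rewards[arm]                                # environment returns a reward
        counts[arm] += 1
        estimates[arm] += (r - estimates[arm]) / counts[arm]  # running mean
    return estimates

# The learner is never told that arm 1 is best; it discovers this by trial.
est = epsilon_greedy_bandit([0.2, 1.0, 0.5])
```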
Ques: 9. While
working on a data set, how can you select important variables? Explain.
Ans: You can use the following methods of variable selection:
- Remove correlated variables before selecting important variables.
- Use linear regression and select variables based on their p-values.
- Use backward elimination, forward selection, or stepwise selection.
- Use XGBoost or Random Forest and plot a variable importance chart.
- Measure the information gain for the given set of features and select the top n features accordingly.
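The last bullet can be made concrete: information gain is the drop in label entropy after splitting on a feature. A toy sketch with made-up data:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    n = len(labels)
    split = {}
    for x, y in zip(feature, labels):
        split.setdefault(x, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

labels  = ["yes", "yes", "no", "no"]
weather = ["sun", "sun", "rain", "rain"]   # perfectly predictive feature
wind    = ["low", "high", "low", "high"]   # uninformative feature

print(information_gain(weather, labels))   # 1.0
print(information_gain(wind, labels))      # 0.0
```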
Ques: 10. What
cross-validation technique would you use on a time series dataset?
Ans: Instead of using k-fold cross-validation, you should be aware of the fact that a time series is not randomly distributed data; it is inherently ordered chronologically.
For time series data, you should use a technique like forward chaining, where you build the model on past data and then test it on the data that follows:
fold 1: training [1], test [2]
fold 2: training [1 2], test [3]
fold 3: training [1 2 3], test [4]
fold 4: training [1 2 3 4], test [5]
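The fold listing above can be generated with a few lines of Python (a minimal sketch using integer block labels for the time-ordered chunks):

```python
def forward_chaining_folds(blocks):
    """Yield (train, test) pairs in chronological order: train on every
    block up to position i, then test on the block that follows."""
    for i in range(1, len(blocks)):
        yield blocks[:i], blocks[i]

for train, test in forward_chaining_folds([1, 2, 3, 4, 5]):
    print("training", train, "test", test)
```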
Ques: 11. What is
deep learning?
Ans: Deep learning is a subfield of machine learning inspired by the structure and function of the brain, called artificial neural networks. Machine learning covers a large number of algorithms such as linear regression, SVMs, neural networks, etc., and deep learning is an extension of neural networks. In classic neural nets we use a small number of hidden layers, but deep learning algorithms use a large number of hidden layers to better model the input-output relationship.
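The stacking of hidden layers can be sketched with toy weights (the numbers below are invented for illustration; a real network learns them from data):

```python
def relu(vector):
    # Element-wise non-linearity applied after each hidden layer.
    return [max(0.0, x) for x in vector]

def dense(inputs, weights, biases):
    # weights[j] holds the incoming weights of output unit j.
    return [sum(x * w for x, w in zip(inputs, weights[j])) + biases[j]
            for j in range(len(biases))]

# A "deep" network is simply many such layers stacked one after another:
x = [1.0, 2.0]
h1 = relu(dense(x, [[0.5, -1.0], [1.0, 1.0]], [0.0, 0.0]))   # hidden layer 1
h2 = relu(dense(h1, [[1.0, 1.0]], [0.1]))                    # hidden layer 2
print(h2)
```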
Ques: 12. What is the
difference between machine learning and deep learning?
Ans: Machine learning:
Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. Machine learning can be categorized into the following three categories:
- Supervised machine learning
- Unsupervised machine learning
- Reinforcement learning
Deep learning:
Deep learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks.
Ques: 13. What is
selection bias?
Ans: Selection bias is the bias introduced by the
selection of individuals, groups or data for analysis in such a way that proper
randomization is not achieved, thereby ensuring that the sample obtained is not
representative of the population intended to be analysed. It is sometimes
referred to as the selection effect. The phrase “selection bias” most often
refers to the distortion of a statistical analysis, resulting from the method
of collecting samples. If the selection bias is not considered, then some
conclusions of the study may not be accurate.
Ques: 14. What is
TF/IDF vectorization?
Ans: TF–IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The TF–IDF value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which adjusts for the fact that some words appear more frequently in general.
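The statistic can be computed directly for a toy corpus. This is a bare sketch of the definition; libraries such as scikit-learn add smoothing and normalisation on top:

```python
from math import log
from collections import Counter

def tf_idf(docs):
    """TF-IDF weight of each word in each tokenised document."""
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (tf[w] / len(doc)) * log(n / df[w]) for w in tf})
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
w = tf_idf(docs)
# "the" occurs in every document, so its weight is zero; rarer words score higher.
```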
Ques: 15. What is the
difference between Regression and classification ML techniques?
Ans: Both regression and classification techniques come under supervised machine learning algorithms. In supervised machine learning, we train the model on a labelled dataset: during training we explicitly provide the correct labels, and the algorithm tries to learn the pattern from input to output. If the labels are discrete values (e.g. A, B, etc.), it is a classification problem; if the labels are continuous values (e.g. 1.23, 1.333, etc.), it is a regression problem.
Ques: 16. What is
p-value?
Ans: When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. A p-value is a number between 0 and 1, and its value indicates the strength of the evidence. The claim that is on trial is called the null hypothesis.
A low p-value (≤ 0.05) indicates evidence against the null hypothesis, which means we can reject the null hypothesis. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it; a p-value around 0.05 is marginal and could go either way. To put it another way:
High p-values: your data are likely under a true null.
Low p-values: your data are unlikely under a true null.
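For a z-statistic, the two-sided p-value can be computed from the standard normal distribution using only the standard library, a quick sketch of the arithmetic behind the usual 0.05 threshold:

```python
from math import erfc, sqrt

def two_sided_p_value(z):
    """Probability of seeing a z-statistic at least this extreme
    (in either direction) if the null hypothesis is true."""
    return erfc(abs(z) / sqrt(2))

print(round(two_sided_p_value(1.96), 3))   # around 0.05: borderline
print(round(two_sided_p_value(0.5), 3))    # large: consistent with the null
print(round(two_sided_p_value(3.0), 4))    # small: evidence against the null
```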
Ques: 17. What are
the differences between overfitting and underfitting?
Ans:
In order to make reliable predictions on general untrained data in machine
learning and statistics, it is required to fit a (machine learning) model to a
set of training data. Overfitting and underfitting are two of the most common modelling
errors that occur while doing so.
Following are the differences between overfitting
and underfitting:
Definition – A statistical model suffering from overfitting describes random error or noise in place of the underlying relationship. When underfitting occurs, a statistical model or machine learning algorithm fails to capture the underlying trend of the data.
Occurrence – When a statistical model or machine learning algorithm is excessively complex, it can result in overfitting. An example of a complex model is one with too many parameters compared to the total number of observations. Underfitting occurs, for example, when trying to fit a linear model to non-linear data.
Poor predictive performance – Although both overfitting and underfitting yield poor predictive performance, each does so in a different way. While an overfitted model overreacts to minor fluctuations in the training data, an underfit model under-reacts to even bigger fluctuations.
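The contrast can be made concrete with two deliberately bad models on toy data (all numbers invented for illustration): one that memorises the training set and one that ignores its structure entirely:

```python
train = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)]
test  = [(1.5, 1.5), (2.5, 2.5), (3.5, 3.5)]

# Overfit model: memorise every training point exactly (zero training error),
# but it has no sensible answer for unseen inputs.
lookup = dict(train)
def overfit(x):
    return lookup.get(x, 0.0)          # falls apart off the training set

# Underfit model: predict the training mean everywhere, ignoring the trend.
mean_y = sum(y for _, y in train) / len(train)
def underfit(x):
    return mean_y

def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(mse(overfit, train))   # 0.0: perfect on the training data
print(mse(overfit, test))    # much worse: it fails to generalise
```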
Ques: 18. Could you explain
the role of data cleaning in data analysis?
Ans:
Data cleaning can be a daunting task: as the number of data sources increases, the time required to clean the data increases considerably because of the volume of data the additional sources generate. Data cleaning alone can take up to 80% of the total time required for carrying out a data analysis task.
Nevertheless, there are several reasons for using
data cleaning in data analysis. Two of the most important ones are:
- Cleaning data from different sources helps in transforming the data into a format that is easy to work with.
- Data cleaning increases the accuracy of a machine learning model.
Ques: 19. Can you
explain Recommender Systems along with an application?
Ans:
Recommender systems are a subclass of information filtering systems, meant for predicting the preferences or ratings a user would award to a product.
An application of a recommender system is the product recommendations section on Amazon, which contains items based on the user's search history and past orders.
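A tiny user-based collaborative filtering sketch shows the core idea of scoring unseen items via similar users. The users, items, and ratings below are made up for illustration:

```python
from math import sqrt

# Toy user-item rating matrix (hypothetical data).
ratings = {
    "alice": {"book": 5, "laptop": 4},
    "bob":   {"book": 5, "laptop": 5, "headphones": 4},
    "carol": {"headphones": 5, "laptop": 2},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def recommend(user):
    """Score items the user has not rated by the ratings of similar users."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))   # items alice has not rated, best first
```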
Ques: 20. What is
exploding gradients?
Ans: “Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.” At an extreme, the weight values can become so large that they overflow and result in NaN values.
This makes your model unstable and unable to learn from your training data.
Gradient: the direction and magnitude calculated during training of a neural network, used to update the network weights in the right direction and by the right amount.
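A common defence against exploding gradients is gradient clipping, sketched here in plain Python: rescale the gradient vector whenever its L2 norm exceeds a threshold.

```python
from math import sqrt

def clip_by_norm(gradients, max_norm=1.0):
    """Rescale a gradient vector whose L2 norm exceeds max_norm,
    leaving well-behaved gradients untouched."""
    norm = sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return gradients
    return [g * max_norm / norm for g in gradients]

print(clip_by_norm([3.0, 4.0]))   # norm 5.0, scaled down to norm 1.0
print(clip_by_norm([0.3, 0.4]))   # small gradients pass through unchanged
```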