December 23, 2019

Top 20 Data Science Interview Questions and Answers


Ques: 1. What is the difference between Data Science and Data Analytics?


Data scientists slice and analyse data to extract valuable insights that a data analyst can then apply to real-world business scenarios. The main difference between the two is that data scientists have deeper technical knowledge than business analysts, while analysts rely more on business understanding and data visualization.


Ques: 2. How would you collect and analyse social media data to predict weather conditions?


You can collect social media data using the Facebook, Twitter, or Instagram APIs. For example, from Twitter you can construct features from each tweet, such as the tweet date, number of retweets, list of followers, and so on. You can then use a multivariate time series model to predict the weather condition.
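As a hypothetical sketch of the feature-construction step (the field names and weather keyword list here are made up for illustration), tweet-level records can be aggregated into one feature vector per day before feeding a time series model:

```python
# Hypothetical sketch: aggregating tweet-level signals into daily features
# that a multivariate time series model could consume. Field names invented.
from collections import defaultdict
from datetime import date

tweets = [
    {"date": date(2019, 12, 1), "text": "so much rain today", "retweets": 3},
    {"date": date(2019, 12, 1), "text": "flooded streets again", "retweets": 7},
    {"date": date(2019, 12, 2), "text": "sunny and clear", "retweets": 1},
]

weather_words = {"rain", "flood", "flooded", "storm", "sunny", "snow"}

# One feature vector per day: volume, engagement, and weather-related mentions.
daily = defaultdict(lambda: {"tweet_count": 0, "retweets": 0, "weather_mentions": 0})
for t in tweets:
    d = daily[t["date"]]
    d["tweet_count"] += 1
    d["retweets"] += t["retweets"]
    d["weather_mentions"] += sum(w in weather_words for w in t["text"].split())
```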


Ques: 3. What is Cross-Validation?


It is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting, and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a portion of the data to test the model during the training phase (i.e., the validation set), in order to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
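A minimal sketch of k-fold cross-validation with scikit-learn: the data is split into 5 folds, and each fold serves once as the held-out validation set while the model trains on the rest.

```python
# 5-fold cross-validation: cross_val_score returns one accuracy per fold.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # one score per held-out fold
```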


Ques: 4. What are the Steps in Making a “Decision Tree”?


The steps to make a “Decision Tree” are as follows:

  1. Take the entire data set as input.
  2. Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
  3. Apply the split to the input data (divide step).
  4. Re-apply steps 1 to 2 to the divided data.
  5. Stop when you meet some stopping criterion. 
  6. Clean up the tree if the splits went too far. This step is called pruning.
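The steps above are what tree learners such as scikit-learn's DecisionTreeClassifier perform internally; here max_depth acts as the stopping criterion that limits how far the recursive splitting goes.

```python
# Fit a depth-limited decision tree; max_depth is the stopping criterion.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

depth = tree.get_depth()  # the stopping criterion bounds the final depth
```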


Ques: 5. Can you explain Star Schema?


It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be joined to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to retrieve information faster.
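A toy star schema sketched in pandas (the table and column names are illustrative): the central fact table stores only IDs and measures, and joining on the ID fields against the small lookup tables recovers the human-readable view.

```python
# A central fact table plus two lookup (dimension) tables, joined on IDs.
import pandas as pd

fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id":   [10, 10, 20],
    "units":      [5, 3, 8],
})
dim_product = pd.DataFrame({"product_id": [1, 2], "product": ["Tea", "Coffee"]})
dim_store   = pd.DataFrame({"store_id": [10, 20], "city": ["Delhi", "Mumbai"]})

# Joining the ID fields maps IDs back to names and descriptions.
report = fact_sales.merge(dim_product, on="product_id").merge(dim_store, on="store_id")
```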


Ques: 6. What are the various steps for a Data analytics project?


The following are important steps involved in an analytics project:

  1. Understand the business problem. 
  2. Explore the data and study it carefully. 
  3. Prepare the data for modelling by treating missing values and transforming variables. 
  4. Run the model and analyse the results. 
  5. Validate the model with a new data set. 
  6. Implement the model and track the results to analyze the performance of the model over a specific period.
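A compressed sketch of steps 3 to 5 above on synthetic data (the data and model choice are for illustration only): impute missing values, fit a model, then validate it on a held-out set.

```python
# Prepare (impute missing values), model, and validate on held-out data.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan     # simulate missing values

X = SimpleImputer(strategy="mean").fit_transform(X)           # step 3: prepare
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)            # step 4: model
accuracy = model.score(X_test, y_test)                        # step 5: validate
```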

Ques: 7. Why is Data Cleansing essential, and which methods do you use to maintain clean data? Explain.


Dirty data often leads to incorrect insights, which can damage the prospects of any organization. For example, suppose you want to run a targeted marketing campaign, but your data incorrectly tells you that a specific product will be in demand with your target audience; the campaign will fail.
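A minimal data-cleansing sketch in pandas covering typical steps: dropping duplicate rows, fixing column types, and imputing missing values before any analysis.

```python
# Typical cleansing steps: de-duplicate, fix types, fill missing values.
import pandas as pd

df = pd.DataFrame({
    "customer": ["a", "a", "b", "c"],
    "age": ["34", "34", None, "29"],   # strings plus a missing value
})

df = df.drop_duplicates()                       # remove exact duplicate rows
df["age"] = pd.to_numeric(df["age"])            # fix the column type
df["age"] = df["age"].fillna(df["age"].mean())  # impute missing values
```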


Ques: 8. What is reinforcement learning?


Reinforcement Learning is a learning mechanism for mapping situations to actions so as to maximize a reward signal. In this method, the learner is not told which action to take but must instead discover which actions yield the maximum reward through trial and error, since the method is based on a reward/penalty mechanism.
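A minimal sketch of the trial-and-error idea: an epsilon-greedy agent on a two-armed bandit is never told which arm is better, but discovers it from a binary reward signal (the arm payoffs here are made up).

```python
# Epsilon-greedy bandit: explore 10% of the time, otherwise exploit.
import numpy as np

rng = np.random.default_rng(0)
true_rewards = [0.2, 0.8]          # arm 1 pays off more often (unknown to agent)
estimates = [0.0, 0.0]
counts = [0, 0]

for _ in range(2000):
    # Explore with probability 0.1, otherwise exploit the best estimate.
    arm = rng.integers(2) if rng.random() < 0.1 else int(np.argmax(estimates))
    reward = float(rng.random() < true_rewards[arm])   # binary reward signal
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

best_arm = int(np.argmax(estimates))   # the agent's discovered best action
```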


Ques: 9. While working on a data set, how can you select important variables? Explain.


You can use the following methods of variable selection:

  • Remove correlated variables before selecting the important ones.
  • Use linear regression and select variables based on their p-values.
  • Use forward selection, backward elimination, and stepwise selection.
  • Use XGBoost or Random Forest and plot a variable importance chart.
  • Measure the information gain for the given set of features and select the top n features accordingly.
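Two of the methods above sketched with scikit-learn: a random-forest variable importance ranking, and information gain estimated via mutual information.

```python
# Variable importance from a random forest, and top-n selection by
# mutual information (an information-gain estimate).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_        # one score per variable

mi = mutual_info_classif(X, y, random_state=0)   # information gain per feature
top5 = mi.argsort()[::-1][:5]                    # indices of the top-5 features
```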


Ques: 10. What cross-validation technique would you use on a time series dataset?


Instead of using k-fold cross-validation, you should be aware of the fact that a time series is not randomly distributed data; it is inherently ordered chronologically.

In the case of time series data, you should use techniques like forward chaining, where you train the model on past data and then test it on the data that follows.

fold 1: training [1], test [2]

fold 2: training [1 2], test [3]

fold 3: training [1 2 3], test [4]

fold 4: training [1 2 3 4], test [5]
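The forward-chaining folds above correspond to scikit-learn's TimeSeriesSplit: every training window ends before its test window begins, so the model is always evaluated on data from its future.

```python
# Forward chaining via TimeSeriesSplit: train windows grow, tests follow them.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)   # ten chronologically ordered observations

folds = [(train, test) for train, test in TimeSeriesSplit(n_splits=4).split(X)]
```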


Ques: 11. What is deep learning?


Deep learning is a subfield of machine learning inspired by the structure and function of the brain, known as artificial neural networks. Machine learning includes many algorithms, such as linear regression, SVMs, and neural networks, and deep learning is essentially an extension of neural networks. In conventional neural nets we use a small number of hidden layers, but deep learning algorithms use a large number of hidden layers to better capture the input-output relationship.
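A minimal sketch of the "many hidden layers" idea: a forward pass through a small fully connected network in NumPy (the weights are random and untrained; layer sizes are arbitrary choices for illustration).

```python
# Forward pass through stacked hidden layers with ReLU activations.
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 16, 16, 16, 3]   # input, three hidden layers, output

x = rng.normal(size=(1, 4))
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = rng.normal(size=(n_in, n_out))
    x = np.maximum(0, x @ W)       # ReLU activation after each layer
```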


Ques: 12. What is the difference between machine learning and deep learning?


Machine learning:

Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. Machine learning can be categorized into the following three categories:

  • Supervised machine learning,
  • Unsupervised machine learning,
  • Reinforcement learning

Deep learning:

Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.


Ques: 13. What is selection bias?


Selection bias is the bias introduced by the selection of individuals, groups, or data for analysis in such a way that proper randomization is not achieved, with the result that the sample obtained is not representative of the population intended to be analysed. It is sometimes referred to as the selection effect. The phrase “selection bias” most often refers to the distortion of a statistical analysis resulting from the method of collecting samples. If the selection bias is not considered, then some conclusions of the study may not be accurate.
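A small simulation of the effect (synthetic data for illustration): sampling without randomization, here by taking only the largest values of a population, makes the sample mean a badly distorted estimate of the population mean, while a random sample stays close to it.

```python
# Compare a random sample against a non-randomized (biased) selection.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)

random_sample = rng.choice(population, size=1000, replace=False)
biased_sample = np.sort(population)[-1000:]   # selection without randomization

random_error = abs(random_sample.mean() - population.mean())
biased_error = abs(biased_sample.mean() - population.mean())
```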


Ques: 14. What is TF/IDF vectorization?


TF–IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The TF–IDF value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
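A minimal TF–IDF sketch with scikit-learn: a word that occurs in every document (here "the") gets a lower inverse-document-frequency weight than a word unique to one document (here "mat"), which is exactly the offsetting effect described above.

```python
# TF-IDF weighting over a tiny corpus of three documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the house",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # shape: (n_docs, n_terms)
vocab = vectorizer.vocabulary_           # word -> column index
```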


Ques: 15. What is the difference between Regression and classification ML techniques?


Both regression and classification machine learning techniques come under supervised machine learning algorithms. In a supervised machine learning algorithm, we must train the model using a labelled dataset; during training we explicitly provide the correct labels, and the algorithm tries to learn the pattern from input to output. If the labels are discrete values (e.g. A, B), it is a classification problem, but if the labels are continuous values (e.g. 1.23, 1.333), it is a regression problem.
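The distinction sketched in code (toy data for illustration): the same inputs, fitted once against discrete labels and once against continuous targets.

```python
# Classification predicts a discrete class; regression predicts a real number.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y_class = np.array([0, 0, 1, 1])          # discrete labels -> classification
y_reg = np.array([1.1, 2.2, 2.9, 4.1])    # continuous values -> regression

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_reg)

pred_label = clf.predict([[3.5]])[0]      # a class label
pred_value = reg.predict([[3.5]])[0]      # a real number
```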


Ques: 16. What is p-value?


When you perform a hypothesis test in statistics, a p-value helps you determine the strength of your results. A p-value is a number between 0 and 1; based on its value, it denotes the strength of the evidence. The claim that is on trial is called the null hypothesis.

A low p-value (≤ 0.05) indicates evidence against the null hypothesis, which means we can reject it. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it. A p-value close to 0.05 is marginal, so the result could go either way. To put it another way,

High p-values: your data are likely with a true null.
Low p-values: your data are unlikely with a true null.

Ques: 17. What are the differences between overfitting and underfitting?


In order to make reliable predictions on general untrained data in machine learning and statistics, it is required to fit a (machine learning) model to a set of training data. Overfitting and underfitting are two of the most common modelling errors that occur while doing so.

Following are the differences between overfitting and underfitting:

Definition - A statistical model suffering from overfitting describes some random error or noise in place of the underlying relationship. When underfitting occurs, a statistical model or machine learning algorithm fails in capturing the underlying trend of the data.

Occurrence – Overfitting can result when a statistical model or machine learning algorithm is excessively complex. An example of a complex model is one with too many parameters relative to the total number of observations. Underfitting occurs when trying to fit a linear model to non-linear data.

Poor Predictive Performance – Although both overfitting and underfitting yield poor predictive performance, the way in which each one of them does so is different. While the overfitted model overreacts to minor fluctuations in the training data, the underfit model under-reacts to even bigger fluctuations.
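Both errors can be sketched on the same synthetic data: a degree-1 polynomial underfits a cubic signal, while a high-degree polynomial chases the noise. Note that the overfitted model has the lowest training error precisely because it is fitting noise; its failure only shows up on unseen data.

```python
# Training error of an underfit, well-fit, and overfit polynomial model.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = x**3 + rng.normal(scale=0.05, size=x.size)   # cubic signal plus noise

def train_error(degree):
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

underfit, good, overfit = train_error(1), train_error(3), train_error(9)
```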


Ques: 18. Could you explain the role of data cleaning in data analysis?


Data cleaning can be a daunting task since with the increase in the number of data sources, the time required for cleaning the data increases at an exponential rate.

This is due to the vast volume of data generated by additional sources. Also, data cleaning can solely take up to 80% of the total time required for carrying out a data analysis task.

Nevertheless, there are several reasons for using data cleaning in data analysis. Two of the most important ones are:

  • Cleaning data from different sources helps in transforming the data into a format that is easy to work with. 
  • Data cleaning increases the accuracy of a machine learning model.


Ques: 19. Can you explain Recommender Systems along with an application?


Recommender systems are a subclass of information filtering systems, meant for predicting the preferences or ratings that a user would give to a product.

An application of a recommender system is the product recommendations section in Amazon. This section contains items based on the user’s search history and past orders.
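A minimal item-based sketch of the idea (the ratings matrix is made up): cosine similarity between item columns of a user-item rating matrix scores the items a user has not yet rated, much as a "customers also bought" section does.

```python
# Item-based collaborative filtering via cosine similarity between items.
import numpy as np

# Rows are users, columns are items; 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

norms = np.linalg.norm(ratings, axis=0)
similarity = (ratings.T @ ratings) / np.outer(norms, norms)  # item x item

# Recommend for user 0: score unrated items by similarity to rated ones.
scores = ratings[0] @ similarity
scores[ratings[0] > 0] = -np.inf     # exclude items the user already rated
recommended_item = int(np.argmax(scores))
```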


Ques: 20. What is exploding gradients?


“Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.” At an extreme, the values of weights can become so large as to overflow and result in NaN values.

This has the effect of your model being unstable and unable to learn from your training data.

Gradient: the direction and magnitude calculated during the training of a neural network, which is used to update the network weights in the right direction and by the right amount.
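A numerical sketch of why the problem arises: backpropagating through many layers multiplies gradient factors together, so repeated factors above 1 grow exponentially. Gradient clipping, a common remedy, caps the update at a maximum value.

```python
# Repeated per-layer factors > 1 make the gradient explode exponentially;
# clipping bounds the resulting weight update.
import numpy as np

grad = 1.0
for _ in range(100):
    grad *= 1.5           # each of 100 layers contributes a factor of 1.5

clipped = np.clip(grad, -5.0, 5.0)   # gradient clipping caps the update
```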

