Top 20 Python Pandas Interview Questions and Answers

Pandas is a Python library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is open-source and BSD-licensed. Python with Pandas is utilised in a variety of academic and commercial disciplines, including finance, economics, statistics, analytics, and more.

Data analysis necessitates a great deal of processing, such as restructuring, cleansing, or combining, among other things. Numpy, Scipy, Cython, and Panda are just a few of the quick data processing tools available. However, we favour Pandas since they are faster, easier, and more expressive than other tools.

Python Interview Questions & Answers

Ques. 1): What is Pandas? What is the purpose of Python pandas?

Answer:

Pandas is a Python module that provides quick, versatile, and expressive data structures that make working with "relational" or "labelled" data simple and intuitive. Its goal is to serve as the foundation for undertaking realistic, real-world data analysis in Python.

Pandas is a data manipulation and analysis software library for the Python programming language. It includes data structures and methods for manipulating numerical tables and time series, in particular. Pandas is open-source software distributed under the BSD three-clause licence.

Ques. 2): Mention the many types of data structures available in Pandas?

Answer:

The pandas library supports two data structures: Series and DataFrames. Numpy is used to construct both data structures. In pandas, a Series is a one-dimensional data structure, while a DataFrame is a two-dimensional data structure. Panel is another axis label that is a three-dimensional data structure that comprises items, major axis, and minor axis.

Ques. 3): What are the key features of pandas library ? What is pandas Used For ?

Answer:

There are various features in pandas library and some of them are mentioned below

Data Alignment

Memory Efficient

Reshaping

Merge and join

Time Series

This library is developed in Python and can be used to do data processing, data analysis, and other tasks. To manipulate time series and numerical tables, the library contains numerous operations as well as data structures.

Ques. 4): What is Pandas NumPy?

Answer:

Pandas Numpy is an open-source Python module that allows you to work with a huge number of datasets. For scientific computing with Python, it has a powerful N-dimensional array object and complex mathematical methods.

Fourier transformations, linear algebra, and random number capabilities are some of Numpy's most popular features. It also includes integration tools for C/C++ and Fortran programming.

Ques. 5): In Pandas, what is a Time Series?

Answer:

An ordered sequence of data that depicts how a quantity evolves over time is known as a time series. For all fields, pandas has a wide range of capabilities and tools for working with time series data.

pandas supports:

Taking time series data from a variety of sources and formats and parsing it

Create a series of dates and time ranges with a set frequency.

Manipulation and conversion of date and time with timezone data

A time series is resampled or converted to a specific frequency.

Using absolute or relative time increments to do date and time arithmetic.

Ques. 6): In pandas, what is a DataFrame?

Answer:

Pandas DataFrame is a possibly heterogeneous two-dimensional size-mutable tabular data format with labelled axes (rows and columns). A data frame is a two-dimensional data structure in which data is organised in rows and columns in a tabular format. The data, rows, and columns are the three main components of a Pandas DataFrame.

Creating a Pandas DataFrame-

A Pandas DataFrame is built in the real world by loading datasets from existing storage, which can be a SQL database, a CSV file, or an Excel file. Pandas DataFrames can be made from lists, dictionaries, and lists of dictionaries, among other things. A dataframe can be constructed in a variety of ways.

Creating a dataframe using List: DataFrame can be created using a single list or a list of lists.

Ques. 7): Explain Series In pandas. How To Create Copy Of Series In pandas?

Answer:

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

>>> s = pd.Series(data, index=index), where the data can be a Python dict, an ndarray or a scalar value.

To create a copy in pandas, we can call copy() function on a series such that

s2=s1.copy() will create copy of series s1 in a new series s2.

Ques. 8): How will you create an empty DataFrame in pandas?

Answer:

To create a completely empty Pandas dataframe, we use do the following:

import pandas as pd

MyEmptydf = pd.DataFrame()

This will create an empty dataframe with no columns or rows.

To create an empty dataframe with three empty column (columns X, Y and Z), we do:

df = pd.DataFrame(columns=[‘X’, ‘Y’, ‘Z’])

Ques. 9): What is Python pandas vectorization?

Answer:

The process of executing operations on the full array is known as vectorization. This is done to reduce the number of times the functions iterate. Pandas has a number of vectorized functions, such as aggregations and string functions, that are designed to work with series and DataFrames especially. To perform the operations quickly, it is preferable to use the vectorized pandas functions.

Ques. 10): range () vs and xrange () functions in Python?

Answer:

In Python 2 we have the following two functions to produce a list of numbers within a given range.

range()

xrange()

in Python 3, xrange() is deprecated, i.e. xrange() is removed from python 3.x.

Now In Python 3, we have only one function to produce the numbers within a given range i.e. range() function.

But, range() function of python 3 works same as xrange() of python 2 (i.e. internal implementation of range() function of python 3 is same as xrange() of Python 2).

So The difference between range() and xrange() functions becomes relevant only when you are using python 2.

range() and xrange() function values

a). range() creates a list i.e., range returns a Python list object, for example, range (1,500,1) will create a python list of 499 integers in memory. Remember, range() generates all numbers at once.

b).xrange() functions returns an xrange object that evaluates lazily. That means xrange only stores the range arguments and generates the numbers on demand. It doesn’t generate all numbers at once like range(). Furthermore, this object only supports indexing, iteration, and the len() function.

On the other hand xrange() generates the numbers on demand. That means it produces number one by one as for loop moves to the next number. In every iteration of for loop, it generates the next number and assigns it to the iterator variable of for loop.

Ques. 11): What does categorical data mean in Pandas?

Answer:

Categorical data is a Pandas data type that correlates to a statistical categorical variable. A categorical variable is one that has a restricted number of possible values, which is usually fixed. Gender, country of origin, blood type, social status, observation time, and Likert scale ratings are just a few examples. Categorical data values are either in categories or np.nan.This data type is useful in the following cases:

It is useful for a string variable that consists of only a few different values. If we want to save some memory, we can convert a string variable to a categorical variable.

It is useful for the lexical order of a variable that is not the same as the logical order (“one”, “two”, “three”) By converting into a categorical and specify an order on the categories, sorting and min/max is responsible for using the logical order instead of the lexical order.

It is useful as a signal to other Python libraries because this column should be treated as a categorical variable.

Ques. 12): To a Pandas DataFrame, how do you add an index, a row, or a column?

Answer:

Adding an Index into a DataFrame: If you create a DataFrame with Pandas, you can add the inputs to the index argument. It will ensure that you get the index you want. If no inputs are specified, the DataFrame has a numerically valued index that starts at 0 and terminates on the DataFrame's last row.

Increasing the number of rows in a DataFrame: To insert rows in the DataFrame, we can use the.loc, iloc, and ix commands.

The loc is primarily used for our index's labels. It can be seen as if we insert in loc[4], which means we're seeking for DataFrame items with an index of 4.

The ix is a complex case because if the index is integer-based, we pass a label to ix. The ix[4] means that we are looking in the DataFrame for those values that have an index labeled 4. However, if the index is not only integer-based, ix will deal with the positions as iloc.

Ques. 13): How to Delete Indices, Rows or Columns From a Pandas Data Frame?

Answer:

Deleting an Index from Your DataFrame

If you want to remove the index from the DataFrame, you should have to do the following:

Reset the index of DataFrame.

Executing del df.index.name to remove the index name.

Remove duplicate index values by resetting the index and drop the duplicate values from the index column.

Remove an index with a row.

Deleting a Column from Your DataFrame

You can use the drop() method for deleting a column from the DataFrame.

The axis argument that is passed to the drop() method is either 0 if it indicates the rows and 1 if it drops the columns.

You can pass the argument inplace and set it to True to delete the column without reassign the DataFrame.

You can also delete the duplicate values from the column by using the drop_duplicates() method.

Removing a Row from Your DataFrame

By using df.drop_duplicates(), we can remove duplicate rows from the DataFrame.

You can use the drop() method to specify the index of the rows that we want to remove from the DataFrame.

Ques. 14): How to convert String to date?

Answer:

The below code demonstrates how to convert the string to date:

From datetime import datetime

# Define dates as the strings

dmy_str1 = ‘Wednesday, July 14, 2018’

dmy_str2 = ’14/7/17′

dmy_str3 = ’14-07-2017′

# Define dates as the datetime objects

dmy_dt1 = datetime.strptime(date_str1, ‘%A, %B %d, %Y’)

dmy_dt2 = datetime.strptime(date_str2, ‘%m/%d/%y’)

dmy_dt3 = datetime.strptime(date_str3, ‘%m-%d-%Y’)

#Print the converted dates

print(dmy_dt1)

print(dmy_dt2)

print(dmy_dt3)

Ques. 15): What exactly is the Pandas Index?

Answer:

Pandas indexing is as follows:

In pandas, indexing simply involves picking specific rows and columns of data from a DataFrame. Selecting all of the rows and some of the columns, part of the rows and all of the columns, or some of each of the rows and columns is what indexing entails. Subset selection is another name for indexing.

Using [],.loc[],.iloc[],.ix[] for Pandas indexing

A DataFrame's items, rows, and columns can be extracted in a variety of methods. In Pandas, there are some indexing methods that can be used to retrieve an element from a DataFrame. These indexing systems look to be fairly similar on the surface, however they perform extremely differently. Pandas supports four different methods of multi-axes indexing:

Dataframe.[ ] ; This function also known as indexing operator

Dataframe.loc[ ] : This function is used for labels.

Dataframe.iloc[ ] : This function is used for positions or integer based

Dataframe.ix[] : This function is used for both label and integer based

Collectively, they are called the indexers. These are by far the most common ways to index data. These are four function which help in getting the elements, rows, and columns from a DataFrame.

Ques. 16): Define ReIndexing?

Answer:

Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Multiple operations can be accomplished through indexing like −

Reorder the existing data to match a new set of labels.

Insert missing value (NA) markers in label locations where no data for the label existed.

Ques. 17): How to Set the index?

Answer:

Python is an excellent language for data analysis, thanks to its vast ecosystem of data-centric Python packages. One of these packages is Pandas, which makes importing and analysing data a lot easier.

Pandas set index() is a function for setting the index of a Data Frame from a List, Series, or Data Frame. A data frame's index column can also be set while it's being created. However, because a data frame might be made up of two or more data frames, the index can be altered later using this method.

Syntax:

DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False

Ques. 18): Define GroupBy in Pandas?

Answer:

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Pandas dataframe.groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.

Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)

Parameters :

by : mapping, function, str, or iterable

axis : int, default 0

level : If the axis is a MultiIndex (hierarchical), group by a particular level or levels

as_index : For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort : Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.

group_keys : When calling apply, add group keys to index to identify pieces

squeeze : Reduce the dimensionality of the return type if possible, otherwise return a consistent type

Returns : GroupBy object

Ques. 19): How will you add a scalar column with same value for all rows to a pandas DataFrame?

Answer:

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Dataframe.add() method is used for addition of dataframe and other, element-wise (binary operator add). Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs.

Syntax: DataFrame.add(other, axis=’columns’, level=None, fill_value=None)

Parameters:

other :Series, DataFrame, or constant

axis :{0, 1, ‘index’, ‘columns’} For Series input, axis to match Series index on

fill_value : [None or float value, default None] Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing.

level : [int or name] Broadcast across a level, matching Index values on the passed MultiIndex level

Returns: result DataFrame

Ques. 20): In pandas, how can you see if a DataFrame is empty?

Answer:

Pandas DataFrame is a possibly heterogeneous two-dimensional size-mutable tabular data format with labelled axes (rows and columns). Both the row and column labels align for arithmetic operations. It can be viewed of as a container for Series items, similar to a dict. The Pandas' fundamental data structure is this.

Pandas DataFrame is a dataframe for Pandas.

The empty attribute determines whether or not the dataframe is empty. If the dataframe is empty, it returns True; otherwise, it returns False.

Syntax: DataFrame.empty

Parameter : None

Returns : bool

Top Technical Interviews Questions and Answers for AWS Cloud, Java, Oracle

December 30, 2021

Top 20 Python Pandas Interview Questions and Answers

November 17, 2021

Top 20 Apache Spark Interview Questions & Answers