Top 20 Python Pandas Interview Questions and Answers

Pandas is a Python library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is open-source and BSD-licensed. Python with Pandas is utilised in a variety of academic and commercial disciplines, including finance, economics, statistics, analytics, and more.

Data analysis involves a great deal of processing, such as restructuring, cleansing, and combining data. NumPy, SciPy, Cython, and pandas are just a few of the tools available for fast data processing. We favour pandas because it is often faster, easier, and more expressive than the alternatives when working with tabular data.

Ques. 1): What is Pandas? What is the purpose of Python pandas?

Answer:

Pandas is a Python module that provides quick, versatile, and expressive data structures that make working with "relational" or "labelled" data simple and intuitive. Its goal is to serve as the foundation for undertaking realistic, real-world data analysis in Python.

Pandas is a data manipulation and analysis software library for the Python programming language. It includes data structures and methods for manipulating numerical tables and time series, in particular. Pandas is open-source software distributed under the BSD three-clause licence.

Ques. 2): Mention the many types of data structures available in Pandas?

Answer:

The pandas library supports two main data structures: Series and DataFrame. Both are built on top of NumPy. A Series is a one-dimensional data structure, while a DataFrame is a two-dimensional data structure. Older versions of pandas also provided Panel, a three-dimensional data structure comprising items, a major axis, and a minor axis, but it was deprecated and later removed.

Ques. 3): What are the key features of the pandas library? What is pandas used for?

Answer:

The pandas library has many features; some of them are listed below:

Data Alignment

Memory Efficient

Reshaping

Merge and join

Time Series

This library is developed in Python and can be used to do data processing, data analysis, and other tasks. To manipulate time series and numerical tables, the library contains numerous operations as well as data structures.

Ques. 4): What is Pandas NumPy?

Answer:

NumPy is an open-source Python module for working with large numerical datasets, and pandas is built on top of it. For scientific computing with Python, it provides a powerful N-dimensional array object and sophisticated mathematical functions.

Fourier transforms, linear algebra, and random number capabilities are some of NumPy's most popular features. It also includes tools for integrating C/C++ and Fortran code.

Ques. 5): In Pandas, what is a Time Series?

Answer:

An ordered sequence of data that depicts how a quantity evolves over time is known as a time series. pandas has a wide range of capabilities and tools for working with time series data in any field.

pandas supports:

Parsing time series data from a variety of sources and formats.

Creating sequences of dates and time ranges with a fixed frequency.

Manipulating and converting dates and times with timezone information.

Resampling or converting a time series to a specific frequency.

Performing date and time arithmetic using absolute or relative time increments.
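
As a minimal sketch of a few of these capabilities, the snippet below parses a date string, builds a daily range, resamples it, attaches a timezone, and does simple date arithmetic. The dates and values are made up for illustration.

```python
import pandas as pd

# Parse a date string into a Timestamp.
parsed = pd.to_datetime("2024-01-15")

# Build a daily date range and use it as the index of a Series.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Resample to two-day bins and sum the values in each bin.
two_day = s.resample("2D").sum()

# Attach a timezone to the timezone-naive timestamps.
s_utc = s.tz_localize("UTC")

# Date arithmetic with a relative time increment.
next_week = parsed + pd.Timedelta(days=7)

print(two_day)
print(next_week)
```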

Ques. 6): In pandas, what is a DataFrame?

Answer:

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labelled axes (rows and columns). Data is organised in rows and columns, and the data, the rows, and the columns are the three main components of a DataFrame.

Creating a Pandas DataFrame:

In practice, a DataFrame is usually built by loading a dataset from existing storage, which can be a SQL database, a CSV file, or an Excel file. DataFrames can also be made from lists, dictionaries, and lists of dictionaries, among other things.

Creating a DataFrame using a list: A DataFrame can be created from a single list or from a list of lists.
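
The construction options mentioned above can be sketched as follows. The column names and values are illustrative assumptions, not taken from any particular dataset.

```python
import pandas as pd

# From a single list: one column with a default integer index.
df_single = pd.DataFrame(["a", "b", "c"], columns=["letter"])

# From a list of lists: each inner list becomes a row.
df_rows = pd.DataFrame([["Ann", 30], ["Bob", 25]], columns=["name", "age"])

# From a dict of lists: each key becomes a column.
df_dict = pd.DataFrame({"name": ["Ann", "Bob"], "age": [30, 25]})

# From a list of dicts: each dict becomes a row.
df_records = pd.DataFrame([{"name": "Ann", "age": 30}, {"name": "Bob", "age": 25}])

print(df_rows)
```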

Ques. 7): Explain Series In pandas. How To Create Copy Of Series In pandas?

Answer:

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

s = pd.Series(data, index=index)

Here, data can be a Python dict, an ndarray, or a scalar value.

To copy a Series, call its copy() method:

s2 = s1.copy()

This creates a new Series s2 that is a copy of s1.
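
A short sketch of both points, creating a Series from different inputs and then copying it. The values and labels are arbitrary examples.

```python
import pandas as pd

# A Series from a dict, from a list with explicit labels, and from a scalar.
s_from_dict = pd.Series({"a": 1, "b": 2, "c": 3})
s_from_list = pd.Series([10, 20, 30], index=["x", "y", "z"])
s_from_scalar = pd.Series(5, index=["p", "q"])

# copy() makes a deep copy by default, so changing s2 leaves s1 untouched.
s1 = s_from_list
s2 = s1.copy()
s2["x"] = 99

print(s1["x"], s2["x"])
```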

Ques. 8): How will you create an empty DataFrame in pandas?

Answer:

To create a completely empty Pandas DataFrame, we do the following:

import pandas as pd

MyEmptydf = pd.DataFrame()

This will create an empty DataFrame with no columns or rows.

To create an empty DataFrame with three empty columns (X, Y, and Z), we do:

df = pd.DataFrame(columns=['X', 'Y', 'Z'])

Ques. 9): What is Python pandas vectorization?

Answer:

Vectorization is the process of applying an operation to an entire array at once instead of to one element at a time. It avoids explicit Python-level loops by pushing the iteration into optimised compiled code. pandas has many vectorized functions, such as aggregations and string functions, that are designed specifically for Series and DataFrames. To perform operations quickly, prefer the vectorized pandas functions over manual loops.
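
To make the contrast concrete, here is a small sketch that computes the same result with an explicit loop and with vectorized operations. The column names and numbers are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "qty": [1, 2, 3],
    "name": ["apple", "kiwi", "fig"],
})

# Loop version: one Python-level multiplication per row.
totals_loop = []
for price, qty in zip(df["price"], df["qty"]):
    totals_loop.append(price * qty)

# Vectorized version: the whole columns are multiplied in one operation.
df["total"] = df["price"] * df["qty"]

# Vectorized aggregation and vectorized string function.
grand_total = df["total"].sum()
upper_names = df["name"].str.upper()

print(df)
```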

Ques. 10): What is the difference between the range() and xrange() functions in Python?

Answer:

In Python 2, there are two functions that produce a sequence of numbers within a given range:

range()

xrange()

In Python 3, xrange() has been removed, so range() is the only function for this purpose.

However, range() in Python 3 behaves like xrange() in Python 2: it returns a lazy sequence object instead of a list.

So the difference between range() and xrange() is relevant only when you are using Python 2.

range() and xrange() return values in Python 2

a). range() returns a Python list object. For example, range(1, 500, 1) creates a list of 499 integers in memory. range() generates all the numbers at once.

b). xrange() returns an xrange object that evaluates lazily. It stores only the range arguments and generates the numbers on demand: in each iteration of a for loop, it produces the next number and assigns it to the loop variable. This object supports only indexing, iteration, and the len() function.
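
The lazy behaviour can be observed directly in Python 3, where range() plays the role that xrange() had in Python 2. This is a small sketch; the exact byte sizes reported by sys.getsizeof depend on the interpreter, so only their relative order is meaningful.

```python
import sys

# A range object stores only start, stop, and step.
r = range(1, 500, 1)

# It still supports len(), indexing, and membership tests.
count = len(r)
first, last = r[0], r[-1]

# Converting to a list materialises every number, as Python 2's range() did.
as_list = list(r)

# The lazy object stays small no matter how many numbers it represents.
lazy_size = sys.getsizeof(range(1, 1_000_000))
list_size = sys.getsizeof(list(range(1, 1_000_000)))

print(count, first, last, lazy_size < list_size)
```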

Ques. 11): What does categorical data mean in Pandas?

Answer:

Categorical data is a Pandas data type that corresponds to a categorical variable in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values. Gender, country of origin, blood type, social status, observation time, and Likert scale ratings are just a few examples. Categorical data values are either in the categories or np.nan. This data type is useful in the following cases:

It is useful for a string variable that consists of only a few different values. If we want to save some memory, we can convert a string variable to a categorical variable.

It is useful when the lexical order of a variable differs from its logical order (“one”, “two”, “three”). By converting the variable to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.

It is useful as a signal to other Python libraries that this column should be treated as a categorical variable.
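
The second case, logical versus lexical order, can be sketched like this. The values and the variable names are chosen only for illustration.

```python
import pandas as pd

s = pd.Series(["two", "one", "three", "one"])

# Plain strings sort lexically: "three" comes before "two".
lexical = s.sort_values().tolist()

# An ordered categorical sorts by the declared category order instead.
order = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
cat = s.astype(order)
logical = cat.sort_values().tolist()

# min() and max() also follow the category order.
smallest, largest = cat.min(), cat.max()

print(lexical)
print(logical)
```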

Ques. 12): To a Pandas DataFrame, how do you add an index, a row, or a column?

Answer:

Adding an Index to a DataFrame: When you create a DataFrame with Pandas, you can pass the labels you want to the index argument. If no index is specified, the DataFrame gets a default integer index that starts at 0 and ends at the number of rows minus one.

Adding rows to a DataFrame: To insert a row, assign to a new label with .loc. The .iloc indexer is position-based and can only overwrite rows that already exist. Older tutorials also mention .ix, but it was deprecated and then removed in pandas 1.0.

The loc indexer works on the labels of the index. Writing loc[4] refers to the row whose index label is 4, not the row at position 4.

The ix indexer was a more complex case: if the index was integer-based, ix[4] looked for the label 4; if the index was not purely integer-based, ix treated the number as a position, as iloc does. This ambiguity is a main reason it was dropped, and new code should use loc or iloc.
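
The following sketch shows one way to set an index at creation time, add a row by label, and add a column by plain assignment. The labels and values are made up for the example.

```python
import pandas as pd

# Without an index argument, the index defaults to 0, 1, 2, ...
default = pd.DataFrame({"A": [1, 2]})

# Pass custom labels through the index argument.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

# Add a row by assigning to a label that does not exist yet.
df.loc["z"] = [5, 6]

# Add a column by assigning to a new column name.
df["C"] = df["A"] + df["B"]

print(df)
```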

Ques. 13): How to Delete Indices, Rows or Columns From a Pandas Data Frame?

Answer:

Deleting an Index from Your DataFrame

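If you want to remove the index from the DataFrame, you can do one of the following:

Reset the index of the DataFrame with reset_index().

Remove the index name, for example by setting df.index.name = None (older tutorials use del df.index.name, which may fail on recent pandas versions).

Remove duplicate index values by resetting the index and dropping the duplicates from the resulting index column.

Remove an index label together with its row by using drop().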

Deleting a Column from Your DataFrame

You can use the drop() method for deleting a column from the DataFrame.

The axis argument passed to the drop() method is 0 to drop rows and 1 to drop columns.

You can pass the argument inplace and set it to True to delete the column without reassigning the DataFrame.

You can also delete the duplicate values from the column by using the drop_duplicates() method.

Removing a Row from Your DataFrame

By using df.drop_duplicates(), we can remove duplicate rows from the DataFrame.

You can pass the index labels of the rows you want to remove to the drop() method.
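
These deletion operations can be sketched as follows. Each call returns a new DataFrame and leaves df unchanged, and the data is invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 1, 2], "B": [3, 3, 4], "C": [5, 5, 6]},
    index=["x", "y", "z"],
)

# Drop a column (axis=1) and a row (axis=0).
no_c = df.drop("C", axis=1)
no_x = df.drop("x", axis=0)

# Drop duplicate rows; the first occurrence is kept by default.
deduped = df.drop_duplicates()

# Replace the index with the default integer index, discarding the old labels.
reset = df.reset_index(drop=True)

print(deduped)
```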

Ques. 14): How to convert String to date?

Answer:

The code below demonstrates how to convert strings to dates:

from datetime import datetime

# Define the dates as strings

dmy_str1 = 'Saturday, July 14, 2018'

dmy_str2 = '14/7/17'

dmy_str3 = '14-07-2017'

# Convert the strings to datetime objects; each format must match its string

dmy_dt1 = datetime.strptime(dmy_str1, '%A, %B %d, %Y')

dmy_dt2 = datetime.strptime(dmy_str2, '%d/%m/%y')

dmy_dt3 = datetime.strptime(dmy_str3, '%d-%m-%Y')

# Print the converted dates

print(dmy_dt1)

print(dmy_dt2)

print(dmy_dt3)
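
Within pandas itself, the usual tool for this conversion is pd.to_datetime, which converts a whole column at once. A minimal sketch, assuming a hypothetical column named when that holds day/month/year strings:

```python
import pandas as pd

# A column of date strings in day/month/two-digit-year form.
df = pd.DataFrame({"when": ["14/7/17", "15/8/17"]})

# Convert the whole column; an explicit format avoids day/month ambiguity.
df["when"] = pd.to_datetime(df["when"], format="%d/%m/%y")

# Datetime columns expose date parts through the .dt accessor.
months = df["when"].dt.month.tolist()

print(df)
```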

Ques. 15): What exactly is the Pandas Index?

Answer:

Pandas indexing is as follows:

In pandas, indexing simply involves picking specific rows and columns of data from a DataFrame. Selecting all of the rows and some of the columns, part of the rows and all of the columns, or some of each of the rows and columns is what indexing entails. Subset selection is another name for indexing.

Using [], .loc[], .iloc[], and .ix[] for Pandas indexing

The items, rows, and columns of a DataFrame can be extracted in a variety of ways. These indexing methods look fairly similar on the surface, but they behave very differently. Pandas has offered four methods of multi-axis indexing:

DataFrame[ ] : Also known as the indexing operator.

DataFrame.loc[ ] : Used for label-based selection.

DataFrame.iloc[ ] : Used for position-based (integer) selection.

DataFrame.ix[ ] : Accepted both labels and integer positions. It was deprecated and then removed in pandas 1.0, so it appears only in older code.

Collectively, they are called the indexers. They are by far the most common ways to select elements, rows, and columns from a DataFrame.
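
A brief sketch of the indexers that are available in current pandas versions. The frame and its labels are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

# [] selects a column by name.
col = df["A"]

# .loc selects by label: row "y", column "B".
by_label = df.loc["y", "B"]

# .iloc selects by integer position: first row, second column.
by_position = df.iloc[0, 1]

# Both indexers accept lists and slices to select subsets.
subset = df.loc[["x", "z"], ["A"]]

print(by_label, by_position)
```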

Ques. 16): Define ReIndexing?

Answer:

Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Multiple operations can be accomplished through reindexing, such as:

Reorder the existing data to match a new set of labels.

Insert missing value (NA) markers in label locations where no data for the label existed.
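
Both effects show up in a short sketch: existing labels are reordered, and a label with no data gets a missing-value marker. The labels are arbitrary.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Reorder to a new set of labels; "d" has no data, so it becomes NaN.
r = s.reindex(["c", "a", "d"])

# A fill value can be supplied for labels that had no data.
filled = s.reindex(["c", "a", "d"], fill_value=0)

print(r)
```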

Ques. 17): How to Set the index?

Answer:

Python is an excellent language for data analysis, thanks to its vast ecosystem of data-centric Python packages. One of these packages is Pandas, which makes importing and analysing data a lot easier.

set_index() is a pandas method for setting the index of a DataFrame from one or more existing columns, or from an array such as a list or Series of the same length. The index can also be set while the DataFrame is being created. However, because a DataFrame is often assembled from two or more DataFrames, it is common to change the index later using this method.

Syntax:

DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
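
A minimal usage sketch, assuming a hypothetical frame with id and name columns:

```python
import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103], "name": ["Ann", "Bob", "Cy"]})

# Use the "id" column as the index; by default the column is removed.
indexed = df.set_index("id")

# With drop=False the column is kept alongside the new index.
kept = df.set_index("id", drop=False)

# Rows can now be looked up by id label.
name_102 = indexed.loc[102, "name"]

print(indexed)
```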

Ques. 18): Define GroupBy in Pandas?

Answer:

Pandas dataframe.groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.

Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)

Parameters :

by : mapping, function, str, or iterable

axis : int, default 0

level : If the axis is a MultiIndex (hierarchical), group by a particular level or levels

as_index : For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort : Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.

group_keys : When calling apply, add group keys to index to identify pieces

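squeeze : Reduce the dimensionality of the return type if possible, otherwise return a consistent type. This parameter was deprecated and then removed in pandas 2.0, so the signature above applies to older versions.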

Returns : GroupBy object
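
A short sketch of the split and aggregate steps, including the effect of as_index. The team and score columns are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"team": ["a", "b", "a", "b"], "score": [1, 2, 3, 4]})

# Split by team, then aggregate each group; group labels become the index.
totals = df.groupby("team")["score"].sum()

# as_index=False keeps the group labels as a regular column ("SQL-style").
flat = df.groupby("team", as_index=False)["score"].sum()

print(totals)
print(flat)
```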

Ques. 19): How will you add a scalar column with same value for all rows to a pandas DataFrame?

Answer:

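To add a new column that holds the same scalar in every row, assign the scalar to a new column label, for example df['new_col'] = value; pandas broadcasts the value to all rows. The DataFrame.add() method does something different: it adds another object to the DataFrame element-wise (binary operator add). It is equivalent to dataframe + other, but with support for substituting a fill_value for missing data in one of the inputs.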

Syntax: DataFrame.add(other, axis='columns', level=None, fill_value=None)

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, 'index', 'columns'} For Series input, axis to match Series index on

fill_value : [None or float value, default None] Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing.

level : [int or name] Broadcast across a level, matching Index values on the passed MultiIndex level

Returns: result DataFrame
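
The sketch below separates the two operations: plain assignment creates a constant column, while add() performs element-wise addition. The column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# Assigning a scalar creates a column with that value in every row.
df["source"] = "web"
df["bonus"] = 10

# add() with a scalar adds it to every element of the numeric columns.
plus_one = df[["A", "bonus"]].add(1)

# fill_value substitutes for a value that is missing in only one input.
left = pd.DataFrame({"A": [1, None]})
right = pd.DataFrame({"A": [10, 20]})
filled = left.add(right, fill_value=0)

print(df)
```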

Ques. 20): In pandas, how can you see if a DataFrame is empty?

Answer:

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labelled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects, and it is the primary pandas data structure.

The empty attribute indicates whether the DataFrame is empty. It returns True if the DataFrame has no elements (that is, if any of its axes has length 0); otherwise, it returns False.

Syntax: DataFrame.empty

Parameter : None

Returns : bool
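
One detail worth seeing in code is that a DataFrame containing only NaN values is not considered empty. A small sketch:

```python
import pandas as pd

# No rows and no columns.
is_empty_plain = pd.DataFrame().empty

# Columns but no rows still counts as empty.
is_empty_no_rows = pd.DataFrame(columns=["X", "Y"]).empty

# A single NaN is still an element, so this frame is not empty.
is_empty_nan = pd.DataFrame({"X": [float("nan")]}).empty

print(is_empty_plain, is_empty_no_rows, is_empty_nan)
```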

May 05, 2019

Top 20 Oracle RDMS Interview Questions and Answers

Ques: 1. What is an Index? Explain the different types of index.

Answer:

An index is a performance enhancement structure that allows faster retrieval of records from a table. An index creates an entry for each value, which makes data retrieval faster.
While creating an index, we should consider which columns will be used in SQL queries and create one or more indexes on those columns.
The following index types are available.

a. Clustered index:

It sorts and stores the rows of data in the table or view, based on its keys. These are the columns included in the index definition. There can be only one clustered index per table because sorting of data rows can be done only in one order.

b. Non-clustered index:

It contains the non-clustered index key values, and each key value entry has a pointer to the data row. Thus a non-clustered index contains a pointer to the physical location of the record. Each table can have up to 999 non-clustered indexes (this is the limit in SQL Server).

c. Unique Index:

A unique index does not allow duplicate values in the indexed column. It is created automatically when a primary key is defined.

Ques: 2. What are Constraints? Explain the different Constraints available in SQL?

Answer:

Constraints are rules that determine or restrict the type of data that can go into a table, in order to maintain the accuracy and integrity of the data inside it. The following are the most frequently used constraints:

<NOT NULL> It restricts a column from holding a NULL value. It is defined on a column and cannot be declared as a table-level constraint.
<UNIQUE> It ensures that a field or column will only have unique values. It is applicable to both column and table.
<PRIMARY KEY> It uniquely identifies each record in a database table and cannot contain NULL values.
<FOREIGN KEY> It is used to relate two tables. The FOREIGN KEY constraint is also used to restrict actions that would destroy links between tables.
<CHECK CONSTRAINT> It restricts the values of a column, for example to a range. It checks the values before they are stored in the database, like a condition check before saving data into a column.
<DEFAULT> It is used to insert a default value into a column when no value is supplied.
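
The behaviour of these constraints can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in database, which is an assumption made purely for convenience: Oracle's DDL syntax and error messages differ, and the table and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE departments (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE employees (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    age     INTEGER CHECK (age BETWEEN 18 AND 70),
    status  TEXT DEFAULT 'active',
    dept_id INTEGER REFERENCES departments(id)
);
INSERT INTO departments VALUES (1, 'Finance');
""")

def violates(sql):
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

not_null = violates("INSERT INTO employees (name, age, dept_id) VALUES (NULL, 30, 1)")
unique = violates("INSERT INTO departments VALUES (2, 'Finance')")
check = violates("INSERT INTO employees (name, age, dept_id) VALUES ('Al', 10, 1)")
foreign = violates("INSERT INTO employees (name, age, dept_id) VALUES ('Al', 30, 99)")

# A valid row is accepted, and DEFAULT fills in the omitted status column.
valid = violates("INSERT INTO employees (name, age, dept_id) VALUES ('Al', 30, 1)")
status = conn.execute("SELECT status FROM employees WHERE name = 'Al'").fetchone()[0]

print(not_null, unique, check, foreign, valid, status)
```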

Ques: 3. What are Triggers? What are its benefits? Can we invoke a trigger explicitly?

Answer:

A trigger is a type of stored program that is fired automatically when some event occurs. We write a trigger as a response to any of the following events:

A database manipulation (DML) statement (DELETE, INSERT, or UPDATE).
A database definition (DDL) statement (CREATE, ALTER, or DROP).
A database operation (SERVERERROR, LOGON, LOGOFF, STARTUP, or SHUTDOWN).

SQL allows defining a trigger on the table, view, schema, or database associated with the event.

Following are its benefits:

Generating some derived column values automatically.
Enforcing referential integrity.
Event logging and storing information on table access.
Auditing.
Synchronous replication of tables.
Imposing security authorizations.
Preventing invalid transactions.

It is not possible to invoke a trigger explicitly. It gets invoked automatically if an event gets executed on the table having an association with the trigger.

Ques: 4. What is the purpose of isolation levels in SQL?

Answer:

Transactions use an isolation level that specifies the extent to which a transaction must be isolated from any data modifications caused by other transactions. These also help in identifying which concurrency side-effects are permissible.

Refer to the list below for more detail on the different isolation levels.

i. Read Committed.

It ensures that a SELECT query uses only committed values of the table. If there is an active transaction on the table in another session, the SELECT query waits for that transaction to complete. Read Committed is the default transaction isolation level.

ii. Read Uncommitted.

If another transaction has updated a table but has not yet been committed or rolled back, a SELECT query running under the Read Uncommitted isolation level still sees the modified values. This is known as a dirty read.

iii. Repeatable Read.

This level guarantees that rows already read by a transaction cannot be changed by other transactions until it ends, so repeating a read returns the same values for those rows. It does not prevent phantom reads: rows newly inserted by other transactions can still appear when a query is repeated.

iv. Serializable.

It is similar to Repeatable Read level. The only difference is that it stops Phantom Read and utilizes the range lock. If the table has an index, then it secures the records based on the range defined in the WHERE clause (like where ID between 1 and 3). If a table does not have an index, then it locks complete table.

v. Snapshot.

It is similar to Serializable isolation. The difference is that Snapshot does not hold a lock on a table during the transaction. Thus allowing the table to get modified in other sessions. Snapshot isolation maintains versioning in Tempdb for old data. In case any data modification happens in other sessions then existing transaction displays the old data from Tempdb.

Ques: 5. How do we Tune the Queries?

Answer:

Queries can be tuned by checking the logic (table joins), by creating indexes on the columns used in the WHERE clause, and by avoiding full table scans. Finally, use the trace utility to generate a trace file, then use the TKPROF utility to produce a statistical analysis of the query, based on which appropriate action can be taken.

Ques: 6. What is the difference between BEFORE and AFTER in Database Triggers?

Answer:

BEFORE triggers are usually used when validation needs to take place before accepting a change. They run before any change is made to the database. Let’s say you run a database for a bank. You have a table accounts and a table transactions. If a user makes a withdrawal from an account, you want to make sure that the account has enough credit for the withdrawal. A BEFORE trigger lets you do that and prevents the row from being inserted into transactions if the balance in accounts is not enough.

AFTER triggers are usually used when information needs to be updated in a separate table due to a change. They run after changes have been made to the database (not necessarily committed). Let’s go back to our bank example. After a successful transaction, you would want the balance to be updated in the accounts table. An AFTER trigger allows you to do exactly that.
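
The bank example can be sketched end to end. As an assumption made only to keep the example runnable anywhere, it uses Python's built-in sqlite3 module instead of Oracle; Oracle triggers are written in PL/SQL with different syntax. The trigger names and the withdraw helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE transactions (account_id INTEGER, amount INTEGER);

-- BEFORE trigger: validate the withdrawal before the row is inserted.
CREATE TRIGGER check_balance BEFORE INSERT ON transactions
BEGIN
    SELECT RAISE(ABORT, 'insufficient funds')
    WHERE (SELECT balance FROM accounts WHERE id = NEW.account_id) < NEW.amount;
END;

-- AFTER trigger: update the balance once the row has been inserted.
CREATE TRIGGER apply_withdrawal AFTER INSERT ON transactions
BEGIN
    UPDATE accounts SET balance = balance - NEW.amount
    WHERE id = NEW.account_id;
END;

INSERT INTO accounts VALUES (1, 100);
""")

def withdraw(account_id, amount):
    """Record a withdrawal; return False if the BEFORE trigger rejects it."""
    try:
        conn.execute("INSERT INTO transactions VALUES (?, ?)", (account_id, amount))
        return True
    except sqlite3.DatabaseError:
        return False

accepted = withdraw(1, 30)
rejected = withdraw(1, 500)
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
recorded = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]

print(accepted, rejected, balance, recorded)
```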

Top Technical Interviews Questions and Answers for AWS Cloud, Java, Oracle

December 30, 2021
