Tutorial: Data Processing with Pandas Dataframe

Published in Cambridge Spark

5 min read · Jan 23, 2020

The purpose of this tutorial is to teach you how to process data with Pandas DataFrame.

At the end of this tutorial, you will be able to:

load a dataset,
explore data and rename columns,
check and select columns,
change columns’ names,
describe data,
identify missing values,
iterate over rows and columns,
group data items,
concatenate dataframes.

Resources

For this tutorial, we need Python and the NumPy, Pandas, and Matplotlib libraries. The versions of the libraries used in this tutorial are as follows.
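
To compare against your own environment, the installed versions can be printed with a short sketch like the one below. It uses only the standard library's importlib.metadata; the package names are the usual distribution names and are an assumption about how the libraries were installed.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Python itself is the interpreter, not an installable library.
print("python", sys.version.split()[0])

# Look up the installed version of each library by its distribution name.
versions = {}
for package in ("numpy", "pandas", "matplotlib"):
    try:
        versions[package] = version(package)
    except PackageNotFoundError:
        versions[package] = "not installed"
    print(package, versions[package])
```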

Data

For this tutorial, we will be working with the Titanic dataset. You can download it from https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html.

Loading Data

Download the titanic.csv file to your computer, then read the data using the following piece of code:

import pandas as pd
df = pd.read_csv("your_file_location/titanic.csv")

Exploring Data and Renaming Columns

First of all, let’s look at the first rows of the dataset to see what it looks like.

Note:
By entering a number into the brackets, such as df.head(3), you specify how many rows are shown.
If you leave it empty, it displays the first five rows by default.
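
A minimal sketch of both calls is shown here. The small DataFrame is a made-up stand-in for titanic.csv so that the example runs on its own; the column names and values are illustrative assumptions.

```python
import pandas as pd

# Tiny made-up sample standing in for titanic.csv (illustrative values only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Name": ["Passenger A", "Passenger B", "Passenger C",
             "Passenger D", "Passenger E", "Passenger F"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
})

first_five = df.head()    # no argument: the first five rows
first_three = df.head(3)  # the first three rows

print(first_five)
print(first_three)
```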

Checking and Selecting Columns

Next, let’s check which columns we have.
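
One way to do this is through the columns attribute. The sketch below uses a made-up two-row DataFrame in place of the real file, so the column names are assumptions.

```python
import pandas as pd

# Made-up stand-in for the loaded dataset.
df = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.28],
})

print(df.columns)           # an Index object holding the column labels
column_names = df.columns.tolist()
print(column_names)         # the same labels as a plain Python list
```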

Then, we can specify which columns to use by selecting them.
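
Selecting columns can be sketched as follows. Passing a list of labels inside the square brackets returns a new DataFrame with only those columns; the sample data and the chosen columns are illustrative assumptions.

```python
import pandas as pd

# Made-up stand-in for the loaded dataset.
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Age": [22.0, 38.0, 26.0],
})

# A list of labels selects several columns at once.
df_selected = df[["Survived", "Pclass", "Age"]]
print(df_selected)
```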

Then, we can print the last five rows and the data types to see what the new dataframe looks like.
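
A minimal sketch, assuming a made-up six-row sample in place of the real data: tail() returns the last five rows by default, and the dtypes attribute lists the data type of each column.

```python
import pandas as pd

# Made-up stand-in for the loaded dataset.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
})
df_selected = df[["Survived", "Pclass", "Age"]]

last_five = df_selected.tail()  # no argument: the last five rows
print(last_five)
print(df_selected.dtypes)       # one data type per column
```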

Changing the Columns’ Names

Let’s change the names of the columns. We will be working on our original data with eight columns.
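
A dictionary that maps old names to new names could be built as sketched below. The old names are an assumption about the header of the Stanford titanic.csv file; the new names are the ones listed in the Concatenating section of this tutorial.

```python
# Assumed original header of the Stanford titanic.csv file.
old_names = ["Survived", "Pclass", "Name", "Sex", "Age",
             "Siblings/Spouses Aboard", "Parents/Children Aboard", "Fare"]

# New names as used later in this tutorial.
new_names = ["SURVIVED", "PCLASS", "NAME", "SEX", "AGE",
             "SIBSA", "PARCA", "FARE"]

# Pair each old name with its new name.
column_map = dict(zip(old_names, new_names))
print(column_map)
```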

With the dictionary that maps old names to new names created, we can now change the names of the columns by passing it to the columns parameter of rename().
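
The rename step can be sketched as follows, using a shortened made-up DataFrame and a three-entry mapping for brevity. Note that rename() returns a new DataFrame and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

# Made-up stand-in with three of the columns.
df = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Fare": [7.25, 71.28],
})
column_map = {"Survived": "SURVIVED", "Pclass": "PCLASS", "Fare": "FARE"}

# rename() returns a renamed copy; df itself keeps its old labels.
df_renamed = df.rename(columns=column_map)
print(df_renamed.columns.tolist())
```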

Describing Data

Let’s check some basic statistics to understand our data better.

The describe() function gives us descriptive statistics that summarise the count, mean, standard deviation, minimum, maximum, and quantile values of the numeric columns. NaN values are ignored by default.
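
A small sketch with made-up values shows how the missing AGE entry is excluded from the count and the mean:

```python
import pandas as pd

# Made-up numeric sample; one AGE value is missing.
df = pd.DataFrame({
    "AGE": [22.0, 38.0, None, 35.0],
    "FARE": [7.25, 71.28, 7.92, 53.10],
})

summary = df.describe()
print(summary)

# The missing AGE value is not counted.
print(summary.loc["count"])
```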

Missing Values

Pandas treats None and NaN as markers for missing or null values in data. Several functions are available to detect missing values in a Pandas DataFrame, such as:
isnull()
notnull()

Note:
The df.isnull() function returns a DataFrame of the same shape in which every value is True or False. True marks a null value.
df.notnull() does the opposite: True marks a value that is present.
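
Both functions can be tried on a made-up sample that contains one None and one NaN:

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing name and one missing age.
df = pd.DataFrame({
    "NAME": ["Passenger A", None, "Passenger C"],
    "AGE": [22.0, 38.0, np.nan],
})

null_mask = df.isnull()      # True where a value is missing
present_mask = df.notnull()  # True where a value is present

print(null_mask)
print(present_mask)
```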

Using any(), we can see for each column whether it contains any missing values.
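
A minimal sketch on made-up data: chaining any() after isnull() reduces each column to a single True or False.

```python
import numpy as np
import pandas as pd

# Made-up sample: only the AGE column has a missing value.
df = pd.DataFrame({
    "NAME": ["Passenger A", "Passenger B", "Passenger C"],
    "AGE": [22.0, 38.0, np.nan],
})

# One boolean per column: does the column contain at least one null?
has_missing = df.isnull().any()
print(has_missing)
```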

Let’s summarise the values along axis=1, that is, check for each row whether it contains any missing values.
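
The row-wise check can be sketched as follows on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up sample: rows 1 and 2 each have one missing value.
df = pd.DataFrame({
    "NAME": ["Passenger A", None, "Passenger C"],
    "AGE": [22.0, 38.0, np.nan],
})

# axis=1 reduces across the columns, giving one boolean per row.
rows_with_missing = df.isnull().any(axis=1)
print(rows_with_missing)
```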

But what if we want to use the isnull() function to display all rows where df has null values? In other words, what if we want to display the actual rows with null values instead of a dataframe of True or False cells? To do that, we write the following code:
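
One common way to do this is boolean indexing with the row-wise mask, sketched here on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up sample: rows 1 and 2 each have one missing value.
df = pd.DataFrame({
    "NAME": ["Passenger A", None, "Passenger C"],
    "AGE": [22.0, 38.0, np.nan],
})

# Keep only the rows whose mask entry is True.
rows_with_nulls = df[df.isnull().any(axis=1)]
print(rows_with_nulls)
```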

Iterating Over Rows and Columns

Let’s start with iterating over rows and applying user-defined functions. To iterate through rows, we use the iterrows() function. See the example below.
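
A minimal sketch, assuming made-up data and a hypothetical helper function age_group() that is not part of Pandas. iterrows() yields the index label together with the row as a Series.

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    "NAME": ["Passenger A", "Passenger B", "Passenger C"],
    "AGE": [8.0, 38.0, 70.0],
})

def age_group(age):
    """Hypothetical helper: label a passenger by age."""
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"

labels = []
for index, row in df.iterrows():
    # row is a Series, so values are accessed by column label.
    labels.append(age_group(row["AGE"]))
    print(index, row["NAME"], labels[-1])
```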

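To iterate through columns, we use the items() function. Older versions of Pandas also offered iteritems() for this, but it was deprecated and then removed in Pandas 2.0. See the example below.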
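
Column iteration can be sketched as follows on made-up data. items() yields each column label together with the column as a Series.

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    "NAME": ["Passenger A", "Passenger B"],
    "AGE": [22.0, 38.0],
})

seen_columns = []
for column_name, column in df.items():
    # column is a Series holding all values of that column.
    seen_columns.append(column_name)
    print(column_name, column.tolist())
```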

Grouping

The Pandas groupby() function splits the data into groups based on some criteria. In other words, grouping provides a mapping of labels to group names.

Let’s group our data according to PCLASS.
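
A minimal sketch on made-up data. Taking the mean of each group is an assumption made for illustration; other aggregations such as sum() or count() work the same way.

```python
import pandas as pd

# Made-up sample data with numeric columns only.
df = pd.DataFrame({
    "PCLASS": [1, 1, 2, 3, 3, 3],
    "SURVIVED": [1, 0, 1, 0, 0, 1],
    "FARE": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Split by PCLASS, then average every other column within each group.
grouped = df.groupby("PCLASS").mean()
print(grouped)  # PCLASS is now the index of the result
```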

After grouping, PCLASS becomes the index of the result. To restore PCLASS as a regular column, use reset_index().
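
This can be sketched as follows, again with made-up data and the mean as an assumed aggregation:

```python
import pandas as pd

# Made-up sample data with numeric columns only.
df = pd.DataFrame({
    "PCLASS": [1, 1, 2, 3, 3, 3],
    "FARE": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# reset_index() moves PCLASS from the index back into a column.
grouped_flat = df.groupby("PCLASS").mean().reset_index()
print(grouped_flat)
```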

We can plot the returned dataframe.
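
A minimal plotting sketch, assuming made-up data and a bar chart of the mean fare per class. The chart type, the output file name fare_by_pclass.png, and the non-interactive Agg backend are all choices made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    "PCLASS": [1, 1, 2, 3, 3, 3],
    "FARE": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})
grouped = df.groupby("PCLASS").mean().reset_index()

# DataFrame.plot() returns a Matplotlib Axes object.
ax = grouped.plot(kind="bar", x="PCLASS", y="FARE", legend=False)
ax.set_ylabel("Mean FARE")
plt.savefig("fare_by_pclass.png")  # hypothetical output file name
```

In a notebook or an interactive session, the backend line and savefig() can be dropped and plt.show() used instead.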

Concatenating

Pandas provides several functions for easily combining DataFrames. One of these functions is concat().

There are eight columns in our dataframe, namely SURVIVED, PCLASS, NAME, SEX, AGE, SIBSA, PARCA, and FARE. Let’s create three different dataframes from our dataframe (df), then combine them with the concat() function.

Now, we have three different dataframes.
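
The split and the concatenation can be sketched as follows. Splitting by columns and recombining with axis=1 is an assumption about how the three dataframes are formed; the rows are made-up sample values.

```python
import pandas as pd

# Made-up sample with the eight renamed columns.
df = pd.DataFrame({
    "SURVIVED": [0, 1, 1],
    "PCLASS": [3, 1, 3],
    "NAME": ["Passenger A", "Passenger B", "Passenger C"],
    "SEX": ["male", "female", "female"],
    "AGE": [22.0, 38.0, 26.0],
    "SIBSA": [1, 1, 0],
    "PARCA": [0, 0, 0],
    "FARE": [7.25, 71.28, 7.92],
})

# Three dataframes, each holding a subset of the columns.
df1 = df[["SURVIVED", "PCLASS", "NAME"]]
df2 = df[["SEX", "AGE"]]
df3 = df[["SIBSA", "PARCA", "FARE"]]

# axis=1 places the dataframes side by side, aligned on the row index.
combined = pd.concat([df1, df2, df3], axis=1)
print(combined)
```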

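Another way of combining DataFrames was the append() instance method, which concatenates along axis=0. Note that DataFrame.append() was deprecated in Pandas 1.4 and removed in Pandas 2.0, so in current versions use concat() along axis=0 instead.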
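
Row-wise concatenation with concat() can be sketched as follows on made-up data. It stacks the dataframes on top of each other, which is what append() did.

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    "NAME": ["Passenger A", "Passenger B", "Passenger C"],
    "AGE": [22.0, 38.0, 26.0],
})

top = df.iloc[:2]     # first two rows
bottom = df.iloc[2:]  # remaining row

# axis=0 (the default) stacks rows; original index labels are kept.
stacked = pd.concat([top, bottom], axis=0)
print(stacked)
```

Passing ignore_index=True to concat() renumbers the rows from zero, which is useful when the pieces have overlapping index labels.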

Summary

Congratulations, you have reached the end of the Data Processing with Pandas DataFrame tutorial!

AUTHOR: DILEK CELIK

A PhD candidate in Computer Science and Information Systems at Birkbeck College, University of London. IBM, Stanford University and MIT certified professional in Data Science and Machine Learning with advanced Java, Python, R, Data Science and Machine Learning expertise and experiences. Teaching Assistant in Machine Learning, R, Python, and Java Modules of University College London and Birkbeck College, University of London.
