Programming with Python
Indexing, slicing and subsetting DataFrames with pandas
Learning Objectives
- Learn about 0-based indexing in Python.
- Learn about numeric vs. label based indexes.
- Learn how to select subsets of data from a DataFrame using Slicing and Indexing methods.
- Understand what a boolean object is and how it can be used to “mask” or identify particular sets of values within another object.
We will continue to use the surveys dataset that we worked with in the last exercise. You can reopen it like this:
# first make sure pandas is loaded
import pandas as pd
# read in the survey csv
surveys_df = pd.read_csv("surveys.csv")
Indexing & slicing in Python
We often want to work with subsets of a DataFrame object instead of the whole thing. There are different ways to break the data up into pieces, either by using labels (column headings), numeric ranges or specific x,y index locations.
Selecting data using labels (column headings)
We use square brackets []
to select a subset of a Python object. For example, we can select all of the data from the column species
in the surveys_df
DataFrame by requesting it by name:
surveys_df['species']
Using the column heading as an attribute of the Python object gives the same output:
surveys_df.species
We can also create a new DataFrame to contain only the subset of data we are interested in:
# create an object named surveys_species that only contains the species column
surveys_species = surveys_df['species']
We can also use a list of column names as an index to extract all of those columns from the DataFrame. The columns of this subset of data will be in the same order as the list of column names. This is very useful when we need to reorganize our data!
# select the species and plot columns from the DataFrame
surveys_df[['species', 'plot']]
# what happens when you flip the order?
surveys_df[['plot', 'species']]
Extracting range based subsets: Slicing
Just as we did with lists in the previous lesson, we use the square brackets []
to extract a subset of a DataFrame:
# select the first, second and third rows from the surveys variable
surveys_df[0:3]
# select the first 5 rows (rows 0,1,2,3,4)
surveys_df[:5]
# select the last element in the list
surveys_df[-1:]
We can also use slicing to re-assign different values to some subset of the DataFrame. Like lists, DataFrames are mutable objects, meaning that they can be changed in place. Before continuing, let’s make a copy of our DataFrame to play with so we don’t accidentally modify the original imported data.
# copy the surveys dataframe so we don't modify the original DataFrame
surveys_copy = surveys_df.copy()
# set the first three rows of data in the DataFrame to 0
surveys_copy[0:3] = 0
Slicing subsets of rows and columns in Python
We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing. We have to use one of two methods, depending on what we use as indices:
loc
: indexing via labelsiloc
: indexing via integers
To select a subset of rows AND columns from our DataFrame, we can use the iloc
method. For example, we can select month, day and year (columns 2, 3 and 4 if we start counting at 1), like this:
surveys_df.iloc[0:3, 1:4]
month day year
0 7 16 1977
1 7 16 1977
2 7 16 1977
Notice that we asked for a slice from 0:3 and this yielded 3 rows of data. When you ask for 0:3, you are actually telling python to start at index 0 and select rows 0, 1, 2 up to but not including 3.
Let’s explore some other ways to index and select subsets of data:
# select all columns for rows of index values 0 and 10
surveys_df.loc[[0,10], :]
# select only three columns of the first row
surveys_df.loc[0, ['species', 'plot', 'wgt']]
We can also select a specific data value according to the specific row and column location within the data frame using the iloc
method:
# dat.iloc[row,column]
surveys_df.iloc[2,6]
'F'
Test your understanding of indices
What do each of these commands do?
{.python} surveys_df[0:3] surveys_df[:5] surveys_df[-1:]What is the difference between these two commands?
{.python} surveys_df.iloc[0:4, 1:4] surveys_df.loc[0:4, 1:4]
Subsetting Data Using Criteria
We can also select a subset of our data using some criteria. For example, we can select all rows that have a year value of 2002:
surveys_df[surveys_df.year == 2002]
record_id month day year plot species sex wgt
33320 33321 1 12 2002 1 DM M 44
33321 33322 1 12 2002 1 DO M 58
33322 33323 1 12 2002 1 PB M 45
33323 33324 1 12 2002 1 AB NaN NaN
33324 33325 1 12 2002 1 DO M 29
33325 33326 1 12 2002 2 OT F 26
33326 33327 1 12 2002 2 OT M 24
... ... ... ... ... ... ... ... ...
35541 35542 12 31 2002 15 PB F 29
35542 35543 12 31 2002 15 PB F 34
35543 35544 12 31 2002 15 US NaN NaN
35544 35545 12 31 2002 15 AH NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN
35546 35547 12 31 2002 10 RM F 14
35547 35548 12 31 2002 7 DO M 51
35548 35549 12 31 2002 5 NaN NaN NaN
[2229 rows x 8 columns]
We can also select all rows that do not contain the year 2002.
surveys_df[surveys_df.year != 2002]
We can use sets of multiple criteria, too:
surveys_df[(surveys_df.year >= 1980) & (surveys_df.year <= 1985)]
Subset puzzles
- Select a subset of rows in the
surveys_df
DataFrame that contain data from the year 1999 and weight values less than or equal to 8. How many columns did you end up with? What did your neighbor get? - You can use the
isin
method in Python to query a DataFrame based upon a list of values like this:surveys_df[surveys_df['sex'].isin([listGoesHere])]
. Use theisin
function to find all plots that contain species of sex “Z” or sex “R” or sex “P” in the surveys DataFrame. How many records contain these values? - Create a query that finds all rows with a weight value > or equal to 0.
- The
~
symbol in Python can be used to return the OPPOSITE of the selection you specified. It is equivalent to is not in. Write a query that selects all rows that are not of sex “M” or “F” in the data.
Using Masks
A mask can be useful to locate where a particular subset of values exist or don’t exist. Masks are BOOLEAN
objects. Booleans can only take the values of true
or false
:
# set x to 5
x = 5
# what does this return?
x > 5
False
# what about this?
x == 5
True
When Python creates a mask, it performs the comparison operation for each value in the object. The mask is a new object that is the same shape as the original object but with a True or False value for each index location.
Let’s identify all locations in the survey data that have null (missing or NaN, for Not-a-Number) data values. We can use the isnull
method to do this. Each cell that has a null value will be assigned a value of True
in the new mask:
pd.isnull(surveys_df)
record_id month day year plot species sex wgt
0 False False False False False True False True
1 False False False False False True False True
2 False False False False False False False True
3 False False False False False False False True
4 False False False False False False False True
5 False False False False False False False True
6 False False False False False False False True
7 False False False False False False False True
8 False False False False False False False True
9 False False False False False False False True
10 False False False False False False False True
11 False False False False False False False True
[35549 rows x 8 columns]
To select the rows where there are null values, we can use the mask as an index to subset the data:
# Use the .any method to select the rows (axis=1) with one or more NaN
surveys_df[pd.isnull(surveys_df).any(axis=1)]
Note that there are many null or NaN values in the wgt
column of our DataFrame. We will explore different ways of dealing with these later.
We can also run isnull
on a particular column. We are using the Boolean object as an index and then asking python to select rows that have a NaN
value for weight:
emptyWeights = surveys_df[pd.isnull(surveys_df).any(axis=1)]['wgt']
Test your logic
- Create a new DataFrame that only contains observations with sex values that are not female or male. Assign each sex value in the new DataFrame to a new value of ‘x’. Determine the number of null values in the subset.
- Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by plot with male vs female values stacked for each plot.