Boolean indexing is a powerful feature in pandas that allows you to filter data from a DataFrame or Series based on a condition or a set of conditions. It's a critical tool in any data analyst's toolbox. This tutorial will guide you through using boolean indexing in pandas.
First, let's set up the environment and create a sample DataFrame:
    import pandas as pd

    # Sample DataFrame
    data = {
        'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50],
        'C': ['p', 'q', 'r', 's', 't']
    }
    df = pd.DataFrame(data)

    print("Original DataFrame:")
    print(df)
Filter rows where values in column 'A' are greater than 3:
    # The comparison produces a boolean Series; indexing with it keeps the True rows
    filtered_df = df[df['A'] > 3]

    print("\nFiltered DataFrame (A > 3):")
    print(filtered_df)
Filter rows where values in column 'A' are greater than 2 and values in column 'B' are less than 40:
    # Combine two conditions element-wise with &
    filtered_df = df[(df['A'] > 2) & (df['B'] < 40)]

    print("\nFiltered DataFrame (A > 2 and B < 40):")
    print(filtered_df)
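Note: When combining conditions, always use & (and), | (or), and ~ (not), and wrap each condition in parentheses. Python's and, or, and not keywords do not operate element-wise on a Series, and &, |, and ~ bind more tightly than comparison operators, so an unparenthesized expression is evaluated in the wrong order.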
Using isin(): If you want to filter rows by membership in a list of values:
    values = ['p', 's']
    filtered_df = df[df['C'].isin(values)]

    print("\nFiltered DataFrame (C in ['p', 's']):")
    print(filtered_df)
Using ~ for negation: To select rows where column 'C' is NOT in the list of values:
    values = ['p', 's']
    filtered_df = df[~df['C'].isin(values)]

    print("\nFiltered DataFrame (C not in ['p', 's']):")
    print(filtered_df)
You can combine boolean indexing with other DataFrame operations:
    # Count the rows that satisfy the condition
    count = df[df['A'] > 2].shape[0]
    print(f"\nNumber of rows where A > 2: {count}")
    # Select rows and a column in one step with .loc, then aggregate
    mean_val = df.loc[df['A'] > 2, 'B'].mean()
    print(f"\nMean of column 'B' where A > 2: {mean_val}")
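Because a boolean mask is itself a Series of True/False values, it can also be aggregated directly without filtering the frame first. The following is a small sketch of that idea; the names `mask`, `count`, `share`, and `total_b` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
})

mask = df['A'] > 2

# True counts as 1 and False as 0, so sum() counts matching rows
count = int(mask.sum())

# mean() of a boolean mask gives the fraction of matching rows
share = float(mask.mean())

# .loc takes the mask for rows and a label for the column
total_b = int(df.loc[mask, 'B'].sum())

print(count, share, total_b)
```

Summing the mask avoids building an intermediate filtered DataFrame when only the count is needed.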
Combining all the steps, you'll get:
    import pandas as pd

    # Sample DataFrame
    data = {
        'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50],
        'C': ['p', 'q', 'r', 's', 't']
    }
    df = pd.DataFrame(data)

    print("Original DataFrame:")
    print(df)

    # Boolean indexing
    print("\nFiltered DataFrame (A > 3):")
    print(df[df['A'] > 3])

    print("\nFiltered DataFrame (A > 2 and B < 40):")
    print(df[(df['A'] > 2) & (df['B'] < 40)])

    print("\nFiltered DataFrame (C in ['p', 's']):")
    print(df[df['C'].isin(['p', 's'])])

    print("\nFiltered DataFrame (C not in ['p', 's']):")
    print(df[~df['C'].isin(['p', 's'])])

    # Combining boolean indexing with other operations
    print(f"\nNumber of rows where A > 2: {df[df['A'] > 2].shape[0]}")
    print(f"\nMean of column 'B' where A > 2: {df.loc[df['A'] > 2, 'B'].mean()}")
This tutorial offers a foundational understanding of boolean indexing in pandas. It's a versatile tool that can be combined with other functions and methods for more complex data manipulations.
Python Pandas boolean indexing examples:
    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Boolean mask for filtering
    mask = df['A'] > 2

    # Apply boolean mask to filter data
    filtered_data = df[mask]
Filtering data with boolean conditions in Pandas:
    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Filter data using a boolean condition written inline
    filtered_data = df[df['A'] > 2]
Indexing and selecting data with boolean arrays in Pandas:
    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Create a boolean array with one entry per row
    bool_array = [True, False, True, False, True]

    # Select data using boolean array
    selected_data = df[bool_array]
Applying multiple boolean conditions to Pandas DataFrame:
    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Multiple boolean conditions
    condition1 = df['A'] > 2
    condition2 = df['B'] == 'X'

    # Combine conditions using logical operators
    combined_condition = condition1 & condition2

    # Apply combined condition to filter data
    filtered_data = df[combined_condition]
Creating boolean masks for advanced data selection in Pandas:
    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Create boolean masks
    mask1 = df['A'] > 2
    mask2 = df['B'] == 'X'

    # Combine masks using logical operators
    combined_mask = mask1 & mask2

    # Apply combined mask to filter data
    filtered_data = df[combined_mask]
Combining boolean indexing with other Pandas operations:
    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Boolean mask for filtering
    mask = df['A'] > 2

    # Assign to column 'B' only in the rows selected by the mask (modifies df in place)
    df.loc[mask, 'B'] = 'Z'
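When the original frame should stay untouched, the same kind of mask can be passed to `Series.mask` or `Series.where`, which return a new Series instead of assigning in place. This is a sketch of that alternative; the replacement strings `'Z'` and `'other'` and the result names are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})
mask = df['A'] > 2

# Series.mask replaces values where the condition is True
replaced = df['B'].mask(mask, 'Z')

# Series.where keeps values where the condition is True and replaces the rest
kept = df['B'].where(mask, 'other')

print(replaced.tolist())
print(kept.tolist())
print(df['B'].tolist())  # the original column is unchanged
```

Choosing between `.loc` assignment and `mask`/`where` mostly comes down to whether an in-place update or a new object is wanted.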
Boolean indexing for missing data handling in Pandas:
    import pandas as pd

    # Sample DataFrame with missing values
    df = pd.DataFrame({'A': [1, 2, 3, None, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Boolean mask for missing values
    mask = df['A'].isna()

    # Replace missing values based on the boolean mask
    df.loc[mask, 'A'] = 0
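One detail worth seeing alongside the missing-data example: a comparison against NaN evaluates to False, so a row with a missing value matches neither a condition nor its complementary comparison. The sketch below illustrates this on the same sample data; the names `gt`, `le`, `covered`, and `complete` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, None, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

# Comparisons involving NaN are False in both directions
gt = df['A'] > 2
le = df['A'] <= 2

# The row with the missing value is covered by neither mask
covered = gt | le

# notna() selects the rows that actually have a value in 'A'
complete = df[df['A'].notna()]

print(gt.tolist())
print(le.tolist())
print(covered.tolist())
print(complete)
```

For this reason it can help to decide explicitly how missing rows should be treated (for example with `isna()` or `notna()`) before applying other filters.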
Using boolean indexing with categorical data in Pandas:
    import pandas as pd

    # Sample DataFrame with categorical column
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})
    df['B'] = df['B'].astype('category')

    # Boolean mask for categorical values
    mask = df['B'] == 'X'

    # Apply boolean mask to filter data
    filtered_data = df[mask]
Efficient boolean indexing techniques in Pandas:
    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # The same compound filter written as a query() expression,
    # which is often more concise than chaining parenthesized masks
    filtered_data = df.query('A > 2 and B == "X"')
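In practice the values in a filter usually come from variables rather than literals. Inside a `query()` expression, a local Python variable can be referenced with the `@` prefix. The sketch below compares that form with the equivalent mask; `threshold` and `label` are hypothetical names chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

threshold = 2
label = 'X'

# @name refers to a Python variable in the calling scope
via_query = df.query('A > @threshold and B == @label')

# The same filter written as an explicit boolean mask
via_mask = df[(df['A'] > threshold) & (df['B'] == label)]

print(via_query)
print(via_query.equals(via_mask))
```

Both forms select the same rows here, so the choice is largely about readability.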
Pandas boolean indexing vs. traditional indexing:
    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Traditional indexing: select rows by position
    traditional_data = df.iloc[2:]

    # Boolean indexing: select rows by condition
    boolean_data = df[df['A'] > 2]
Applying boolean indexing to time-series data in Pandas:
    import datetime

    import pandas as pd

    # Sample DataFrame with time-series data
    df = pd.DataFrame(
        {'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']},
        index=pd.date_range('2022-01-01', periods=5, freq='D')
    )

    # Boolean mask for time-based filtering
    mask = df.index > datetime.datetime(2022, 1, 3)

    # Apply boolean mask to filter time-series data
    filtered_data = df[mask]
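Masks on a DatetimeIndex can be combined like any other conditions, and the index also exposes calendar attributes that can be compared directly. The following sketch selects a closed date window and then the weekend days in the same sample range; `start`, `end`, `in_window`, and `weekend` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3, 4, 5]},
    index=pd.date_range('2022-01-01', periods=5, freq='D'),
)

start = pd.Timestamp('2022-01-02')
end = pd.Timestamp('2022-01-04')

# Rows whose date falls inside the window, both ends included
in_window = df[(df.index >= start) & (df.index <= end)]

# dayofweek is 0 for Monday through 6 for Sunday, so >= 5 means a weekend
weekend = df[df.index.dayofweek >= 5]

print(in_window)
print(weekend)
```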
Code examples for effective boolean indexing in Pandas:
    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['X', 'Y', 'X', 'Y', 'X']})

    # Boolean mask for filtering
    mask = (df['A'] > 2) & (df['B'] == 'X')

    # Apply boolean mask to filter data
    filtered_data = df[mask]