Pandas Tutorial
The groupby method in pandas is a powerful tool for splitting a DataFrame into subsets according to some criteria. It's particularly useful for aggregating data, computing summary statistics, and restructuring data in various ways.
Here's a step-by-step tutorial on using groupby in pandas:
Ensure you have the required libraries:
pip install pandas
import pandas as pd
For this tutorial, let's create a sample DataFrame:
data = {
    'Department': ['IT', 'HR', 'Finance', 'IT', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [55000, 60000, 65000, 58000, 62000]
}
df = pd.DataFrame(data)
To group data by the 'Department' column:
grouped = df.groupby('Department')
This creates a GroupBy object. It hasn't actually computed anything yet, but it has some useful methods and attributes.
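As a rough illustration of those methods and attributes, the sketch below rebuilds the sample DataFrame and inspects the GroupBy object with ngroups, groups, get_group() and size(). The variable names are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Finance', 'IT', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [55000, 60000, 65000, 58000, 62000],
})
grouped = df.groupby('Department')

n_groups = grouped.ngroups             # number of distinct departments
group_labels = sorted(grouped.groups)  # group keys (groups maps key -> row labels)
it_rows = grouped.get_group('IT')      # only the rows of a single group
group_sizes = grouped.size()           # number of rows in each group

print(n_groups)
print(group_labels)
print(it_rows)
print(group_sizes)
```

None of these calls aggregates the Salary column; they only describe how the rows were split.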
Once grouped, we can aggregate data in various ways:
mean_salaries = grouped['Salary'].mean()
print(mean_salaries)
Output:
Department
Finance    65000.0
HR         61000.0
IT         56500.0
Name: Salary, dtype: float64
# Several aggregations on a single column
aggregations = grouped['Salary'].agg(['mean', 'sum', 'max', 'min'])
print(aggregations)
# Different aggregations for different columns
result = grouped.agg({
    'Salary': ['mean', 'sum'],
    'Employee': 'count'
})
print(result)
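Passing a dict with a list of functions yields two-level column labels such as ('Salary', 'mean'). If flat column names are preferred, named aggregation is one alternative. The sketch below assumes the same sample data; the output column names (mean_salary, total_salary, headcount) are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Finance', 'IT', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [55000, 60000, 65000, 58000, 62000],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = df.groupby('Department').agg(
    mean_salary=('Salary', 'mean'),
    total_salary=('Salary', 'sum'),
    headcount=('Employee', 'count'),
)
print(summary)
```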
You can iterate over each group in a GroupBy object:
for department, group_data in grouped:
    print(department)
    print(group_data, '\n')
Suppose we want to filter groups based on some criteria:
# Filter departments with average salary greater than 60000
filtered = grouped.filter(lambda x: x['Salary'].mean() > 60000)
print(filtered)
You can transform the values in each group:
# Deduct 5000 from each salary in the 'IT' department
# Select 'Salary' first so each group arrives as a Series whose .name is the group key
deducted_salary = grouped['Salary'].transform(
    lambda s: s - 5000 if s.name == 'IT' else s
)
print(deducted_salary)
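A common use of transform is to broadcast a per-group statistic back onto every row, so the result lines up with the original DataFrame. The sketch below assumes the same sample data; the new column names DeptMeanSalary and DiffFromDeptMean are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Finance', 'IT', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [55000, 60000, 65000, 58000, 62000],
})

# transform returns one value per original row, aligned to df's index
df['DeptMeanSalary'] = df.groupby('Department')['Salary'].transform('mean')

# Difference between each salary and its department average
df['DiffFromDeptMean'] = df['Salary'] - df['DeptMeanSalary']
print(df)
```

Unlike an aggregation, which returns one row per group, the transformed result keeps one row per employee.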
You can group by multiple columns:
df['Experience'] = ['Senior', 'Junior', 'Senior', 'Junior', 'Senior']
grouped_multi = df.groupby(['Department', 'Experience'])
# Compute mean salary
mean_salaries_multi = grouped_multi['Salary'].mean()
print(mean_salaries_multi)
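Grouping by two columns returns a Series indexed by (Department, Experience) pairs. One way to read it as a table, sketched below on the same sample data, is to pivot the inner level into columns with unstack(). The name wide is illustrative, and combinations with no rows show up as NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Finance', 'IT', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [55000, 60000, 65000, 58000, 62000],
    'Experience': ['Senior', 'Junior', 'Senior', 'Junior', 'Senior'],
})

mean_salaries_multi = df.groupby(['Department', 'Experience'])['Salary'].mean()

# Look up a single (Department, Experience) combination
hr_senior = mean_salaries_multi.loc[('HR', 'Senior')]

# Move the 'Experience' level from the row index into columns
wide = mean_salaries_multi.unstack('Experience')
print(wide)
```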
The groupby method in pandas is versatile and provides numerous options for data aggregation, transformation, and filtering. It plays a pivotal role in exploratory data analysis and preprocessing, helping generate insights and prepare data for further analysis or visualization.
GroupBy in Pandas with examples:
The groupby() function in Pandas is used for grouping rows based on specified columns.
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'], 'Value': [10, 15, 20, 25]})
# GroupBy 'Category'
grouped = df.groupby('Category')
Aggregate functions in Pandas GroupBy:
Aggregate functions such as sum(), mean(), count(), etc., can be applied.
# Aggregate using sum
sum_values = grouped['Value'].sum()
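Pandas GroupBy multiple columns:
Pass a list of column names to the groupby() function. The sample DataFrame has no second key column, so the example adds a hypothetical one named 'AnotherColumn'.
# Hypothetical second key column, added only for illustration
df_two_keys = df.assign(AnotherColumn=['X', 'Y', 'X', 'Y'])
# GroupBy multiple columns
grouped_multiple = df_two_keys.groupby(['Category', 'AnotherColumn'])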
GroupBy and sum in Pandas:
Use the sum() function to get the sum of values within each group.
# GroupBy and sum
sum_values = grouped['Value'].sum()
Pandas GroupBy count unique values:
The nunique() function counts the number of unique values within each group.
# GroupBy and count unique values
unique_counts = grouped['Value'].nunique()
How to reset index after GroupBy in Pandas:
Use reset_index() to move the grouping keys from the index back into regular DataFrame columns.
# Reset index after GroupBy
result = sum_values.reset_index()
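As a small sketch of the same idea, the block below compares reset_index() with passing as_index=False to groupby(), which keeps the grouping key as a column from the start. It rebuilds the Category/Value sample; the names via_reset and via_flag are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'], 'Value': [10, 15, 20, 25]})

# Option 1: aggregate, then move the 'Category' index into a column
via_reset = df.groupby('Category')['Value'].sum().reset_index()

# Option 2: ask groupby not to use 'Category' as the index at all
via_flag = df.groupby('Category', as_index=False)['Value'].sum()

print(via_reset)
print(via_flag)
```

Both options should produce a DataFrame with 'Category' and 'Value' columns and a default integer index.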
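GroupBy and apply function in Pandas:
Use the apply() function after grouping to run a custom function on each group. Here custom_function is a placeholder, defined only for illustration, that returns the range of values in a group.
# GroupBy and apply custom function
def custom_function(values):
    return values.max() - values.min()
result = grouped['Value'].apply(custom_function)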
Pandas GroupBy mean and median:
Use the mean() and median() functions to compute the average and the middle value within each group.
# GroupBy and mean
mean_values = grouped['Value'].mean()
# GroupBy and median
median_values = grouped['Value'].median()
GroupBy and filter in Pandas:
Use filter() to keep only the rows of groups that satisfy a condition.
# GroupBy and filter: keep groups whose 'Value' total exceeds 30
filtered_groups = grouped.filter(lambda x: x['Value'].sum() > 30)