Pandas Tutorial
Creating Objects
Viewing Data
Selection
Manipulating Data
Grouping Data
Merging, Joining and Concatenating
Working with Date and Time
Working With Text Data
Working with CSV and Excel files
Operations
Visualization
Applications and Projects
Aggregation is the process of combining multiple values into a single value, and it's one of the most common data manipulation tasks. Pandas provides several methods to facilitate this.
In this tutorial, we will go through various aggregation techniques across one or more columns using the pandas library.
Ensure you have pandas:
pip install pandas
import pandas as pd
data = { 'A': ['foo', 'foo', 'foo', 'bar', 'bar'], 'B': ['one', 'one', 'two', 'two', 'one'], 'C': [1, 2, 3, 4, 5], 'D': [10, 20, 30, 40, 50] } df = pd.DataFrame(data) print(df)
The groupby
method is the primary aggregation method in pandas.
To find the sum of column 'C' for each unique value in column 'A':
print(df.groupby('A')['C'].sum())
To aggregate multiple columns, just pass a list:
print(df.groupby('A')[['C', 'D']].sum())
You can group by multiple columns:
print(df.groupby(['A', 'B']).sum())
Pandas allows multiple aggregation functions:
agg()
method:The agg
method provides flexibility to aggregate using multiple functions at once:
print(df.groupby('A').agg({ 'C': ['sum', 'mean', 'max'], 'D': ['mean', 'std'] }))
You can directly use methods like mean()
, sum()
, max()
, min()
, etc. on groupby objects:
print(df.groupby('A')['C'].mean())
You can also use custom aggregation functions:
def custom_agg(series): return (series.max() - series.min()) print(df.groupby('A')['C'].agg(custom_agg))
Grouping often results in a MultiIndex or a changed index in the result. If you want a flat table as output, use the reset_index
method:
print(df.groupby('A')[['C', 'D']].sum().reset_index())
This feature allows clearer naming of the resulting columns:
result = df.groupby('A').agg( max_C=pd.NamedAgg(column='C', aggfunc='max'), sum_D=pd.NamedAgg(column='D', aggfunc='sum') ) print(result)
Aggregating data in pandas is straightforward thanks to the powerful and flexible groupby
mechanism. From simple operations to complex custom aggregations, you can handle a wide range of scenarios efficiently.
Aggregating data by column in Pandas:
groupby()
function and apply aggregation functions.import pandas as pd data = {'Category': ['A', 'B', 'A', 'B', 'A'], 'Value': [10, 15, 20, 25, 30]} df = pd.DataFrame(data) # Aggregating by 'Category' column aggregated_data = df.groupby('Category').sum()
GroupBy and aggregate in Pandas:
groupby()
function along with aggregation functions like sum()
, mean()
, etc.aggregated_data = df.groupby('Category')['Value'].sum()
Aggregate multiple columns in Pandas DataFrame:
agg()
function.aggregated_data = df.groupby('Category').agg({'Value': 'sum', 'OtherColumn': 'mean'})
Aggregation functions in Pandas:
sum()
, mean()
, median()
, etc.sum_value = df['Value'].sum() mean_value = df['Value'].mean() median_value = df['Value'].median()
Pandas sum by column and group:
sum_by_group = df.groupby(['Category', 'OtherColumn'])['Value'].sum()
Aggregate mean and median in Pandas by column:
mean_and_median = df.groupby('Category')['Value'].agg(['mean', 'median'])
Aggregating data by multiple columns in Pandas:
aggregated_data = df.groupby(['Category', 'OtherColumn']).sum()
Custom aggregation functions in Pandas:
def custom_agg_function(values): # Custom aggregation logic return values.max() - values.min() custom_aggregation = df.groupby('Category')['Value'].agg(custom_agg_function)
Rolling window aggregation in Pandas:
rolling()
method.rolling_sum = df['Value'].rolling(window=3).sum()