Pandas Tutorial

Creating Objects

Viewing Data

Selection

Manipulating Data

Grouping Data

Merging, Joining and Concatenating

Working with Date and Time

Working With Text Data

Working with CSV and Excel files

Operations

Visualization

Applications and Projects

Aggregation data across one or more column in Pandas

Aggregation is the process of combining multiple values into a single value, and it's one of the most common data manipulation tasks. Pandas provides several methods to facilitate this.

In this tutorial, we will go through various aggregation techniques across one or more columns using the pandas library.

1. Setup:

Ensure you have pandas:

pip install pandas

2. Import Necessary Libraries:

import pandas as pd

3. Create a Sample DataFrame:

data = {
    'A': ['foo', 'foo', 'foo', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'two', 'one'],
    'C': [1, 2, 3, 4, 5],
    'D': [10, 20, 30, 40, 50]
}

df = pd.DataFrame(data)
print(df)

4. GroupBy Aggregation:

The groupby method is the primary aggregation method in pandas.

a. Single Column Aggregation:

To find the sum of column 'C' for each unique value in column 'A':

print(df.groupby('A')['C'].sum())

b. Multiple Column Aggregation:

To aggregate multiple columns, just pass a list:

print(df.groupby('A')[['C', 'D']].sum())

c. Multiple Column Grouping:

You can group by multiple columns:

print(df.groupby(['A', 'B']).sum())

5. Aggregation using Different Functions:

Pandas allows multiple aggregation functions:

a. Using agg() method:

The agg method provides flexibility to aggregate using multiple functions at once:

print(df.groupby('A').agg({
    'C': ['sum', 'mean', 'max'],
    'D': ['mean', 'std']
}))

b. Built-in functions:

You can directly use methods like mean(), sum(), max(), min(), etc. on groupby objects:

print(df.groupby('A')['C'].mean())

6. Custom Aggregation:

You can also use custom aggregation functions:

def custom_agg(series):
    return (series.max() - series.min())

print(df.groupby('A')['C'].agg(custom_agg))

7. Resetting Index after GroupBy:

Grouping often results in a MultiIndex or a changed index in the result. If you want a flat table as output, use the reset_index method:

print(df.groupby('A')[['C', 'D']].sum().reset_index())

8. Named Aggregations (since Pandas 0.25.0):

This feature allows clearer naming of the resulting columns:

result = df.groupby('A').agg(
    max_C=pd.NamedAgg(column='C', aggfunc='max'),
    sum_D=pd.NamedAgg(column='D', aggfunc='sum')
)
print(result)

Summary:

Aggregating data in pandas is straightforward thanks to the powerful and flexible groupby mechanism. From simple operations to complex custom aggregations, you can handle a wide range of scenarios efficiently.

  1. Aggregating data by column in Pandas:

    • Aggregate data based on specific columns using the groupby() function and apply aggregation functions.
    import pandas as pd
    
    data = {'Category': ['A', 'B', 'A', 'B', 'A'],
            'Value': [10, 15, 20, 25, 30]}
    
    df = pd.DataFrame(data)
    
    # Aggregating by 'Category' column
    aggregated_data = df.groupby('Category').sum()
    
  2. GroupBy and aggregate in Pandas:

    • Use the groupby() function along with aggregation functions like sum(), mean(), etc.
    aggregated_data = df.groupby('Category')['Value'].sum()
    
  3. Aggregate multiple columns in Pandas DataFrame:

    • Aggregate multiple columns simultaneously using the agg() function.
    aggregated_data = df.groupby('Category').agg({'Value': 'sum', 'OtherColumn': 'mean'})
    
  4. Aggregation functions in Pandas:

    • Explore various aggregation functions like sum(), mean(), median(), etc.
    sum_value = df['Value'].sum()
    mean_value = df['Value'].mean()
    median_value = df['Value'].median()
    
  5. Pandas sum by column and group:

    • Sum values based on specific columns and groups.
    sum_by_group = df.groupby(['Category', 'OtherColumn'])['Value'].sum()
    
  6. Aggregate mean and median in Pandas by column:

    • Calculate both mean and median for a specific column.
    mean_and_median = df.groupby('Category')['Value'].agg(['mean', 'median'])
    
  7. Aggregating data by multiple columns in Pandas:

    • Aggregate data using multiple columns for more granular grouping.
    aggregated_data = df.groupby(['Category', 'OtherColumn']).sum()
    
  8. Custom aggregation functions in Pandas:

    • Define and apply custom aggregation functions.
    def custom_agg_function(values):
        # Custom aggregation logic
        return values.max() - values.min()
    
    custom_aggregation = df.groupby('Category')['Value'].agg(custom_agg_function)
    
  9. Rolling window aggregation in Pandas:

    • Perform rolling window aggregations using the rolling() method.
    rolling_sum = df['Value'].rolling(window=3).sum()