Working With Text Data
Working with text data (also known as string data) is a common task in data science and analytics. Pandas provides robust support for working with text data through the .str accessor, which allows you to apply string methods on Series and Index objects.
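Since the accessor is available on Index objects as well as Series, a short sketch of that case may help. The column labels below are made up for illustration; the point is only that df.columns is an Index and so accepts the same string methods.

```python
import pandas as pd

# Hypothetical messy column labels, invented for this illustration.
frame = pd.DataFrame({' First Name': ['Alice'], 'Last Name ': ['Smith']})

# frame.columns is an Index, so .str methods can be chained on it
# exactly as they would be on a Series.
frame.columns = (
    frame.columns.str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
)

print(list(frame.columns))
```

Cleaning labels this way is often a convenient first step before selecting columns by name.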
Here's a concise tutorial to get you started:
1. Set Up Environment and Libraries:
import pandas as pd
2. Sample DataFrame:
data = {
    'Name': ['Alice Smith', 'Bob Johnson', 'Charlie Brown', 'David Williams'],
    'Email': ['alice@email.com', 'bob@email.com', None, 'david@email.com'],
}
df = pd.DataFrame(data)
print(df)
3. Basic String Operations:
a. Lowercasing:
df['Name'] = df['Name'].str.lower()
print(df)
b. Uppercasing:
df['Name'] = df['Name'].str.upper()
print(df)
c. Title Case:
df['Name'] = df['Name'].str.title()
print(df)
d. String Length:
df['Name Length'] = df['Name'].str.len()
print(df)
4. Splitting and Replacing Strings:
a. Splitting Strings:
# Splitting on space
df['First Name'] = df['Name'].str.split().str[0]
df['Last Name'] = df['Name'].str.split().str[1]
print(df)
b. Replacing Text:
# regex=False treats 'Brown' as a literal string rather than a pattern
df['Name'] = df['Name'].str.replace('Brown', 'Green', regex=False)
print(df)
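The split step above calls .str.split() twice and assumes every name has exactly two words. As a hedged alternative, the sketch below splits once with expand=True, which returns a DataFrame, and uses n=1 so that anything after the first space stays together. The sample names, including the three-word one, are invented for illustration.

```python
import pandas as pd

# Hypothetical names; the last one has more than two words on purpose.
names = pd.Series(['Alice Smith', 'Bob Johnson', 'Mary Ann Lee'])

# n=1 limits the split to the first run of whitespace;
# expand=True spreads the pieces across DataFrame columns.
parts = names.str.split(n=1, expand=True)
parts.columns = ['First Name', 'Rest']

print(parts)
```

Whether the remainder should be treated as a last name depends on the data, so the second column is deliberately given a neutral label here.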
5. Checking for Strings:
a. Contains:
df['Is_Johnson'] = df['Name'].str.contains('Johnson')
print(df)
b. Starts With and Ends With:
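df['Starts_With_D'] = df['Name'].str.startswith('David')
# The matching check for the end of each string
df['Ends_With_son'] = df['Name'].str.endswith('son')
print(df)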
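The checks above run on a column with no missing values. When a column does contain missing entries, as the Email column in the sample data does, .str.contains() returns a missing value for those rows unless told otherwise. A minimal sketch, using an invented Series, of the na and case parameters:

```python
import pandas as pd

# Hypothetical email column with one missing entry.
emails = pd.Series(['alice@email.com', None, 'DAVID@EMAIL.COM'])

# case=False makes the match case-insensitive;
# na=False reports missing entries as "no match" instead of NaN.
mask = emails.str.contains('david', case=False, na=False)

# A fully boolean mask can be used directly for filtering.
matches = emails[mask]
print(matches)
```

Without na=False the mask would carry a missing value, and using it to index the Series would typically raise an error.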
6. Handling Missing Data:
a. Fill Missing Data:
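# Assign the result back; calling fillna(..., inplace=True) on a selected
# column is unreliable in recent pandas versions
df['Email'] = df['Email'].fillna('missing@email.com')
print(df)
b. Check for NaN:
# Run this before filling if you want to flag the originally missing rows;
# after the fill above, every value is False
df['Email_Missing'] = df['Email'].isna()
print(df)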
7. Extracting Substrings:
a. Using Regular Expressions:
df['Domain'] = df['Email'].str.extract(r'@(\w+\.\w+)')
print(df)
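The pattern above uses one unnamed capture group. As a sketch of a related option, named groups in the pattern become column names in the result, which can be handy when extracting more than one piece at a time. The pattern and sample values below are illustrative assumptions, not a general email parser.

```python
import pandas as pd

# Hypothetical email column with one missing entry.
emails = pd.Series(['alice@email.com', None, 'david@email.com'])

# Each named group (?P<name>...) becomes a column in the returned DataFrame.
# Rows that are missing or do not match come back as NaN.
parts = emails.str.extract(r'(?P<user>[^@]+)@(?P<domain>\w+\.\w+)')

print(parts)
```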
8. Stripping Whitespace:
# Remove leading and trailing whitespace from each name
df['Name'] = df['Name'].str.strip()
print(df)
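The sample names contain no stray whitespace, so the effect of stripping is not visible in the output. The sketch below uses invented values with padding to show why it matters for comparisons, along with the one-sided variants .str.lstrip() and .str.rstrip().

```python
import pandas as pd

# Hypothetical values with leading/trailing spaces and a tab.
raw = pd.Series(['  Alice ', 'Bob\t', ' Charlie'])

stripped = raw.str.strip()        # both ends
left_only = raw.str.lstrip()      # leading whitespace only
right_only = raw.str.rstrip()     # trailing whitespace only

# Exact comparisons fail on the padded values and succeed after stripping.
matches_before = (raw == 'Alice').sum()
matches_after = (stripped == 'Alice').sum()
print(matches_before, matches_after)
```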
This is just the tip of the iceberg, and there are many more functionalities provided by the .str accessor in Pandas. The best way to learn is to experiment with various methods and apply them to real-world data scenarios.
It's also worth noting that when dealing with large datasets, some string operations might be slow. In such cases, there are more advanced techniques and tools like Dask or Vaex that can be used to speed up the process.
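Before reaching for a separate tool, one lightweight option that stays within pandas may be worth trying when a column holds many repeated values: apply the string method to the unique values only and map the results back. This is a different technique from the Dask or Vaex route mentioned above, and whether it helps depends on how repetitive the data is. The data below is invented for illustration.

```python
import pandas as pd

# Hypothetical column with few distinct values repeated many times.
cities = pd.Series(['london', 'paris', 'london', 'paris', 'london'] * 1000)

# Run the string operation once per distinct value...
unique_values = cities.unique()
unique_upper = pd.Series(unique_values).str.upper()

# ...then map every row to its precomputed result.
lookup = dict(zip(unique_values, unique_upper))
result = cities.map(lookup)

print(result.head())
```

Here the string method runs on two values instead of five thousand; on a column of mostly unique strings the same approach would bring no benefit.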
Working with strings in Pandas Series:
String methods on a Series are applied through the .str accessor.
import pandas as pd
# Create Series with strings
series = pd.Series(['apple', 'banana', 'cherry'])
# Uppercase the strings
uppercase_series = series.str.upper()
Text data manipulation in Pandas DataFrame:
import pandas as pd
# Create DataFrame with text columns
df = pd.DataFrame({'name': ['John', 'Alice', 'Bob'], 'city': ['New York', 'London', 'Paris']})
# Extract first letter from 'name'
df['first_letter'] = df['name'].str[0]
String methods in Pandas for text analysis:
Methods such as .str.len() and .str.contains() cover basic text analysis.
import pandas as pd
# Create DataFrame with text column
df = pd.DataFrame({'text': ['apple', 'banana', 'cherry']})
# Calculate length of each string
df['length'] = df['text'].str.len()
# Check if 'banana' is present in each string
df['contains_banana'] = df['text'].str.contains('banana')
Cleaning and preprocessing text data with Pandas:
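import pandas as pd
# Create DataFrame with text column
df = pd.DataFrame({'text': ['apple!', ' banana ', 'Cherry.']})
# Remove punctuation and leading/trailing whitespace.
# regex=True is required for the pattern to be treated as a regular
# expression; recent pandas versions default to literal replacement.
df['cleaned_text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True).str.strip()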
Handling missing values in text data using Pandas:
Missing values in a text column can be filled with the .fillna() method.
import pandas as pd
# Create DataFrame with missing values in text column
df = pd.DataFrame({'text': ['apple', None, 'cherry']})
# Fill missing values with a default string
df['text'] = df['text'].fillna('unknown')
Pandas str accessor for text operations:
Use the .str accessor for efficient text operations on Pandas Series.
import pandas as pd
# Create Series with strings
series = pd.Series(['apple', 'banana', 'cherry'])
# Extract first two characters from each string
first_two_chars = series.str[:2]
Extracting information from text columns in Pandas:
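Pieces of a text column can be pulled out with the .str.extract() method.
import pandas as pd
# Create DataFrame with text column
df = pd.DataFrame({'text': ['apple 30', 'banana 25', 'cherry 40']})
# Extract the first run of digits from each string (the result is still text).
# The raw string avoids an invalid-escape warning for \d;
# expand=False returns a Series for a single capture group.
df['numbers'] = df['text'].str.extract(r'(\d+)', expand=False)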
Tokenization and word processing in Pandas:
Sentences can be split into tokens with the .str.split() method.
import pandas as pd
# Create Series with sentences
series = pd.Series(['I love pandas', 'Data analysis is fun', 'Python is great'])
# Tokenize sentences into words (each element becomes a list of words)
words = series.str.split()
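Splitting leaves one list of words per row. A common next step in word processing is to count how often each word occurs, which can be sketched by chaining .explode() and .value_counts() after the split. The sentences below are invented so that some words repeat.

```python
import pandas as pd

# Hypothetical sentences chosen so that 'pandas' and 'is' appear twice.
series = pd.Series(['I love pandas', 'Pandas is fun', 'Python is great'])

# Lowercase first so 'Pandas' and 'pandas' count as the same token,
# split on whitespace, then give each word its own row.
tokens = series.str.lower().str.split().explode()

# Count occurrences of each distinct token.
counts = tokens.value_counts()
print(counts)
```

This whitespace split does not remove punctuation, so real text would usually need the cleaning step shown earlier before counting.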
Regular expressions for text data in Pandas:
import pandas as pd
# Create Series with strings
series = pd.Series(['apple', 'banana', 'cherry'])
# Filter strings starting with 'a' or 'b'
filtered_strings = series[series.str.contains(r'^[ab]')]