Pandas DataFrames: Data Analysis Made Easy¶
Introduction¶
Pandas is a powerful Python library for data manipulation and analysis. It provides DataFrames - table-like structures that make working with structured data intuitive and efficient.
Note
Pandas is built on top of NumPy and is the go-to library for data analysis in Python. Think of it as Excel on steroids, but programmable and much more powerful! 🐼
Installing Pandas¶
pip install pandas
Baby Steps: Creating DataFrames¶
1. From Dictionaries
import pandas as pd
# Create DataFrame from dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [20, 21, 19, 22],
'Grade': [85, 78, 92, 88]
}
df = pd.DataFrame(data)
print(df)
# Name Age Grade
# 0 Alice 20 85
# 1 Bob 21 78
# 2 Charlie 19 92
# 3 Diana 22 88
# Check data types
print(df.dtypes)
# Name object
# Age int64
# Grade int64
# dtype: object
2. From Lists of Lists
import pandas as pd
# Data as list of lists
data = [
['Alice', 20, 85],
['Bob', 21, 78],
['Charlie', 19, 92],
['Diana', 22, 88]
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Grade'])
print(df)
3. From CSV Files
import pandas as pd
# Read from CSV file
df = pd.read_csv('students.csv')
# Save to CSV file
df.to_csv('output.csv', index=False)
Note
CSV (Comma-Separated Values) files are the most common way to store tabular data. Pandas makes reading and writing CSV files incredibly easy!
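read_csv() also takes options that help with messier files. A minimal sketch, assuming the same students.csv and that its columns match the examples above:
import pandas as pd
# Read only some columns, force a dtype, and treat 'N/A' as missing
df = pd.read_csv(
    'students.csv',
    usecols=['Name', 'Grade'],  # load just these columns
    dtype={'Name': 'string'},   # enforce a column type
    na_values=['N/A'],          # extra strings to treat as NaN
)
print(df.head())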
DataFrame Basics¶
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [20, 21, 19, 22, 20],
'Grade': [85, 78, 92, 88, 75],
'City': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi']
}
df = pd.DataFrame(data)
# Basic information
print(df.shape) # (5, 4) - 5 rows, 4 columns
print(df.columns) # Column names
print(df.index) # Row indices
print(len(df)) # Number of rows
# First and last rows
print(df.head(3)) # First 3 rows
print(df.tail(2)) # Last 2 rows
# Statistical summary
print(df.describe()) # Statistics for numerical columns
# Information about DataFrame
df.info() # Data types and memory usage (info() prints directly and returns None, so no print() needed)
Selecting Data¶
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [20, 21, 19, 22],
'Grade': [85, 78, 92, 88],
'City': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore']
}
df = pd.DataFrame(data)
# Select single column (returns Series)
print(df['Name'])
# 0 Alice
# 1 Bob
# 2 Charlie
# 3 Diana
# Name: Name, dtype: object
# Select multiple columns (returns DataFrame)
print(df[['Name', 'Grade']])
# Select rows by index
print(df.iloc[0]) # First row
print(df.iloc[1:3]) # Rows 1 and 2
# Select by label (if index has labels)
print(df.loc[0, 'Name']) # Value at row 0, column 'Name'
# Select specific cells
print(df.at[0, 'Grade']) # 85
Note
iloc uses integer positions (0, 1, 2…), while loc uses labels. For beginners, think of iloc as “integer location” and loc as “label location”.
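The difference only becomes visible once the index is not the default 0, 1, 2…. A small sketch, reusing the df above with Name promoted to the index:
# Label the rows by name so loc and iloc diverge
labeled = df.set_index('Name')
print(labeled.iloc[0])              # first row by position (Alice)
print(labeled.loc['Charlie'])       # row by label
print(labeled.loc['Bob', 'Grade'])  # 78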
Filtering Data¶
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [20, 21, 19, 22, 20],
'Grade': [85, 78, 92, 88, 75],
'City': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi']
}
df = pd.DataFrame(data)
# Filter students with grade >= 80
high_scorers = df[df['Grade'] >= 80]
print(high_scorers)
# Multiple conditions (AND)
young_high_scorers = df[(df['Age'] < 21) & (df['Grade'] >= 85)]
print(young_high_scorers)
# Multiple conditions (OR)
mumbai_or_high = df[(df['City'] == 'Mumbai') | (df['Grade'] >= 90)]
print(mumbai_or_high)
# Filter by string methods
names_with_e = df[df['Name'].str.contains('e')]
print(names_with_e)
Note
Use & for AND, | for OR, and ~ for NOT in conditions. Always use parentheses around each condition!
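~ and isin() are not used above; a quick sketch on the same df:
# NOT: everyone outside Mumbai
not_mumbai = df[~(df['City'] == 'Mumbai')]
print(not_mumbai)
# Membership test instead of chaining ORs
delhi_or_blr = df[df['City'].isin(['Delhi', 'Bangalore'])]
print(delhi_or_blr)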
Adding and Modifying Data¶
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [20, 21, 19],
'Grade': [85, 78, 92]
}
df = pd.DataFrame(data)
# Add new column
df['Pass'] = df['Grade'] >= 80
print(df)
# Add calculated column (grade as a percentage of the best score in the class)
df['Grade_Percent'] = df['Grade'] / df['Grade'].max() * 100
df['Age_Group'] = df['Age'].apply(lambda x: 'Teen' if x < 20 else 'Adult')
# Modify existing column
df['Grade'] = df['Grade'] + 5 # Bonus marks!
# Add new row
new_student = {'Name': 'Diana', 'Age': 22, 'Grade': 88, 'Pass': True}
df = pd.concat([df, pd.DataFrame([new_student])], ignore_index=True)
# Drop column
df = df.drop('Grade_Percent', axis=1)
# Drop row
df = df.drop(0) # Drop first row
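To change only some values in place, select them with loc and a boolean mask; chained indexing like df[df['Name'] == 'Bob']['Grade'] = ... can silently modify a copy instead of the DataFrame. A minimal sketch on the df above:
# Update Bob's grade only, selected by a boolean mask
df.loc[df['Name'] == 'Bob', 'Grade'] += 5
print(df)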
Sorting Data¶
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [20, 21, 19, 22],
'Grade': [85, 78, 92, 88]
}
df = pd.DataFrame(data)
# Sort by single column
df_sorted = df.sort_values('Grade', ascending=False)
print(df_sorted)
# Sort by multiple columns
df_sorted = df.sort_values(['Age', 'Grade'], ascending=[True, False])
print(df_sorted)
# Sort index
df_sorted = df.sort_index()
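Since pandas 1.1, sort_values() also accepts a key function that is applied to the column before sorting, which is handy for things like case-insensitive ordering. A small sketch:
# Sort names case-insensitively (key receives the whole Series)
df_sorted = df.sort_values('Name', key=lambda s: s.str.lower())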
Grouping and Aggregation¶
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
'City': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
'Grade': [85, 78, 92, 88, 75, 82],
'Age': [20, 21, 19, 22, 20, 21]
}
df = pd.DataFrame(data)
# Group by city and calculate mean
city_avg = df.groupby('City')['Grade'].mean()
print(city_avg)
# City
# Delhi 82.666667
# Mumbai 84.000000
# Name: Grade, dtype: float64
# Multiple aggregations
city_stats = df.groupby('City').agg({
'Grade': ['mean', 'min', 'max'],
'Age': 'mean'
})
print(city_stats)
# Count students per city
city_count = df.groupby('City').size()
print(city_count)
Note
groupby() is like SQL’s GROUP BY - it splits data into groups and applies functions to each group independently.
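Named aggregation gives flat, readable column names instead of the two-level columns that the dict form of agg() produces. A sketch on the same df:
# Each keyword becomes an output column: (source_column, function)
city_stats = df.groupby('City').agg(
    avg_grade=('Grade', 'mean'),
    top_grade=('Grade', 'max'),
    students=('Name', 'count'),
)
print(city_stats)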
Real-World Example: Student Performance Dashboard¶
import pandas as pd
import numpy as np
# Create realistic student data
np.random.seed(42)
data = {
'Student_ID': range(1, 21),
'Name': [f'Student_{i}' for i in range(1, 21)],
'Python': np.random.randint(40, 100, 20),
'Java': np.random.randint(40, 100, 20),
'WebDev': np.random.randint(40, 100, 20),
'Database': np.random.randint(40, 100, 20),
'Attendance': np.random.randint(60, 100, 20)
}
df = pd.DataFrame(data)
# Calculate total and average
df['Total'] = df[['Python', 'Java', 'WebDev', 'Database']].sum(axis=1)
df['Average'] = df[['Python', 'Java', 'WebDev', 'Database']].mean(axis=1)
# Assign grades
def assign_grade(avg):
    if avg >= 90: return 'A'
    elif avg >= 75: return 'B'
    elif avg >= 60: return 'C'
    elif avg >= 40: return 'D'
    else: return 'F'
df['Grade'] = df['Average'].apply(assign_grade)
# Find top 5 students
top_5 = df.nlargest(5, 'Average')[['Name', 'Average', 'Grade']]
print("=== Top 5 Students ===")
print(top_5)
# Subject-wise performance
print("\n=== Subject Averages ===")
subjects = ['Python', 'Java', 'WebDev', 'Database']
for subject in subjects:
    avg = df[subject].mean()
    print(f"{subject}: {avg:.2f}")
# Students needing attention
struggling = df[df['Average'] < 60]
print(f"\n=== Students Needing Help: {len(struggling)} ===")
print(struggling[['Name', 'Average', 'Attendance']])
Handling Missing Data¶
import pandas as pd
import numpy as np
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [20, np.nan, 19, 22],
'Grade': [85, 78, np.nan, 88]
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isnull()) # Boolean DataFrame
print(df.isnull().sum()) # Count per column
# Drop rows with any missing values
df_dropped = df.dropna()
# Fill missing values
df_filled = df.fillna(0) # Fill with 0
df_filled = df.fillna(df.mean(numeric_only=True)) # Fill with column means (numeric columns only)
df_filled = df.ffill() # Forward fill (fillna(method='ffill') is deprecated)
df_filled = df.fillna({'Age': 20, 'Grade': 75}) # Different values per column
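dropna() can also be narrowed so that only the columns you care about decide which rows are removed. A sketch on the df above:
# Drop rows only when Grade is missing; a missing Age is tolerated
df_dropped = df.dropna(subset=['Grade'])
# Keep rows that have at least 2 non-missing values
df_thresh = df.dropna(thresh=2)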
Merging and Joining DataFrames¶
import pandas as pd
# Student details
students = pd.DataFrame({
'Student_ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'Diana']
})
# Student grades
grades = pd.DataFrame({
'Student_ID': [1, 2, 3, 4],
'Grade': [85, 78, 92, 88],
'Subject': ['Python', 'Python', 'Python', 'Python']
})
# Merge on Student_ID
merged = pd.merge(students, grades, on='Student_ID')
print(merged)
# Concatenate vertically (stack DataFrames)
more_students = pd.DataFrame({
'Student_ID': [5, 6],
'Name': ['Eve', 'Frank']
})
all_students = pd.concat([students, more_students], ignore_index=True)
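pd.merge() does an inner join by default, so IDs present in only one frame are silently dropped. A sketch of the how= parameter, reusing all_students and grades from above (Eve and Frank have no grades yet):
# Left join: keep every student; missing grades become NaN
merged_left = pd.merge(all_students, grades, on='Student_ID', how='left')
print(merged_left)
# 'outer' keeps rows from both sides; 'right' keeps the grades side
merged_outer = pd.merge(all_students, grades, on='Student_ID', how='outer')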
Complete Example: Life Choices Tracker¶
import pandas as pd
import numpy as np
# Create the infamous Life Choices Tracker
data = {
'Decision': [
'Chose CS',
'Bought Expensive Course',
'Believed I Could Code',
'Started Learning AI',
'Said Yes to Group Project',
'Procrastinated Assignment',
'Pulled All-Nighter',
'Skipped Documentation',
'Used ChatGPT for Assignment',
'Promised to Exercise'
],
'Confidence_Level': [9, 8, 10, 7, 6, 3, 8, 9, 7, 10],
'Regret_Factor': [2, 7, 5, 6, 9, 10, 7, 8, 4, 9],
'Friends_Who_Warned_You': [2, 5, 1, 3, 7, 8, 6, 4, 2, 10],
'Time_Until_Realization': [365, 30, 7, 60, 2, 1, 1, 14, 3, 1]
}
df = pd.DataFrame(data)
print("=== Life Choices Tracker ===\n")
print(df)
# Calculate correlation matrix
print("\n=== Concerning Correlations ===")
correlations = df[['Confidence_Level', 'Regret_Factor',
'Friends_Who_Warned_You', 'Time_Until_Realization']].corr()
print(correlations)
# Key insights
print("\n=== Reality Checks ===")
# Correlation between confidence and regret
conf_regret_corr = df['Confidence_Level'].corr(df['Regret_Factor'])
print(f"Confidence vs Regret correlation: {conf_regret_corr:.2f}")
if conf_regret_corr > 0.3:  # positive correlation: more confidence, more regret
    print("⚠️ Higher confidence = More regret (Dunning-Kruger effect detected!)")
# Decisions with high regret
high_regret = df[df['Regret_Factor'] >= 8]
print(f"\nDecisions you wish you could undo: {len(high_regret)}")
print(high_regret[['Decision', 'Regret_Factor']])
# Average time to reality check
avg_time = df['Time_Until_Realization'].mean()
print(f"\nAverage days until reality hits: {avg_time:.1f} days")
# Find the decision where regret most exceeded confidence
df['Confidence_Regret_Gap'] = df['Confidence_Level'] - df['Regret_Factor']
worst_decision = df.loc[df['Confidence_Regret_Gap'].idxmin()]
print(f"\nWorst decision: {worst_decision['Decision']}")
print(f"Confidence: {worst_decision['Confidence_Level']}, Regret: {worst_decision['Regret_Factor']}")
# Warnings ignored
total_warnings = df['Friends_Who_Warned_You'].sum()
print(f"\nTotal friend warnings ignored: {total_warnings}")
print("(Maybe listen to your friends? Just a thought.)")
Advanced: String Operations¶
import pandas as pd
data = {
'Name': ['Alice Smith', 'Bob JONES', 'charlie brown'],
'Email': ['alice@email.com', 'BOB@EMAIL.COM', 'charlie@email.com']
}
df = pd.DataFrame(data)
# String methods
df['Name_Upper'] = df['Name'].str.upper()
df['Name_Lower'] = df['Name'].str.lower()
df['First_Name'] = df['Name'].str.split().str[0]
df['Email_Domain'] = df['Email'].str.split('@').str[1]
df['Name_Length'] = df['Name'].str.len()
print(df)
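For patterns that split() can't express, str.extract() pulls out regex capture groups. A small sketch on the same df:
# One capture group with expand=False returns a Series
df['Email_User'] = df['Email'].str.extract(r'^([^@]+)@', expand=False)
print(df[['Email', 'Email_User']])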
Tasks¶
Task 1: Personal Life Choices Analyzer
Create a DataFrame called LifeChoicesTracker with at least 10 rows containing: Decision, Confidence_Level (1-10), Regret_Factor (1-10), Friends_Who_Warned_You, Time_Until_Realization (days). Calculate: (a) Correlation between all numerical columns, (b) Average regret factor, (c) Decisions where Confidence > 7 but Regret > 7, (d) Total warnings ignored.
Hint: Use df.corr(numeric_only=True) for the correlation matrix (Decision is a text column). Use filtering: df[(df['Confidence_Level'] > 7) & (df['Regret_Factor'] > 7)].
Task 2: Student Grade Management System
Create a DataFrame with 15 students having: Name, Roll_No, Python, Java, WebDev, Database marks. Calculate: (a) Total and Average for each student, (b) Assign grades (A/B/C/D/F), (c) Find top 3 students, (d) Calculate subject-wise averages, (e) Find students below 60% average.
Hint: Add columns using df['Total'] = df[['Python', 'Java', 'WebDev', 'Database']].sum(axis=1). Use nlargest() for top students.
Task 3: Time Tracking Reality Check
Create a DataFrame with activities: Activity, Claimed_Hours, Actual_Hours for activities like Studying, Gaming, Social Media, Sleep, Productive Procrastination. Calculate: (a) Self-deception margin (Claimed - Actual), (b) Total claimed vs actual hours, (c) Activities where Claimed > Actual, (d) Biggest self-deception.
Hint: Add column: df['Deception'] = df['Claimed_Hours'] - df['Actual_Hours']. Use idxmax() to find biggest deception.
Task 4: Multi-City Student Performance
Create a DataFrame with 20 students from 4 cities with Grade and Attendance. Use groupby() to calculate: (a) Average grade per city, (b) Average attendance per city, (c) City with highest performance, (d) Count of students per city, (e) Find city where attendance is high but grades are low.
Hint: Use groupby('City').agg({'Grade': 'mean', 'Attendance': 'mean'}). Compare rankings of cities by grade vs attendance.
Task 5: Semester Progress Tracker
Create a DataFrame with Week (1-18), Study_Hours, Assignment_Completion (%), Motivation_Level (1-10), Coffee_Consumed (cups). Analyze: (a) Correlation between study hours and assignment completion, (b) Week with highest/lowest motivation, (c) Average coffee consumption, (d) Weeks where motivation < 5, (e) Create a “Productivity Score” = (Study_Hours × Assignment_Completion) / (Coffee_Consumed + 1).
Hint: Use df['Week'].loc[df['Motivation_Level'].idxmin()] to find week with lowest motivation. Calculate correlation: df['Study_Hours'].corr(df['Assignment_Completion']).
Summary¶
Pandas DataFrames are table-like structures for data analysis
Create DataFrames from dictionaries, lists, or CSV files
Select data using [], iloc, and loc indexing
Filter data using boolean conditions
Add/modify columns and rows easily
Sort data using sort_values()
Group data using groupby() for aggregations
Handle missing data with dropna() and fillna()
Merge DataFrames using merge() and concat()
Calculate correlations using corr()
Pandas builds on NumPy and is essential for data analysis