Pandas DataFrames: Data Analysis Made Easy

Introduction

Pandas is a powerful Python library for data manipulation and analysis. It provides DataFrames - table-like structures that make working with structured data intuitive and efficient.

Note

Pandas is built on top of NumPy and is the go-to library for data analysis in Python. Think of it as Excel on steroids, but programmable and much more powerful! 🐼


Installing Pandas

pip install pandas
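To confirm the install worked, a quick sanity check is to print the installed version (any recent version is fine for this tutorial):

import pandas as pd

# Prints the installed version, e.g. 2.x.x
print(pd.__version__)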

Baby Steps: Creating DataFrames

1. From Dictionaries

import pandas as pd

# Create DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [20, 21, 19, 22],
    'Grade': [85, 78, 92, 88]
}

df = pd.DataFrame(data)
print(df)
#       Name  Age  Grade
# 0    Alice   20     85
# 1      Bob   21     78
# 2  Charlie   19     92
# 3    Diana   22     88

# Check data types
print(df.dtypes)
# Name     object
# Age       int64
# Grade     int64
# dtype: object

2. From Lists of Lists

import pandas as pd

# Data as list of lists
data = [
    ['Alice', 20, 85],
    ['Bob', 21, 78],
    ['Charlie', 19, 92],
    ['Diana', 22, 88]
]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Grade'])
print(df)

3. From CSV Files

import pandas as pd

# Read from CSV file
df = pd.read_csv('students.csv')

# Save to CSV file
df.to_csv('output.csv', index=False)

Note

CSV (Comma-Separated Values) files are the most common way to store tabular data. Pandas makes reading and writing CSV files incredibly easy!
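read_csv() also takes many optional parameters. Here is a small sketch of a few common ones (the file names are made up for illustration):

import pandas as pd

# A few commonly used read_csv options:
df = pd.read_csv('students.tsv', sep='\t')                   # Tab-separated file
df = pd.read_csv('students.csv', usecols=['Name', 'Grade'])  # Load only some columns
df = pd.read_csv('students.csv', nrows=100)                  # Read just the first 100 rows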


DataFrame Basics

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [20, 21, 19, 22, 20],
    'Grade': [85, 78, 92, 88, 75],
    'City': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi']
}

df = pd.DataFrame(data)

# Basic information
print(df.shape)          # (5, 4) - 5 rows, 4 columns
print(df.columns)        # Column names
print(df.index)          # Row indices
print(len(df))           # Number of rows

# First and last rows
print(df.head(3))        # First 3 rows
print(df.tail(2))        # Last 2 rows

# Statistical summary
print(df.describe())     # Statistics for numerical columns

# Information about DataFrame
df.info()                # Data types and memory usage (prints directly; returns None, so no print() needed)
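One more inspection tool worth knowing early: value_counts() tallies how often each value appears in a column. A quick sketch:

import pandas as pd

df = pd.DataFrame({'City': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi']})

# How many students per city, most common first
print(df['City'].value_counts())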

Selecting Data

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [20, 21, 19, 22],
    'Grade': [85, 78, 92, 88],
    'City': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore']
}

df = pd.DataFrame(data)

# Select single column (returns Series)
print(df['Name'])
# 0      Alice
# 1        Bob
# 2    Charlie
# 3      Diana

# Select multiple columns (returns DataFrame)
print(df[['Name', 'Grade']])

# Select rows by index
print(df.iloc[0])        # First row
print(df.iloc[1:3])      # Rows 1 and 2

# Select by label (if index has labels)
print(df.loc[0, 'Name']) # Value at row 0, column 'Name'

# Select specific cells
print(df.at[0, 'Grade']) # 85

Note

iloc uses integer positions (0, 1, 2…), while loc uses labels. For beginners, think of iloc as “integer location” and loc as “label location”.
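The difference becomes obvious once the index holds labels instead of the default 0, 1, 2… A minimal sketch (the row labels s1-s4 are made up for illustration):

import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie', 'Diana'], 'Grade': [85, 78, 92, 88]},
    index=['s1', 's2', 's3', 's4']   # Hypothetical student IDs as row labels
)

print(df.iloc[0])        # First row by position -> Alice
print(df.loc['s1'])      # Same row, but by label -> Alice
# print(df.loc[0])       # Would raise KeyError: 0 is not a label in this index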


Filtering Data

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [20, 21, 19, 22, 20],
    'Grade': [85, 78, 92, 88, 75],
    'City': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi']
}

df = pd.DataFrame(data)

# Filter students with grade >= 80
high_scorers = df[df['Grade'] >= 80]
print(high_scorers)

# Multiple conditions (AND)
young_high_scorers = df[(df['Age'] < 21) & (df['Grade'] >= 85)]
print(young_high_scorers)

# Multiple conditions (OR)
mumbai_or_high = df[(df['City'] == 'Mumbai') | (df['Grade'] >= 90)]
print(mumbai_or_high)

# Filter by string methods
names_with_e = df[df['Name'].str.contains('e')]
print(names_with_e)

Note

Use & for AND, | for OR, and ~ for NOT in conditions. Always use parentheses around each condition!
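The ~ operator from the note above deserves its own example, along with isin(), which is often cleaner than chaining ORs. A small sketch reusing this section's data:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Grade': [85, 78, 92, 88, 75],
    'City': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi']
}
df = pd.DataFrame(data)

# NOT: everyone who is *not* from Mumbai
print(df[~(df['City'] == 'Mumbai')])

# isin(): membership test instead of (city == A) | (city == B)
print(df[df['City'].isin(['Delhi', 'Bangalore'])])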


Adding and Modifying Data

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [20, 21, 19],
    'Grade': [85, 78, 92]
}

df = pd.DataFrame(data)

# Add new column
df['Pass'] = df['Grade'] >= 80
print(df)

# Add calculated column
df['Grade_Fraction'] = df['Grade'] / 100  # e.g. 85 -> 0.85
df['Age_Group'] = df['Age'].apply(lambda x: 'Teen' if x < 20 else 'Adult')

# Modify existing column
df['Grade'] = df['Grade'] + 5  # Bonus marks!

# Add new row
new_student = {'Name': 'Diana', 'Age': 22, 'Grade': 88, 'Pass': True}
df = pd.concat([df, pd.DataFrame([new_student])], ignore_index=True)
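# Columns the dict doesn't supply (Grade_Fraction, Age_Group) become NaN for Diana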

# Drop column
df = df.drop('Grade_Fraction', axis=1)

# Drop row
df = df.drop(0)  # Drop first row
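To change values only where a condition holds, a single assignment through df.loc is the idiomatic route (and sidesteps chained-indexing warnings). A minimal sketch:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Grade': [85, 58]})

# Give only failing students (< 60) five grace marks, in one .loc assignment
df.loc[df['Grade'] < 60, 'Grade'] += 5
print(df)   # Bob: 58 -> 63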

Sorting Data

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [20, 21, 19, 22],
    'Grade': [85, 78, 92, 88]
}

df = pd.DataFrame(data)

# Sort by single column
df_sorted = df.sort_values('Grade', ascending=False)
print(df_sorted)

# Sort by multiple columns
df_sorted = df.sort_values(['Age', 'Grade'], ascending=[True, False])
print(df_sorted)

# Sort index
df_sorted = df.sort_index()
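After sorting, rows keep their original index labels. If you want a fresh 0..n-1 numbering, reset the index:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Grade': [85, 78, 92]})

# Sorting alone would keep the old labels (2, 0, 1); reset_index renumbers them
df_sorted = df.sort_values('Grade', ascending=False).reset_index(drop=True)
print(df_sorted)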

Grouping and Aggregation

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'City': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'Grade': [85, 78, 92, 88, 75, 82],
    'Age': [20, 21, 19, 22, 20, 21]
}

df = pd.DataFrame(data)

# Group by city and calculate mean
city_avg = df.groupby('City')['Grade'].mean()
print(city_avg)
# City
# Delhi     82.666667
# Mumbai    84.000000

# Multiple aggregations
city_stats = df.groupby('City').agg({
    'Grade': ['mean', 'min', 'max'],
    'Age': 'mean'
})
print(city_stats)

# Count students per city
city_count = df.groupby('City').size()
print(city_count)

Note

groupby() is like SQL’s GROUP BY - it splits data into groups and applies functions to each group independently.
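For readable output column names, pandas also supports "named aggregation", where you choose the result column names yourself. A sketch on the same kind of data:

import pandas as pd

data = {
    'City': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'Grade': [85, 78, 92, 88]
}
df = pd.DataFrame(data)

# Named aggregation: output_column=(input_column, function)
city_stats = df.groupby('City').agg(
    avg_grade=('Grade', 'mean'),
    best_grade=('Grade', 'max'),
)
print(city_stats)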


Real-World Example: Student Performance Dashboard

import pandas as pd
import numpy as np

# Create realistic student data
np.random.seed(42)

data = {
    'Student_ID': range(1, 21),
    'Name': [f'Student_{i}' for i in range(1, 21)],
    'Python': np.random.randint(40, 100, 20),
    'Java': np.random.randint(40, 100, 20),
    'WebDev': np.random.randint(40, 100, 20),
    'Database': np.random.randint(40, 100, 20),
    'Attendance': np.random.randint(60, 100, 20)
}

df = pd.DataFrame(data)

# Calculate total and average
df['Total'] = df[['Python', 'Java', 'WebDev', 'Database']].sum(axis=1)
df['Average'] = df[['Python', 'Java', 'WebDev', 'Database']].mean(axis=1)

# Assign grades
def assign_grade(avg):
    if avg >= 90:
        return 'A'
    elif avg >= 75:
        return 'B'
    elif avg >= 60:
        return 'C'
    elif avg >= 40:
        return 'D'
    else:
        return 'F'

df['Grade'] = df['Average'].apply(assign_grade)

# Find top 5 students
top_5 = df.nlargest(5, 'Average')[['Name', 'Average', 'Grade']]
print("=== Top 5 Students ===")
print(top_5)

# Subject-wise performance
print("\n=== Subject Averages ===")
subjects = ['Python', 'Java', 'WebDev', 'Database']
for subject in subjects:
    avg = df[subject].mean()
    print(f"{subject}: {avg:.2f}")

# Students needing attention
struggling = df[df['Average'] < 60]
print(f"\n=== Students Needing Help: {len(struggling)} ===")
print(struggling[['Name', 'Average', 'Attendance']])

Handling Missing Data

import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [20, np.nan, 19, 22],
    'Grade': [85, 78, np.nan, 88]
}

df = pd.DataFrame(data)

# Check for missing values
print(df.isnull())       # Boolean DataFrame
print(df.isnull().sum()) # Count per column

# Drop rows with any missing values
df_dropped = df.dropna()

# Fill missing values
df_filled = df.fillna(0)                    # Fill with 0
df_filled = df.fillna(df.mean(numeric_only=True))  # Fill with column means (skip text columns)
df_filled = df.ffill()                      # Forward fill (fillna(method='ffill') is deprecated)
df_filled = df.fillna({'Age': 20, 'Grade': 75})  # Different values per column
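Two more patterns worth knowing: dropping rows only when a specific column is missing, and filling with a robust statistic like the median. A minimal sketch on the same df:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [20, np.nan, 19],
    'Grade': [85, 78, np.nan]
})

# Drop a row only if 'Grade' is missing (NaNs in other columns are tolerated)
print(df.dropna(subset=['Grade']))

# Median is less sensitive to outliers than the mean
print(df.fillna(df.median(numeric_only=True)))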

Merging and Joining DataFrames

import pandas as pd

# Student details
students = pd.DataFrame({
    'Student_ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana']
})

# Student grades
grades = pd.DataFrame({
    'Student_ID': [1, 2, 3, 4],
    'Grade': [85, 78, 92, 88],
    'Subject': ['Python', 'Python', 'Python', 'Python']
})

# Merge on Student_ID
merged = pd.merge(students, grades, on='Student_ID')
print(merged)

# Concatenate vertically (stack DataFrames)
more_students = pd.DataFrame({
    'Student_ID': [5, 6],
    'Name': ['Eve', 'Frank']
})
all_students = pd.concat([students, more_students], ignore_index=True)
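By default, merge() keeps only IDs present in both frames (an inner join). The how parameter changes that; a sketch where one student has no grade yet:

import pandas as pd

students = pd.DataFrame({'Student_ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
grades = pd.DataFrame({'Student_ID': [1, 2], 'Grade': [85, 78]})

# 'left' keeps every student; Charlie's missing Grade becomes NaN
merged = pd.merge(students, grades, on='Student_ID', how='left')
print(merged)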

Complete Example: Life Choices Tracker

import pandas as pd
import numpy as np

# Create the infamous Life Choices Tracker
data = {
    'Decision': [
        'Chose CS',
        'Bought Expensive Course',
        'Believed I Could Code',
        'Started Learning AI',
        'Said Yes to Group Project',
        'Procrastinated Assignment',
        'Pulled All-Nighter',
        'Skipped Documentation',
        'Used ChatGPT for Assignment',
        'Promised to Exercise'
    ],
    'Confidence_Level': [9, 8, 10, 7, 6, 3, 8, 9, 7, 10],
    'Regret_Factor': [2, 7, 5, 6, 9, 10, 7, 8, 4, 9],
    'Friends_Who_Warned_You': [2, 5, 1, 3, 7, 8, 6, 4, 2, 10],
    'Time_Until_Realization': [365, 30, 7, 60, 2, 1, 1, 14, 3, 1]
}

df = pd.DataFrame(data)

print("=== Life Choices Tracker ===\n")
print(df)

# Calculate correlation matrix
print("\n=== Concerning Correlations ===")
correlations = df[['Confidence_Level', 'Regret_Factor',
                    'Friends_Who_Warned_You', 'Time_Until_Realization']].corr()
print(correlations)

# Key insights
print("\n=== Reality Checks ===")

# Correlation between confidence and regret
conf_regret_corr = df['Confidence_Level'].corr(df['Regret_Factor'])
print(f"Confidence vs Regret correlation: {conf_regret_corr:.2f}")
if conf_regret_corr > 0.3:
    print("⚠️  Higher confidence = More regret (Dunning-Kruger effect detected!)")

# Decisions with high regret
high_regret = df[df['Regret_Factor'] >= 8]
print(f"\nDecisions you wish you could undo: {len(high_regret)}")
print(high_regret[['Decision', 'Regret_Factor']])

# Average time to reality check
avg_time = df['Time_Until_Realization'].mean()
print(f"\nAverage days until reality hits: {avg_time:.1f} days")

# Find the decision where regret most outstripped confidence (most negative gap)
df['Confidence_Regret_Gap'] = df['Confidence_Level'] - df['Regret_Factor']
worst_decision = df.loc[df['Confidence_Regret_Gap'].idxmin()]
print(f"\nWorst decision: {worst_decision['Decision']}")
print(f"Confidence: {worst_decision['Confidence_Level']}, Regret: {worst_decision['Regret_Factor']}")

# Warnings ignored
total_warnings = df['Friends_Who_Warned_You'].sum()
print(f"\nTotal friend warnings ignored: {total_warnings}")
print("(Maybe listen to your friends? Just a thought.)")

Advanced: String Operations

import pandas as pd

data = {
    'Name': ['Alice Smith', 'Bob JONES', 'charlie brown'],
    'Email': ['alice@email.com', 'BOB@EMAIL.COM', 'charlie@email.com']
}

df = pd.DataFrame(data)

# String methods
df['Name_Upper'] = df['Name'].str.upper()
df['Name_Lower'] = df['Name'].str.lower()
df['First_Name'] = df['Name'].str.split().str[0]
df['Email_Domain'] = df['Email'].str.split('@').str[1]
df['Name_Length'] = df['Name'].str.len()

print(df)
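A few more everyday .str helpers, all standard pandas string methods:

import pandas as pd

df = pd.DataFrame({'Name': ['  alice smith ', 'BOB JONES']})

df['Clean'] = df['Name'].str.strip()               # Trim surrounding whitespace
df['Title'] = df['Clean'].str.title()              # 'alice smith' -> 'Alice Smith'
df['NoSpace'] = df['Clean'].str.replace(' ', '_')  # Replace characters
print(df)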

Tasks

Task 1: Personal Life Choices Analyzer

Create a DataFrame called LifeChoicesTracker with at least 10 rows containing: Decision, Confidence_Level (1-10), Regret_Factor (1-10), Friends_Who_Warned_You, Time_Until_Realization (days). Calculate: (a) Correlation between all numerical columns, (b) Average regret factor, (c) Decisions where Confidence > 7 but Regret > 7, (d) Total warnings ignored.

Hint: Use df.corr(numeric_only=True) for the correlation matrix (plain df.corr() fails on the text column in recent pandas). Use filtering: df[(df['Confidence_Level'] > 7) & (df['Regret_Factor'] > 7)].

Task 2: Student Grade Management System

Create a DataFrame with 15 students having: Name, Roll_No, Python, Java, WebDev, Database marks. Calculate: (a) Total and Average for each student, (b) Assign grades (A/B/C/D/F), (c) Find top 3 students, (d) Calculate subject-wise averages, (e) Find students below 60% average.

Hint: Add columns using df['Total'] = df[['Python', 'Java', 'WebDev', 'Database']].sum(axis=1). Use nlargest() for top students.

Task 3: Time Tracking Reality Check

Create a DataFrame with activities: Activity, Claimed_Hours, Actual_Hours for activities like Studying, Gaming, Social Media, Sleep, Productive Procrastination. Calculate: (a) Self-deception margin (Claimed - Actual), (b) Total claimed vs actual hours, (c) Activities where Claimed > Actual, (d) Biggest self-deception.

Hint: Add column: df['Deception'] = df['Claimed_Hours'] - df['Actual_Hours']. Use idxmax() to find biggest deception.

Task 4: Multi-City Student Performance

Create a DataFrame with 20 students from 4 cities with Grade and Attendance. Use groupby() to calculate: (a) Average grade per city, (b) Average attendance per city, (c) City with highest performance, (d) Count of students per city, (e) Find city where attendance is high but grades are low.

Hint: Use groupby('City').agg({'Grade': 'mean', 'Attendance': 'mean'}). Compare rankings of cities by grade vs attendance.

Task 5: Semester Progress Tracker

Create a DataFrame with Week (1-18), Study_Hours, Assignment_Completion (%), Motivation_Level (1-10), Coffee_Consumed (cups). Analyze: (a) Correlation between study hours and assignment completion, (b) Week with highest/lowest motivation, (c) Average coffee consumption, (d) Weeks where motivation < 5, (e) Create a “Productivity Score” = (Study_Hours × Assignment_Completion) / (Coffee_Consumed + 1).

Hint: Use df['Week'].loc[df['Motivation_Level'].idxmin()] to find week with lowest motivation. Calculate correlation: df['Study_Hours'].corr(df['Assignment_Completion']).


Summary

  • Pandas DataFrames are table-like structures for data analysis

  • Create DataFrames from dictionaries, lists, or CSV files

  • Select data using [], iloc, loc indexing

  • Filter data using boolean conditions

  • Add/modify columns and rows easily

  • Sort data using sort_values()

  • Group data using groupby() for aggregations

  • Handle missing data with dropna(), fillna()

  • Merge DataFrames using merge() and concat()

  • Calculate correlations using corr()

  • Pandas builds on NumPy and is essential for data analysis