.. _module5_pandas_dataframes:

Pandas DataFrames: Data Analysis Made Easy
===========================================

Introduction
------------

Pandas is a powerful Python library for data manipulation and analysis. It
provides DataFrames - table-like structures that make working with structured
data intuitive and efficient.

.. note::

   Pandas is built on top of NumPy and is the go-to library for data analysis
   in Python. Think of it as Excel on steroids, but programmable and much more
   powerful! 🐼

--------------

Installing Pandas
-----------------

.. code-block:: bash

   pip install pandas

--------------

Baby Steps: Creating DataFrames
--------------------------------

**1. From Dictionaries**

.. code-block:: python

   import pandas as pd

   # Create DataFrame from dictionary
   data = {
       'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
       'Age': [20, 21, 19, 22],
       'Grade': [85, 78, 92, 88]
   }
   df = pd.DataFrame(data)
   print(df)
   #       Name  Age  Grade
   # 0    Alice   20     85
   # 1      Bob   21     78
   # 2  Charlie   19     92
   # 3    Diana   22     88

   # Check data types
   print(df.dtypes)
   # Name     object
   # Age       int64
   # Grade     int64

**2. From Lists of Lists**

.. code-block:: python

   import pandas as pd

   # Data as list of lists
   data = [
       ['Alice', 20, 85],
       ['Bob', 21, 78],
       ['Charlie', 19, 92],
       ['Diana', 22, 88]
   ]
   df = pd.DataFrame(data, columns=['Name', 'Age', 'Grade'])
   print(df)

**3. From CSV Files**

.. code-block:: python

   import pandas as pd

   # Read from CSV file
   df = pd.read_csv('students.csv')

   # Save to CSV file
   df.to_csv('output.csv', index=False)

.. note::

   CSV (Comma-Separated Values) files are the most common way to store tabular
   data. Pandas makes reading and writing CSV files incredibly easy!

--------------

DataFrame Basics
----------------

.. code-block:: python

   import pandas as pd

   data = {
       'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
       'Age': [20, 21, 19, 22, 20],
       'Grade': [85, 78, 92, 88, 75],
       'City': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi']
   }
   df = pd.DataFrame(data)

   # Basic information
   print(df.shape)      # (5, 4) - 5 rows, 4 columns
   print(df.columns)    # Column names
   print(df.index)      # Row indices
   print(len(df))       # Number of rows

   # First and last rows
   print(df.head(3))    # First 3 rows
   print(df.tail(2))    # Last 2 rows

   # Statistical summary
   print(df.describe())  # Statistics for numerical columns

   # Information about DataFrame
   df.info()  # Data types and memory usage (info() prints directly and returns None)

--------------

Selecting Data
--------------

.. code-block:: python

   import pandas as pd

   data = {
       'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
       'Age': [20, 21, 19, 22],
       'Grade': [85, 78, 92, 88],
       'City': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore']
   }
   df = pd.DataFrame(data)

   # Select single column (returns Series)
   print(df['Name'])
   # 0      Alice
   # 1        Bob
   # 2    Charlie
   # 3      Diana

   # Select multiple columns (returns DataFrame)
   print(df[['Name', 'Grade']])

   # Select rows by integer position
   print(df.iloc[0])     # First row
   print(df.iloc[1:3])   # Rows 1 and 2

   # Select by label (here the default integer labels 0-3)
   print(df.loc[0, 'Name'])   # Value at row label 0, column 'Name'

   # Select specific cells
   print(df.at[0, 'Grade'])   # 85

.. note::

   ``iloc`` uses integer positions (0, 1, 2...), while ``loc`` uses labels. For
   beginners, think of ``iloc`` as "integer location" and ``loc`` as "label
   location".
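The difference between ``loc`` and ``iloc`` is easiest to see once the index is
no longer the default 0, 1, 2... Here is a minimal sketch using made-up string
labels (``'s1'``, ``'s2'``, ``'s3'`` are invented for illustration):

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame(
       {'Name': ['Alice', 'Bob', 'Charlie'], 'Grade': [85, 78, 92]},
       index=['s1', 's2', 's3']   # custom labels instead of 0, 1, 2
   )

   print(df.loc['s2', 'Grade'])   # 78 - selected by label
   print(df.iloc[1]['Grade'])     # 78 - same row, selected by position

With a custom index like this, ``df.loc[1]`` would raise a ``KeyError`` because
there is no row *labelled* 1, even though there is a row at *position* 1.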
--------------

Filtering Data
--------------

.. code-block:: python

   import pandas as pd

   data = {
       'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
       'Age': [20, 21, 19, 22, 20],
       'Grade': [85, 78, 92, 88, 75],
       'City': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi']
   }
   df = pd.DataFrame(data)

   # Filter students with grade >= 80
   high_scorers = df[df['Grade'] >= 80]
   print(high_scorers)

   # Multiple conditions (AND)
   young_high_scorers = df[(df['Age'] < 21) & (df['Grade'] >= 85)]
   print(young_high_scorers)

   # Multiple conditions (OR)
   mumbai_or_high = df[(df['City'] == 'Mumbai') | (df['Grade'] >= 90)]
   print(mumbai_or_high)

   # Filter by string methods
   names_with_e = df[df['Name'].str.contains('e')]
   print(names_with_e)

.. note::

   Use ``&`` for AND, ``|`` for OR, and ``~`` for NOT in conditions. Always use
   parentheses around each condition!

--------------

Adding and Modifying Data
--------------------------

.. code-block:: python

   import pandas as pd

   data = {
       'Name': ['Alice', 'Bob', 'Charlie'],
       'Age': [20, 21, 19],
       'Grade': [85, 78, 92]
   }
   df = pd.DataFrame(data)

   # Add new column
   df['Pass'] = df['Grade'] >= 80
   print(df)

   # Add calculated columns
   df['Grade_Percent'] = df['Grade'] / df['Grade'].max() * 100  # Score relative to the top scorer
   df['Age_Group'] = df['Age'].apply(lambda x: 'Teen' if x < 20 else 'Adult')

   # Modify existing column
   df['Grade'] = df['Grade'] + 5  # Bonus marks!

   # Add new row (columns missing from the dict become NaN)
   new_student = {'Name': 'Diana', 'Age': 22, 'Grade': 88, 'Pass': True}
   df = pd.concat([df, pd.DataFrame([new_student])], ignore_index=True)

   # Drop column
   df = df.drop('Grade_Percent', axis=1)

   # Drop row
   df = df.drop(0)  # Drop the row with index label 0 (the first row)

--------------

Sorting Data
------------

.. code-block:: python

   import pandas as pd

   data = {
       'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
       'Age': [20, 21, 19, 22],
       'Grade': [85, 78, 92, 88]
   }
   df = pd.DataFrame(data)

   # Sort by single column
   df_sorted = df.sort_values('Grade', ascending=False)
   print(df_sorted)

   # Sort by multiple columns
   df_sorted = df.sort_values(['Age', 'Grade'], ascending=[True, False])
   print(df_sorted)

   # Sort index
   df_sorted = df.sort_index()

--------------

Grouping and Aggregation
-------------------------

.. code-block:: python

   import pandas as pd

   data = {
       'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
       'City': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
       'Grade': [85, 78, 92, 88, 75, 82],
       'Age': [20, 21, 19, 22, 20, 21]
   }
   df = pd.DataFrame(data)

   # Group by city and calculate mean
   city_avg = df.groupby('City')['Grade'].mean()
   print(city_avg)
   # City
   # Delhi     82.666667
   # Mumbai    84.000000

   # Multiple aggregations
   city_stats = df.groupby('City').agg({
       'Grade': ['mean', 'min', 'max'],
       'Age': 'mean'
   })
   print(city_stats)

   # Count students per city
   city_count = df.groupby('City').size()
   print(city_count)

.. note::

   ``groupby()`` is like SQL's GROUP BY - it splits data into groups and applies
   functions to each group independently.
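The dictionary-style ``agg`` above produces a two-level column header. If you
prefer flat column names, pandas also supports *named aggregation*; here is a
short sketch reusing the same data (the result column names ``Avg_Grade``,
``Best_Grade``, and ``Students`` are made up for illustration):

.. code-block:: python

   import pandas as pd

   data = {
       'City': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
       'Grade': [85, 78, 92, 88, 75, 82]
   }
   df = pd.DataFrame(data)

   # Named aggregation: new_column=(source_column, function)
   city_stats = df.groupby('City').agg(
       Avg_Grade=('Grade', 'mean'),
       Best_Grade=('Grade', 'max'),
       Students=('Grade', 'size')
   )
   print(city_stats)
   #         Avg_Grade  Best_Grade  Students
   # City
   # Delhi   82.666667          88         3
   # Mumbai  84.000000          92         3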
--------------

Real-World Example: Student Performance Dashboard
--------------------------------------------------

.. code-block:: python

   import pandas as pd
   import numpy as np

   # Create realistic student data
   np.random.seed(42)

   data = {
       'Student_ID': range(1, 21),
       'Name': [f'Student_{i}' for i in range(1, 21)],
       'Python': np.random.randint(40, 100, 20),
       'Java': np.random.randint(40, 100, 20),
       'WebDev': np.random.randint(40, 100, 20),
       'Database': np.random.randint(40, 100, 20),
       'Attendance': np.random.randint(60, 100, 20)
   }
   df = pd.DataFrame(data)

   # Calculate total and average
   df['Total'] = df[['Python', 'Java', 'WebDev', 'Database']].sum(axis=1)
   df['Average'] = df[['Python', 'Java', 'WebDev', 'Database']].mean(axis=1)

   # Assign grades
   def assign_grade(avg):
       if avg >= 90:
           return 'A'
       elif avg >= 75:
           return 'B'
       elif avg >= 60:
           return 'C'
       elif avg >= 40:
           return 'D'
       else:
           return 'F'

   df['Grade'] = df['Average'].apply(assign_grade)

   # Find top 5 students
   top_5 = df.nlargest(5, 'Average')[['Name', 'Average', 'Grade']]
   print("=== Top 5 Students ===")
   print(top_5)

   # Subject-wise performance
   print("\n=== Subject Averages ===")
   subjects = ['Python', 'Java', 'WebDev', 'Database']
   for subject in subjects:
       avg = df[subject].mean()
       print(f"{subject}: {avg:.2f}")

   # Students needing attention
   struggling = df[df['Average'] < 60]
   print(f"\n=== Students Needing Help: {len(struggling)} ===")
   print(struggling[['Name', 'Average', 'Attendance']])

--------------

Handling Missing Data
---------------------

.. code-block:: python

   import pandas as pd
   import numpy as np

   data = {
       'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
       'Age': [20, np.nan, 19, 22],
       'Grade': [85, 78, np.nan, 88]
   }
   df = pd.DataFrame(data)

   # Check for missing values
   print(df.isnull())         # Boolean DataFrame
   print(df.isnull().sum())   # Count per column

   # Drop rows with any missing values
   df_dropped = df.dropna()

   # Fill missing values
   df_filled = df.fillna(0)                            # Fill with 0
   df_filled = df.fillna(df.mean(numeric_only=True))   # Fill with column mean
   df_filled = df.ffill()                              # Forward fill
   df_filled = df.fillna({'Age': 20, 'Grade': 75})     # Different values per column

--------------

Merging and Joining DataFrames
-------------------------------

.. code-block:: python

   import pandas as pd

   # Student details
   students = pd.DataFrame({
       'Student_ID': [1, 2, 3, 4],
       'Name': ['Alice', 'Bob', 'Charlie', 'Diana']
   })

   # Student grades
   grades = pd.DataFrame({
       'Student_ID': [1, 2, 3, 4],
       'Grade': [85, 78, 92, 88],
       'Subject': ['Python', 'Python', 'Python', 'Python']
   })

   # Merge on Student_ID
   merged = pd.merge(students, grades, on='Student_ID')
   print(merged)

   # Concatenate vertically (stack DataFrames)
   more_students = pd.DataFrame({
       'Student_ID': [5, 6],
       'Name': ['Eve', 'Frank']
   })
   all_students = pd.concat([students, more_students], ignore_index=True)
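``merge()`` performs an inner join by default, keeping only IDs present in both
DataFrames. A brief sketch of a left join instead (the unmatched ``Student_ID``
5 here is invented for illustration) - every student is kept, and missing
grades become ``NaN``:

.. code-block:: python

   import pandas as pd

   students = pd.DataFrame({
       'Student_ID': [1, 2, 5],
       'Name': ['Alice', 'Bob', 'Eve']
   })
   grades = pd.DataFrame({
       'Student_ID': [1, 2],
       'Grade': [85, 78]
   })

   # Left join: keep all rows from 'students', fill unmatched rows with NaN
   merged = pd.merge(students, grades, on='Student_ID', how='left')
   print(merged)
   #    Student_ID   Name  Grade
   # 0           1  Alice   85.0
   # 1           2    Bob   78.0
   # 2           5    Eve    NaN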
--------------

Complete Example: Life Choices Tracker
---------------------------------------

.. code-block:: python

   import pandas as pd
   import numpy as np

   # Create the infamous Life Choices Tracker
   data = {
       'Decision': [
           'Chose CS',
           'Bought Expensive Course',
           'Believed I Could Code',
           'Started Learning AI',
           'Said Yes to Group Project',
           'Procrastinated Assignment',
           'Pulled All-Nighter',
           'Skipped Documentation',
           'Used ChatGPT for Assignment',
           'Promised to Exercise'
       ],
       'Confidence_Level': [9, 8, 10, 7, 6, 3, 8, 9, 7, 10],
       'Regret_Factor': [2, 7, 5, 6, 9, 10, 7, 8, 4, 9],
       'Friends_Who_Warned_You': [2, 5, 1, 3, 7, 8, 6, 4, 2, 10],
       'Time_Until_Realization': [365, 30, 7, 60, 2, 1, 1, 14, 3, 1]
   }
   df = pd.DataFrame(data)

   print("=== Life Choices Tracker ===\n")
   print(df)

   # Calculate correlation matrix
   print("\n=== Concerning Correlations ===")
   correlations = df[['Confidence_Level', 'Regret_Factor',
                      'Friends_Who_Warned_You', 'Time_Until_Realization']].corr()
   print(correlations)

   # Key insights
   print("\n=== Reality Checks ===")

   # Correlation between confidence and regret
   conf_regret_corr = df['Confidence_Level'].corr(df['Regret_Factor'])
   print(f"Confidence vs Regret correlation: {conf_regret_corr:.2f}")
   if conf_regret_corr > 0.3:
       print("⚠️ Higher confidence = More regret (Dunning-Kruger effect detected!)")
   elif conf_regret_corr < -0.3:
       print("Here, higher confidence went with LESS regret - no Dunning-Kruger in this dataset.")

   # Decisions with high regret
   high_regret = df[df['Regret_Factor'] >= 8]
   print(f"\nDecisions you wish you could undo: {len(high_regret)}")
   print(high_regret[['Decision', 'Regret_Factor']])

   # Average time to reality check
   avg_time = df['Time_Until_Realization'].mean()
   print(f"\nAverage days until reality hits: {avg_time:.1f} days")

   # Find the decision where regret most outstripped confidence
   df['Confidence_Regret_Gap'] = df['Confidence_Level'] - df['Regret_Factor']
   worst_decision = df.loc[df['Confidence_Regret_Gap'].idxmin()]
   print(f"\nWorst decision: {worst_decision['Decision']}")
   print(f"Confidence: {worst_decision['Confidence_Level']}, Regret: {worst_decision['Regret_Factor']}")

   # Warnings ignored
   total_warnings = df['Friends_Who_Warned_You'].sum()
   print(f"\nTotal friend warnings ignored: {total_warnings}")
   print("(Maybe listen to your friends? Just a thought.)")

--------------

Advanced: String Operations
----------------------------

.. code-block:: python

   import pandas as pd

   data = {
       'Name': ['Alice Smith', 'Bob JONES', 'charlie brown'],
       'Email': ['alice@email.com', 'BOB@EMAIL.COM', 'charlie@email.com']
   }
   df = pd.DataFrame(data)

   # String methods
   df['Name_Upper'] = df['Name'].str.upper()
   df['Name_Lower'] = df['Name'].str.lower()
   df['First_Name'] = df['Name'].str.split().str[0]
   df['Email_Domain'] = df['Email'].str.split('@').str[1]
   df['Name_Length'] = df['Name'].str.len()

   print(df)
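String methods can also be chained to clean messy text in one pass. A minimal
sketch that normalizes the inconsistent casing from the example above (the
``Name_Clean`` and ``Email_Clean`` column names are made up for illustration):

.. code-block:: python

   import pandas as pd

   data = {
       'Name': ['Alice Smith', 'Bob JONES', 'charlie brown'],
       'Email': ['alice@email.com', 'BOB@EMAIL.COM', 'charlie@email.com']
   }
   df = pd.DataFrame(data)

   # Chain string methods to normalize inconsistent casing
   df['Name_Clean'] = df['Name'].str.strip().str.title()    # 'Bob JONES' -> 'Bob Jones'
   df['Email_Clean'] = df['Email'].str.strip().str.lower()  # 'BOB@EMAIL.COM' -> 'bob@email.com'

   print(df[['Name_Clean', 'Email_Clean']])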
--------------

Tasks
-----

**Task 1: Personal Life Choices Analyzer**

Create a DataFrame called ``LifeChoicesTracker`` with at least 10 rows
containing: Decision, Confidence_Level (1-10), Regret_Factor (1-10),
Friends_Who_Warned_You, Time_Until_Realization (days).

Calculate: (a) Correlation between all numerical columns, (b) Average regret
factor, (c) Decisions where Confidence > 7 but Regret > 7, (d) Total warnings
ignored.

*Hint:* Use ``df.corr()`` for the correlation matrix. Use filtering:
``df[(df['Confidence_Level'] > 7) & (df['Regret_Factor'] > 7)]``.

**Task 2: Student Grade Management System**

Create a DataFrame with 15 students having: Name, Roll_No, Python, Java,
WebDev, Database marks.

Calculate: (a) Total and Average for each student, (b) Assign grades
(A/B/C/D/F), (c) Find top 3 students, (d) Calculate subject-wise averages,
(e) Find students below 60% average.

*Hint:* Add columns using
``df['Total'] = df[['Python', 'Java', 'WebDev', 'Database']].sum(axis=1)``.
Use ``nlargest()`` for top students.

**Task 3: Time Tracking Reality Check**

Create a DataFrame with activities: Activity, Claimed_Hours, Actual_Hours for
activities like Studying, Gaming, Social Media, Sleep, Productive
Procrastination.

Calculate: (a) Self-deception margin (Claimed - Actual), (b) Total claimed vs
actual hours, (c) Activities where Claimed > Actual, (d) Biggest
self-deception.

*Hint:* Add column: ``df['Deception'] = df['Claimed_Hours'] - df['Actual_Hours']``.
Use ``idxmax()`` to find the biggest deception.

**Task 4: Multi-City Student Performance**

Create a DataFrame with 20 students from 4 cities with Grade and Attendance.

Use ``groupby()`` to calculate: (a) Average grade per city, (b) Average
attendance per city, (c) City with highest performance, (d) Count of students
per city, (e) Find the city where attendance is high but grades are low.

*Hint:* Use ``groupby('City').agg({'Grade': 'mean', 'Attendance': 'mean'})``.
Compare rankings of cities by grade vs attendance.

**Task 5: Semester Progress Tracker**

Create a DataFrame with Week (1-18), Study_Hours, Assignment_Completion (%),
Motivation_Level (1-10), Coffee_Consumed (cups).

Analyze: (a) Correlation between study hours and assignment completion,
(b) Week with highest/lowest motivation, (c) Average coffee consumption,
(d) Weeks where motivation < 5, (e) Create a "Productivity Score" =
(Study_Hours × Assignment_Completion) / (Coffee_Consumed + 1).

*Hint:* Use ``df['Week'].loc[df['Motivation_Level'].idxmin()]`` to find the
week with lowest motivation. Calculate correlation:
``df['Study_Hours'].corr(df['Assignment_Completion'])``.

--------------

Summary
-------

- Pandas DataFrames are table-like structures for data analysis
- Create DataFrames from dictionaries, lists, or CSV files
- Select data using ``[]``, ``iloc``, and ``loc`` indexing
- Filter data using boolean conditions
- Add/modify columns and rows easily
- Sort data using ``sort_values()``
- Group data using ``groupby()`` for aggregations
- Handle missing data with ``dropna()`` and ``fillna()``
- Merge DataFrames using ``merge()`` and ``concat()``
- Calculate correlations using ``corr()``
- Pandas builds on NumPy and is essential for data analysis