Understanding Pandas DataFrame Operations
Data manipulation is at the heart of any data‑science workflow, and Pandas provides a powerful DataFrame object to handle tabular data. This lesson walks you through the most frequently used DataFrame operations: sorting, selecting rows, grouping, adding columns, merging, and aggregating. By the end of the lesson you will be able to answer common interview questions and write clean, efficient Pandas code.
Sorting DataFrames with sort_values
The DataFrame.sort_values method orders rows based on one or more column values. Two arguments are essential for controlling the sort behavior:
- ascending: Determines the direction of the sort. Set ascending=True for ascending order (the default) or ascending=False for descending order.
- na_position: Controls where NaN values appear. Use na_position='first' to place missing values at the top of the sorted result, or na_position='last' to push them to the bottom.
When you also pass ignore_index=True, Pandas discards the original index and creates a fresh sequential index starting at 0. This is useful when the original index no longer reflects the logical order of the data after sorting.
Example:
df_sorted = df.sort_values(
    by='sepal_length',
    ascending=False,
    na_position='first',
    ignore_index=True
)
In this snippet the DataFrame is sorted by sepal_length in descending order, any NaN values appear first, and the index is reset to a clean range.
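The snippet above can be tried end to end on a small frame. The following is a minimal sketch: the toy data is hypothetical and only stands in for the real measurements. It also shows a two-column sort, where by and ascending are given as lists of the same length.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3, 4.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-setosa"],
})

# Single-column sort: descending, NaN first, fresh 0..n-1 index.
df_sorted = df.sort_values(
    by="sepal_length",
    ascending=False,
    na_position="first",
    ignore_index=True,
)

# Two-column sort: class ascending, then sepal_length descending.
# na_position is left at its default here, so NaN lands last within its class.
df_multi = df.sort_values(
    by=["class", "sepal_length"],
    ascending=[True, False],
    ignore_index=True,
)

print(df_sorted)
print(df_multi)
```

sort_values returns a new DataFrame, so df itself keeps its original row order and index.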
Selecting Rows Using loc
The loc accessor is the most readable way to filter rows by a condition. It accepts row labels or a boolean mask, so to retrieve rows where a column matches a specific value, pass a boolean expression to loc.
iris_virginica = df.loc[df['class'] == 'Iris-virginica']
This command returns a new DataFrame containing only the rows whose class column equals 'Iris-virginica'. Because loc works with labels, you can also slice columns simultaneously, e.g., df.loc[df['class'] == 'Iris-virginica', ['sepal_length', 'petal_width']].
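As a rough illustration of both forms, the sketch below filters a small hypothetical frame (the values are invented) and then combines two conditions. Each condition needs its own parentheses when joined with & or |.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({
    "sepal_length": [5.1, 6.3, 5.8, 7.1],
    "petal_width": [0.2, 2.5, 1.9, 2.1],
    "class": ["Iris-setosa", "Iris-virginica", "Iris-virginica", "Iris-virginica"],
})

# Boolean mask: True for the rows we want to keep.
mask = df["class"] == "Iris-virginica"

# Rows filtered by the mask, columns selected by label.
virginica = df.loc[mask, ["sepal_length", "petal_width"]]

# Two conditions combined with & (element-wise AND).
large_virginica = df.loc[mask & (df["sepal_length"] > 6.0)]

print(virginica)
print(large_virginica)
```

The selected rows keep their original index labels, which is why the first result starts at 1 rather than 0.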
Grouping Data with groupby
The groupby operation is a two‑step process. First, calling df.groupby('class') returns a DataFrameGroupBy object. This object does not modify the original DataFrame; it simply stores the grouping information.
Only after you apply an aggregation function, such as mean(), sum(), or median(), does Pandas compute a new object (a Series or a DataFrame, depending on what you selected) that holds the grouped results.
grouped = df.groupby('class')
# No calculation yet – grouped is a DataFrameGroupBy object
median_sepal = grouped['sepal_length'].median()
Notice that the original DataFrame remains unchanged throughout the process. Because computation is deferred until an aggregation is requested, the same grouped object can be reused for several aggregations.
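The two steps can be checked directly. This is a small sketch on hypothetical data: it inspects the type returned by groupby, runs two aggregations on the same grouped object, and confirms that the source frame is untouched.

```python
import pandas as pd

# Hypothetical toy data; class labels and values are made up.
df = pd.DataFrame({
    "class": ["a", "a", "b", "b", "b"],
    "sepal_length": [5.0, 5.5, 6.0, 6.5, 7.0],
})
snapshot = df.copy()  # kept only to compare against later

# Step 1: build the grouping; nothing is aggregated yet.
grouped = df.groupby("class")
print(type(grouped).__name__)

# Step 2: aggregations trigger the actual computation.
median_sepal = grouped["sepal_length"].median()
max_sepal = grouped["sepal_length"].max()

print(median_sepal)
print(df.equals(snapshot))  # the original frame is unchanged
```

Selecting a single column before aggregating yields a Series; aggregating several columns yields a DataFrame.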
Adding New Columns
Adding a column is straightforward: assign a list, NumPy array, or Pandas Series to a new column label.
df['age'] = [23, 45, 31, 27]
This line creates a new column called age and fills it with the supplied values. The length of a list or array must match the number of rows in the DataFrame, otherwise Pandas raises a ValueError. A Series is instead aligned on its index, and rows without a matching label receive NaN. The alternative df.assign(age=[...]) also works, but it returns a new DataFrame rather than modifying df in place. The direct assignment syntax is the most common and most readable.
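The sketch below walks through these cases on a hypothetical four-row frame; the column names height, score, and is_adult are invented for the example. It shows direct assignment, the ValueError raised for a list of the wrong length, index alignment when a Series is assigned, and assign returning a copy.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"name": ["Ana", "Ben", "Cy", "Dee"]})

# Direct assignment modifies df in place.
df["age"] = [23, 45, 31, 27]

# A list of the wrong length is rejected with a ValueError.
try:
    df["height"] = [170, 180]
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is aligned on its index; unmatched rows become NaN.
df["score"] = pd.Series([9.5, 7.0], index=[0, 1])

# assign returns a new DataFrame and leaves df as it was.
df_flagged = df.assign(is_adult=df["age"] >= 18)

print(length_error)
print(df)
print(df_flagged)
```

assign is convenient in method chains, while direct assignment is the shorter choice when mutating the frame is acceptable.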
Merging DataFrames
Combining two DataFrames is performed with pd.merge (or the DataFrame method merge). The on parameter specifies the column(s) that serve as the join key.
merged_df = pd.merge(left_df, right_df, on='id', how='inner')
In this example, rows with matching id values from both left_df and right_df are combined using an inner join, so ids that appear in only one of the two frames are dropped. Other join types (left, right, outer) are selected with the how argument. The optional left_index and right_index arguments let you join on index values instead of explicit columns.
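To see how the how argument changes the result, here is a minimal sketch with two hypothetical frames that share only some id values. The last call joins on the index instead of a column.

```python
import pandas as pd

# Hypothetical toy data: ids 2 and 3 appear in both frames.
left_df = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
right_df = pd.DataFrame({"id": [2, 3, 4], "score": [88, 92, 75]})

# Inner join: only ids present in both frames.
inner = pd.merge(left_df, right_df, on="id", how="inner")

# Left join: every row of left_df; score is NaN where there is no match.
left = pd.merge(left_df, right_df, on="id", how="left")

# Outer join: every id from either frame.
outer = pd.merge(left_df, right_df, on="id", how="outer")

# Index-based join: both frames are keyed by id first.
by_index = pd.merge(
    left_df.set_index("id"),
    right_df.set_index("id"),
    left_index=True,
    right_index=True,
)

print(inner)
print(left)
print(outer)
print(by_index)
```

The inner result has two rows, the left result three, and the outer result four, which makes the effect of each join type easy to compare.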
Common Aggregation Functions
After grouping, you often need summary statistics. Pandas offers a rich set of aggregation functions:
- mean: average value of each group.
- sum: total of numeric values per group.
- count: number of non‑null observations per group.
- median: middle value, useful when the distribution is skewed.
To obtain the median sepal_length for each flower class, you would write:
median_by_class = df.groupby('class')['sepal_length'].median()
The result is a Series indexed by the unique values in class, each containing the median of sepal_length for that class.
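Several of these statistics can be computed in one pass by handing a list of function names to agg. The sketch below uses hypothetical data with one missing value to show that count ignores nulls, while size counts every row in the group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; group "a" contains one missing value.
df = pd.DataFrame({
    "class": ["a", "a", "a", "b", "b"],
    "sepal_length": [5.0, 5.5, np.nan, 6.0, 7.0],
})

# One column per aggregation, one row per class.
summary = df.groupby("class")["sepal_length"].agg(
    ["mean", "sum", "count", "median"]
)

# size counts rows, including those with NaN in sepal_length.
group_sizes = df.groupby("class").size()

print(summary)
print(group_sizes)
```

mean, sum, and median skip NaN by default, so the missing value in group "a" does not distort those statistics.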
Putting It All Together: A Mini‑Project
Imagine you have a dataset of Iris flower measurements and you want to produce a clean, sorted summary table that shows the median sepal_length for each class, with missing values placed at the top and a fresh index.
# 1. Load the data
import pandas as pd
df = pd.read_csv('iris.csv')
# 2. Compute median sepal length per class
median_df = df.groupby('class', as_index=False)['sepal_length'].median()
# 3. Sort the result, putting NaN first and resetting the index
final_df = median_df.sort_values(
    by='sepal_length',
    ascending=True,
    na_position='first',
    ignore_index=True
)
print(final_df)
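This workflow combines the concepts covered in this lesson: groupby for aggregation, sort_values for ordering, na_position for handling missing data, and ignore_index for a clean final presentation. Note that median skips NaN values by default, so a NaN appears in the summary only for a class whose sepal_length values are all missing.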
Key Takeaways
- Sorting: Use ascending to set direction and na_position to control where NaN values appear.
- Index Management: ignore_index=True creates a new sequential index after sorting.
- Row Selection: df.loc[df['column'] == value] is the most readable way to filter rows by a condition.
- GroupBy Mechanics: The object returned by groupby is a lazy DataFrameGroupBy. It never alters the original data, and results are computed only when an aggregation is called.
- Adding Columns: Direct assignment df['new_col'] = [...] is the standard method.
- Merging: The on parameter defines the join key(s) for merge.
- Aggregation: Functions like median provide robust summary statistics, especially for skewed data.
Mastering these operations will make you more efficient in data cleaning, exploratory analysis, and feature engineering, all core skills for any data scientist working with Pandas.