Pandas DataFrame Operations

Understanding Pandas DataFrame Operations

Data manipulation is at the heart of any data science workflow, and Pandas provides the DataFrame object for handling tabular data. This lesson walks you through the most frequently used DataFrame operations: sorting, selecting rows, grouping, adding columns, merging, and aggregating. By the end of the lesson you will be able to answer common interview questions and write clean, efficient Pandas code.

Sorting DataFrames with `sort_values`

The DataFrame.sort_values method orders rows based on one or more column values. Two arguments are essential for controlling the sort behavior:

ascending: Determines the direction of the sort. Set ascending=True for an ascending order (the default) or ascending=False for descending order.
na_position: Controls where NaN values appear. Use na_position='first' to place missing values at the top of the sorted result, or na_position='last' (the default) to push them to the bottom.

When you also pass ignore_index=True, Pandas discards the original index and creates a fresh sequential index starting at 0. This is useful when the original index no longer reflects the logical order of the data after sorting.

Example:

df_sorted = df.sort_values(
    by='sepal_length',
    ascending=False,
    na_position='first',
    ignore_index=True
)

In this snippet the DataFrame is sorted by sepal_length in descending order, any NaN values appear first, and the index is reset to a clean range.
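To see the effect of ignore_index in isolation, the following sketch runs the same sort twice on a small made-up DataFrame. The name df_demo, its values, and its letter index are assumptions chosen for illustration; only sort_values and its arguments come from the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a letter index so reordering is easy to spot
df_demo = pd.DataFrame(
    {"sepal_length": [5.1, np.nan, 6.3, 4.7]},
    index=["a", "b", "c", "d"],
)

# Descending sort with NaN rows first; the original labels travel with the rows
kept_index = df_demo.sort_values(
    by="sepal_length", ascending=False, na_position="first"
)

# Same sort, but the index is replaced by 0..n-1
fresh_index = df_demo.sort_values(
    by="sepal_length", ascending=False, na_position="first", ignore_index=True
)

print(kept_index.index.tolist())   # ['b', 'c', 'a', 'd']
print(fresh_index.index.tolist())  # [0, 1, 2, 3]
```

Note that na_position is independent of ascending: the NaN row stays first even though the sort is descending.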

Selecting Rows Using `loc`

The loc accessor selects rows and columns by label, and it also accepts a boolean mask, which makes it a readable way to filter rows on a condition. To retrieve rows where a column matches a specific value, pass loc a boolean expression.

iris_virginica = df.loc[df['class'] == 'Iris-virginica']

This command returns a new DataFrame containing only the rows whose class column equals 'Iris-virginica'. Because loc works with labels, you can also slice columns simultaneously, e.g., df.loc[df['class'] == 'Iris-virginica', ['sepal_length', 'petal_width']].
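A minimal sketch of both forms on made-up data follows. The DataFrame df_demo and its values are hypothetical; the last line also shows one common extension, combining two conditions with &, where each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical sample; column names follow the lesson's Iris examples
df_demo = pd.DataFrame({
    "sepal_length": [5.1, 6.3, 5.8, 7.1],
    "petal_width": [0.2, 2.5, 1.9, 2.1],
    "class": ["Iris-setosa", "Iris-virginica", "Iris-virginica", "Iris-virginica"],
})

# The boolean mask is an ordinary Series of True/False values
mask = df_demo["class"] == "Iris-virginica"

rows_only = df_demo.loc[mask]                                      # all columns
rows_and_cols = df_demo.loc[mask, ["sepal_length", "petal_width"]]  # two columns

# Conditions combine with & (and) or | (or); wrap each one in parentheses
long_virginica = df_demo.loc[mask & (df_demo["sepal_length"] > 6.0)]

print(rows_and_cols)
print(long_virginica.index.tolist())  # [1, 3]
```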

Grouping Data with `groupby`

The groupby operation is a two‑step process. First, calling df.groupby('class') returns a DataFrameGroupBy object. This object does not modify the original DataFrame; it simply stores the grouping information.

Only after you apply an aggregation function, such as mean(), sum(), or median(), does Pandas compute the grouped results and return them as a new Series or DataFrame.

grouped = df.groupby('class')
# No calculation yet – grouped is a DataFrameGroupBy object
median_sepal = grouped['sepal_length'].median()

Notice that the original DataFrame remains unchanged throughout the process: aggregations return new objects rather than modifying df. Because the computation is deferred until you aggregate, one grouped object can be reused for several aggregations.
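The two steps can be checked directly. This sketch uses a hypothetical four-row DataFrame (df_demo) and keeps a copy of it (snapshot) purely to confirm afterwards that grouping and aggregating left the source data alone.

```python
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
    "sepal_length": [5.0, 5.5, 6.0, 7.0],
})
snapshot = df_demo.copy()  # kept only to compare against later

# Step 1: build the grouping; nothing is aggregated yet
grouped = df_demo.groupby("class")
print(type(grouped).__name__)  # DataFrameGroupBy

# Step 2: aggregate; the same grouped object serves both calls
median_sepal = grouped["sepal_length"].median()
mean_sepal = grouped["sepal_length"].mean()

print(median_sepal)
print(df_demo.equals(snapshot))  # True: the source DataFrame is untouched
```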

Adding New Columns

Adding a column is straightforward: assign a list, NumPy array, or Pandas Series to a new column label.

df['age'] = [23, 45, 31, 27]

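This line creates a new column called age and fills it with the supplied values. The length of a list or NumPy array must match the number of rows in the DataFrame; otherwise Pandas raises a ValueError. A Series is instead aligned on its index, so rows with no matching label receive NaN. df.assign(age=[...]) is an alternative that returns a new DataFrame and leaves the original untouched, which suits method chaining; direct assignment modifies df in place and is the more common form.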
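The differences between these assignment styles are easiest to see side by side. In the sketch below, df_demo and the column names name, score, is_adult, and bad are made up for illustration.

```python
import pandas as pd

# Hypothetical four-row DataFrame
df_demo = pd.DataFrame({"name": ["Ana", "Ben", "Cy", "Dee"]})

# Direct assignment with a list: modifies df_demo in place, length must be 4
df_demo["age"] = [23, 45, 31, 27]

# Assigning a Series aligns on index labels; unmatched rows become NaN
df_demo["score"] = pd.Series([9.5], index=[2])

# assign returns a new DataFrame and leaves df_demo as it was
with_flag = df_demo.assign(is_adult=df_demo["age"] >= 18)

# A list of the wrong length is rejected
try:
    df_demo["bad"] = [1, 2, 3]
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)

print(df_demo.columns.tolist())    # ['name', 'age', 'score']
print(with_flag.columns.tolist())  # ['name', 'age', 'score', 'is_adult']
```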

Merging DataFrames

Combining two DataFrames is performed with pd.merge (or the DataFrame method merge). The on parameter specifies the column(s) that serve as the join key.

merged_df = pd.merge(left_df, right_df, on='id', how='inner')

In this example, rows with matching id values from both left_df and right_df are combined using an inner join, so ids present in only one DataFrame are dropped. The how argument selects other join types, such as 'left', 'right', and 'outer'. Setting left_index=True or right_index=True joins on the index of that DataFrame instead of a column.
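As a rough illustration of how the join type changes the result, the sketch below merges two tiny made-up tables that share only some ids. The contents of left_df and right_df, and the name right_indexed, are assumptions for this example.

```python
import pandas as pd

# Hypothetical tables: ids 2 and 3 appear in both, 1 and 4 in only one
left_df = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
right_df = pd.DataFrame({"id": [2, 3, 4], "score": [88, 92, 75]})

inner = pd.merge(left_df, right_df, on="id", how="inner")  # ids 2, 3
left = pd.merge(left_df, right_df, on="id", how="left")    # ids 1, 2, 3
outer = pd.merge(left_df, right_df, on="id", how="outer")  # ids 1, 2, 3, 4

# Joining a column on the left against the index on the right
right_indexed = right_df.set_index("id")
by_index = pd.merge(
    left_df, right_indexed, left_on="id", right_index=True, how="inner"
)

print(inner)
print(left)   # score is NaN for id 1, which has no match on the right
print(outer)
```

Rows without a partner are kept by the left and outer joins and filled with NaN in the columns that come from the other table.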

Common Aggregation Functions

After grouping, you often need summary statistics. Pandas offers a rich set of aggregation functions:

mean: average value of each group.
sum: total of numeric values per group.
count: number of non‑null observations per group.
median: middle value, useful when the distribution is skewed.

To obtain the median sepal_length for each flower class, you would write:

median_by_class = df.groupby('class')['sepal_length'].median()

The result is a Series indexed by the unique values in class, each containing the median of sepal_length for that class.
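Several of the functions listed above can be computed in one pass with agg, which accepts a list of function names. The sketch below is an illustration on made-up data (df_demo with classes "A" and "B"); it also contrasts count, which skips NaN, with size, which counts every row.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing measurement in class "A"
df_demo = pd.DataFrame({
    "class": ["A", "A", "A", "B", "B"],
    "sepal_length": [5.0, np.nan, 7.0, 6.0, 6.5],
})

grouped = df_demo.groupby("class")["sepal_length"]

# One row per class, one column per aggregation
summary = grouped.agg(["mean", "sum", "count", "median"])

# size counts rows including NaN; count ignores NaN
rows_per_class = grouped.size()

print(summary)
print(rows_per_class)
```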

Putting It All Together: A Mini‑Project

Imagine you have a dataset of Iris flower measurements and you want to produce a clean, sorted summary table that shows the median sepal_length for each class, with any missing medians placed at the top and a fresh index.

# 1. Load the data
import pandas as pd
df = pd.read_csv('iris.csv')

# 2. Compute median sepal length per class
median_df = df.groupby('class', as_index=False)['sepal_length'].median()

# 3. Sort the result, putting NaN first and resetting the index
final_df = median_df.sort_values(
    by='sepal_length',
    ascending=True,
    na_position='first',
    ignore_index=True
)

print(final_df)

This workflow combines the concepts covered in this lesson: groupby for aggregation, sort_values for ordering, na_position for handling missing data, and ignore_index for a clean final index.
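If no iris.csv file is at hand, the same pipeline can be tried on inline data. The sketch below is a self-contained variant under that assumption: df_demo and its values are made up, and one class deliberately has no valid measurement so that its median is NaN and the effect of na_position becomes visible.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV; Iris-virginica has no valid sepal_length
df_demo = pd.DataFrame({
    "class": [
        "Iris-setosa", "Iris-setosa",
        "Iris-versicolor", "Iris-versicolor",
        "Iris-virginica",
    ],
    "sepal_length": [5.0, 5.5, 6.0, 6.5, np.nan],
})

# Group, aggregate, and sort in a single chain
final_demo = (
    df_demo.groupby("class", as_index=False)["sepal_length"]
    .median()
    .sort_values(
        by="sepal_length",
        ascending=True,
        na_position="first",
        ignore_index=True,
    )
)

print(final_demo)
```

The class with the NaN median lands in row 0, followed by the remaining classes in ascending order of their medians.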

Key Takeaways

Sorting: Use ascending to set direction and na_position to control where NaN values appear.
Index Management: ignore_index=True creates a new sequential index after sorting.
Row Selection: df.loc[df['column'] == value] is a readable way to filter rows by a condition.
GroupBy Mechanics: groupby returns a DataFrameGroupBy object that defers computation until an aggregation is called; neither step alters the original data.
Adding Columns: Direct assignment df['new_col'] = [...] is the standard method.
Merging: The on parameter defines the join key(s) for merge.
Aggregation: Functions like median provide robust summary statistics, especially for skewed data.

Mastering these operations will make you more efficient in data cleaning, exploratory analysis, and feature engineering—core skills for any data scientist working with Pandas.

Review Questions

Which argument of DataFrame.sort_values determines the order direction of the sort?

If you want NaN values to appear at the beginning of a sorted DataFrame, which parameter value should you set?

Which method would you use to select rows where the column 'class' equals 'Iris-virginica'?

When using DataFrame.groupby, what kind of object is returned before any aggregation is applied?

Which call adds a new column named 'age' using a list of values?

In a merge operation, which parameter specifies the column(s) on which the two DataFrames are joined?

If you set ignore_index=True while sorting a DataFrame, what is the effect on the index?

Which aggregation function would you use with groupby to obtain the median of 'sepal_length' for each flower class?

When filtering columns using df.filter(items=[...]), what type of argument does 'items' expect?

Which sorting algorithm can be selected via the 'kind' argument for a stable sort?