Understanding HDFS Architecture and Operations
Hadoop Distributed File System (HDFS) is the storage backbone of the Apache Hadoop ecosystem. Designed to run on commodity hardware, HDFS provides high-throughput access to large data sets and tolerates hardware failures gracefully. This course breaks down the core concepts that appear in typical HDFS quizzes, explains why they matter, and shows how they fit together in real-world deployments.
Core Components of HDFS
HDFS follows a master‑slave architecture. The three primary daemons are NameNode, DataNode, and SecondaryNameNode. Understanding their responsibilities is essential for mastering HDFS operations.
NameNode – the Metadata Manager
The NameNode stores the namespace image (fsimage) and the edit log. It knows the directory tree, file permissions, and the mapping of each file block to the DataNodes that host its replicas. Because the NameNode holds the authoritative metadata, it must be highly available and well‑protected.
- Namespace operations: create, delete, rename files and directories.
- Block allocation: when a client writes data, the NameNode assigns block IDs and returns a list of DataNode addresses.
- Read assistance: during a file read, the NameNode supplies the client with the locations of the replicas for each requested block.
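To make the NameNode's role concrete, here is a toy sketch of its two core lookups: the namespace (path to block IDs) and the block map (block ID to replica hosts). The dictionary structures, block IDs, and `dn1:9866`-style addresses are illustrative stand-ins, not Hadoop internals.

```python
# Toy model of NameNode metadata: namespace tree and block map.
# All names and structures are illustrative, not real Hadoop internals.

namespace = {"/logs/app.log": ["blk_1", "blk_2"]}   # path -> ordered block IDs
block_map = {
    "blk_1": ["dn1:9866", "dn2:9866", "dn3:9866"],  # 3 replicas per block
    "blk_2": ["dn2:9866", "dn3:9866", "dn4:9866"],
}

def get_block_locations(path):
    """What a client receives before reading: one replica list per block."""
    return [(blk, block_map[blk]) for blk in namespace[path]]

print(get_block_locations("/logs/app.log"))
```

Note that the NameNode never touches block contents here; it only answers "which blocks, on which nodes", which is exactly the division of labor HDFS relies on.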
DataNode – the Storage Worker
Each DataNode manages a set of local disks that store actual block data. DataNodes periodically send heartbeats and block reports to the NameNode, confirming their health and the blocks they hold.
- Read requests: serve block data to clients or other DataNodes.
- Write pipeline: receive data from a client and forward it to the next replica in the pipeline.
- Failure handling: if a DataNode crashes, the NameNode detects the missing heartbeats and schedules replication to restore the desired replication factor.
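The heartbeat-based failure detection above can be sketched in a few lines. The 600-second timeout below is a simplification chosen for illustration; real HDFS uses a staged stale/dead policy with configurable intervals.

```python
# Illustrative sketch: flag DataNodes whose heartbeats have stopped.
# The single fixed timeout is a simplification of HDFS's actual policy.

HEARTBEAT_TIMEOUT = 600  # seconds without a heartbeat before a node is "dead"

def find_dead_nodes(last_heartbeat, now):
    """last_heartbeat: dict of node -> timestamp (seconds) of last heartbeat."""
    return [node for node, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT]

beats = {"dn1": 1000.0, "dn2": 390.0, "dn3": 995.0}
print(find_dead_nodes(beats, now=1000.0))  # dn2 missed too many heartbeats
```

Once a node lands on the dead list, the NameNode knows every block that node held (from its last block report) and can schedule re-replication for each one.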
SecondaryNameNode – the Checkpoint Helper
Despite its name, the SecondaryNameNode does not act as a hot standby. Its primary job is to periodically merge the edit log into the fsimage file, creating a new checkpoint. This reduces the size of the edit log and speeds up NameNode restarts.
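The checkpoint operation is essentially "replay the edit log against the last snapshot". Here is a minimal sketch under that framing; real fsimage and edit-log files are binary formats, so plain dicts and tuples stand in for them.

```python
# Minimal sketch of checkpointing: replay the edit log against the last
# fsimage snapshot to produce a new snapshot, then truncate the log.
# Real fsimage/edit-log formats are binary; dicts stand in here.

def checkpoint(fsimage, edit_log):
    image = dict(fsimage)                 # start from the last snapshot
    for op, path, value in edit_log:      # replay every logged operation
        if op == "create":
            image[path] = value
        elif op == "delete":
            image.pop(path, None)
    return image, []                      # new snapshot, emptied edit log

image = {"/a": "meta_a"}
log = [("create", "/b", "meta_b"), ("delete", "/a", None)]
new_image, new_log = checkpoint(image, log)
print(new_image)   # {'/b': 'meta_b'}
```

This is why a long-unmerged edit log slows restarts: without checkpoints, the NameNode must replay every logged operation at startup instead of loading one compact image.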
Data Blocks, Replication, and Block Size
HDFS stores files as a collection of fixed-size blocks. The classic default block size is 64 MiB (Hadoop 2 and later raised the default to 128 MiB), which is much larger than the 4 KiB blocks typical of traditional file systems. Larger blocks reduce the number of seeks required to read a file, improving throughput for big-data workloads.
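The mapping of a file onto fixed-size blocks is plain ceiling division, with the final block allowed to be smaller than the block size. A quick sketch:

```python
# How a file maps onto fixed-size blocks: ceiling division, with the
# final block allowed to be smaller than the block size.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB, the classic HDFS default

def split_into_blocks(file_size):
    n = -(-file_size // BLOCK_SIZE)                     # ceiling division
    last = file_size - (n - 1) * BLOCK_SIZE if n else 0  # size of final block
    return n, last

blocks, last_block = split_into_blocks(200 * 1024 * 1024)  # a 200 MiB file
print(blocks, last_block // (1024 * 1024))  # 4 blocks; last one is 8 MiB
```

Note the final block occupies only as much disk as it needs; a 200 MiB file consumes 200 MiB per replica, not 4 full blocks.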
Replication Factor
To achieve fault tolerance, each block is replicated across multiple DataNodes. The default replication factor is 3 copies. This means that for every block, HDFS maintains three identical replicas on distinct nodes. If one DataNode fails, the remaining two replicas ensure uninterrupted reads, and the NameNode triggers the creation of a new replica to restore the factor to three.
Why 64 MiB?
Choosing a larger block size offers two key advantages:
- Reduced seek overhead: fewer blocks per file mean fewer metadata lookups and disk seeks.
- Efficient network utilization: when streaming data across the cluster, each transfer moves a sizable chunk, lowering protocol overhead.
Administrators can adjust the block size per file or globally (via the dfs.blocksize property), but a large-block default, 64 MiB classically and 128 MiB in Hadoop 2 and later, remains the right choice for most Hadoop installations.
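Both the block size and the replication factor discussed below are ordinary site settings in hdfs-site.xml. The property names dfs.blocksize and dfs.replication are standard; the values here are illustrative.

```xml
<!-- hdfs-site.xml: block size and replication are ordinary site settings.
     Values are illustrative; recent Hadoop releases also accept suffixed
     values such as 64m or 128m for dfs.blocksize. -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value>   <!-- 64 MiB in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

Per-file overrides are also possible at create time, so a cluster-wide default does not lock every workload into one block size.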
File Read Workflow
When a client wants to read a file, the following steps occur:
- The client contacts the NameNode to obtain the block list and the addresses of the DataNodes that store each block.
- The NameNode returns a block location map, typically containing three DataNode addresses per block (reflecting the replication factor).
- The client then streams the data directly from the chosen DataNode(s). If a DataNode becomes unavailable mid‑stream, the client automatically switches to another replica.
This design ensures that the NameNode is only involved in the metadata lookup phase, keeping the data path highly scalable.
Handling DataNode Failures During Reads
If a DataNode fails while serving a block, HDFS does not abort the operation. Instead, the client reads the block from the remaining replica DataNodes. The client’s read logic retries with another replica from the list supplied by the NameNode, guaranteeing continuity without user intervention.
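The client-side retry described above amounts to walking the replica list the NameNode supplied. Here is a hedged sketch; the `fetch` callback and node names are hypothetical stand-ins for the real data-transfer protocol.

```python
# Sketch of client-side read retry: try each replica in the order the
# NameNode returned; a failed DataNode just means moving to the next one.

def read_block(block_id, replicas, fetch):
    """fetch(node, block_id) returns bytes or raises ConnectionError."""
    errors = []
    for node in replicas:
        try:
            return fetch(node, block_id)
        except ConnectionError as e:
            errors.append((node, str(e)))   # remember and fall through
    raise IOError(f"all replicas failed for {block_id}: {errors}")

def flaky_fetch(node, block_id):
    if node == "dn1":                       # simulate a crashed DataNode
        raise ConnectionError("dn1 down")
    return b"block-bytes-from-" + node.encode()

print(read_block("blk_1", ["dn1", "dn2", "dn3"], flaky_fetch))
```

Only when every replica in the list fails does the read surface an error, which with three replicas requires three simultaneous node failures.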
Write Semantics and File Creation
HDFS follows a Write Once Read Many (WORM) model. Once a file is closed, it cannot be modified in place; you can only append new data or create a new file. This simplifies consistency and enables high‑throughput writes.
Creating a New File
The write path involves several components:
- FSDataOutputStream: the client‑side API that initiates the write.
- DataStreamer: an internal thread that contacts the NameNode for block locations and streams data to the chosen DataNodes.
- NameNode: allocates block IDs and returns a pipeline of DataNode addresses for each block.
During file creation, the DataStreamer requests block locations from the NameNode. The NameNode replies with a list of DataNodes for the first block, and the DataStreamer opens a write pipeline to those nodes.
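The pipeline idea, where each packet is stored at one node and forwarded to the next, can be sketched as below. This toy version walks the chain in a single loop; in real HDFS each DataNode forwards to its downstream peer and acks flow back up the pipeline.

```python
# Sketch of the write pipeline: the client sends each packet toward the
# first DataNode; every node stores the packet and passes it downstream.

def pipeline_write(packet, pipeline, stores):
    """pipeline: ordered DataNode names; stores: node -> list of packets."""
    for node in pipeline:             # real nodes forward to each other;
        stores[node].append(packet)   # modeled here as a simple chain walk
    return len(pipeline)              # acks expected once all replicas store

stores = {"dn1": [], "dn2": [], "dn3": []}
acks = pipeline_write(b"pkt-0", ["dn1", "dn2", "dn3"], stores)
print(acks, stores["dn3"])  # 3 acks; the last replica also holds the packet
```

Because the client only talks to the head of the pipeline, its outbound bandwidth cost is one copy of the data regardless of the replication factor.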
Duplicate File Handling
If a client attempts to create a file that already exists, the NameNode enforces namespace rules by throwing an IOException. This protects existing data from accidental overwrites. Clients can choose to delete the old file first or use a different filename.
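A toy namespace check makes this rule concrete. The class below is purely illustrative; it mirrors the behavior (an IOException on duplicate create, with an explicit overwrite escape hatch) rather than any real NameNode code.

```python
# Sketch of the namespace check on create: an existing path is rejected,
# mirroring the IOException a real NameNode would raise.

class ToyNameNode:
    def __init__(self):
        self.namespace = set()

    def create(self, path, overwrite=False):
        if path in self.namespace and not overwrite:
            raise IOError(f"File already exists: {path}")
        self.namespace.add(path)

nn = ToyNameNode()
nn.create("/data/report.csv")
try:
    nn.create("/data/report.csv")   # second create fails
except IOError as e:
    print(e)                        # File already exists: /data/report.csv
```

The real client API exposes the same choice: pass an overwrite flag, delete the old file first, or pick a new name.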
Maintenance Operations: Checkpointing and Replication Management
Two background processes keep the HDFS cluster healthy:
- Checkpointing (performed by the SecondaryNameNode): merges the edit log into the fsimage, creating a fresh snapshot of the namespace.
- Replication monitoring (handled by the NameNode): ensures that each block maintains the configured replication factor, launching new replicas when nodes fail.
These tasks run periodically and are transparent to end users, but they are crucial for fast NameNode restarts and data durability.
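The replication-monitoring task above reduces to comparing each block's live replica count against the target. A minimal sketch, assuming the toy block map and node names used earlier in this course:

```python
# Sketch of replication monitoring: given live nodes and the block map,
# find blocks whose live replica count fell below the target and say how
# many new copies each needs.

TARGET_REPLICATION = 3

def under_replicated(block_map, live_nodes):
    work = {}
    for blk, replicas in block_map.items():
        alive = [n for n in replicas if n in live_nodes]
        if len(alive) < TARGET_REPLICATION:
            work[blk] = TARGET_REPLICATION - len(alive)  # copies to schedule
    return work

block_map = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn3", "dn4"]}
print(under_replicated(block_map, live_nodes={"dn1", "dn2", "dn3"}))
# blk_2 lost dn4, so it needs one new replica
```

The real NameNode additionally prioritizes this work queue (a block down to one replica is copied before a block down to two), but the core check is this comparison.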
Best Practices for Optimizing HDFS Performance
To get the most out of HDFS, consider the following recommendations:
- Align block size with workload: large analytical jobs benefit from the large default blocks (64 MiB classically, 128 MiB in Hadoop 2 and later), while smaller, random-access workloads may require custom block sizes.
- Monitor replication health: use Hadoop's fsck tool and web UI to detect under-replicated blocks and trigger corrective actions.
- Plan for NameNode high availability: deploy a standby NameNode and configure automatic failover to avoid a single point of failure.
- Leverage append support wisely: while HDFS allows appends, frequent small appends can degrade performance; batch appends when possible.
Summary of Key Concepts
- The default replication factor in HDFS is 3 copies.
- During a read, the NameNode provides the client with DataNode addresses for each block.
- If a DataNode fails mid‑read, the client reads the block from another replica.
- The SecondaryNameNode merges the edit log into the fsimage to create checkpoints.
- HDFS uses a large default block size (64 MiB classically, 128 MiB in Hadoop 2 and later) to reduce seek time.
- When creating a file, the DataStreamer requests block locations from the NameNode.
- HDFS follows a Write Once Read Many (WORM) semantics.
- Attempting to create an existing file causes the NameNode to throw an IOException.
By mastering these fundamentals, you will be prepared to design, troubleshoot, and optimize Hadoop clusters that rely on HDFS for reliable, scalable storage.