Example: let’s assume you are running a cluster with
3 data nodes and you are using a replication level of
3, ie. you want
3 good copies on
3 different machines of each block on your hdfs. If now one of your block copies goes bad (if it’s bit rot, ie. not detected through the checksum verification done during read access, the default settings will let this go unnoticed for up to around 3 weeks), the name node will get notified at some point and then look for a free data node, ie. one that has space and does not yet contain that specific block. And because one needs to manually remove bad blocks from HDFS, it won’t find any free data node until this manual intervention happens. So your replication level for that block will be below the target of
3 until you manually run the corresponding
fsck command. In other words: even if your
3 data nodes work flawlessly except for random bit rot, and would have enough free space for re-creating good copies, sooner or later your entire HDFS will be turned into dust without that manual repair.
Conclusion: if you really want to make sure that the replication level is kept at target level, make a contingency plan. If you want to allow up to two bad replicas for one and the same block without losing your replication level for that block until someone notices it through monitoring and fixes it, you’ll need at least two extra data nodes, ie.
5 in our example.
You might also want to think about tinkering with
dfs.block.scanner.volume.bytes.per.second in order to be able to detect bit rot as early as possible.
If you want to tinker around with HDFS behaviour, my scripts are available at github: https://github.com/jjYBdx4IL/example-maven-project-setups/tree/master/hdfs-example
HDFS configuration parameters reference: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
HDFS version used: 2.8.1