When the NameNode is started or restarted, it sometimes goes into a maintenance state called Safe Mode. While the NameNode is in Safe Mode it does not allow any changes (writes) to the file system: the HDFS cluster is read-only, and the NameNode does not replicate or delete blocks.
Sometimes it takes a long time for the NameNode to come out of Safe Mode. If we check the NameNode logs, one of the messages reads that the NameNode still needs information about 8,293 blocks before it can safely come out of Safe Mode.
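To confirm whether the NameNode is currently in this state, you can query it with dfsadmin (assuming the hdfs client on your machine is configured for the cluster):
hdfs dfsadmin -safemode get
It prints whether Safe Mode is ON or OFF.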
The reported blocks 217165 needs additional 8293 blocks to reach the threshold 0.7722 of total blocks 426638. Safe mode will be turned off automatically.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1621)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:2871)
at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:1765)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:413)
at java.lang.reflect.Method.invoke(Method.java:696)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:598)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:9986)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1262)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:4061)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:769)
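The threshold mentioned in the message is controlled by the dfs.namenode.safemode.threshold-pct property in hdfs-site.xml (dfs.safemode.threshold.pct on older 1.x releases; the default is 0.999). As a quick sanity check, you can print the effective value with getconf:
hdfs getconf -confKey dfs.namenode.safemode.threshold-pct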
Missing or Corrupt Blocks
If a lot of DataNodes are down or cannot reach the NameNode, the NameNode may be missing so many blocks that it cannot reach the threshold, and it will stay in Safe Mode for a long time. In such a case, make sure all DataNodes are up and connected to the NameNode.
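A quick way to see which DataNodes the NameNode can actually reach is the dfsadmin report, which lists live and dead nodes along with their capacity and block information:
hdfs dfsadmin -report
Any node listed as dead (or missing from the report entirely) should be restarted and allowed to re-register before you treat its blocks as lost.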
The other reason for an extended Safe Mode is blocks that are corrupt and cannot be recovered. In the case of corrupted blocks, we have to delete them (and the files they belong to).
In such scenarios, force the NameNode out of Safe Mode by running the command below.
hdfs dfsadmin -safemode leave
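If you would rather wait for the NameNode to leave Safe Mode on its own, for example from a startup script, dfsadmin also has a blocking wait option:
hdfs dfsadmin -safemode wait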
Then use the fsck command to look at the blocks for all files in your cluster.
hdfs fsck /
Look through the output for missing or corrupt blocks (ignore under-replicated blocks for now). This command is very verbose, especially on a large HDFS filesystem, so run the command below instead, which filters out the lines containing nothing but dots and the lines about replication.
hdfs fsck / | egrep -v '^\.+$' | grep -v replica
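Depending on your Hadoop version, fsck can also list the affected files directly, which saves the grep step:
hdfs fsck / -list-corruptfileblocks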
Once you find a file that is corrupt, use that output to determine where its blocks live. If the file is larger than your block size, it will have multiple blocks.
hdfs fsck /path/to/corrupt/file -locations -blocks -files
If you still cannot determine the reason for the corrupt or missing blocks, you have no option other than removing the file from the system to make HDFS healthy again.
hdfs dfs -rm /path/to/file-with-missing-corrupt-blocks
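If many files are affected, fsck itself can delete the corrupted ones in a single pass; use this with care, because the data is permanently removed:
hdfs fsck / -delete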