An Incident Where Hadoop DataNodes Suddenly Went Down


Hadoop DataNodes Down at a Client Site

An incident occurred on a client's Hadoop cluster in which DataNodes suddenly went down.
We detected the dead DataNode servers with the following command.

[root@datanode04 hadoop-hdfs]# sudo -u hdfs hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Configured Capacity: 8596032528384 (7.82 TB)
Present Capacity: 8596032528384 (7.82 TB)
DFS Remaining: 8012100371990 (7.29 TB)
DFS Used: 583932156394 (543.83 GB)
DFS Used%: 6.79%
Under replicated blocks: 885214
Blocks with corrupt replicas: 0
Missing blocks: 0


Datanodes available: 3 (5 total, 2 dead)

Live datanodes:
Name: ..*.160:50010 (datanode01.*********)
Hostname: datanode01.*********
Decommission Status : Normal
Configured Capacity: 2865344176128 (2.61 TB)
DFS Used: 161568006380 (150.47 GB)
Non DFS Used: 0 (0 B)
DFS Remaining: 2703776169748 (2.46 TB)
DFS Used%: 5.64%
DFS Remaining%: 94.36%
Last contact: Thu Sep 19 22:49:08 JST 2013

Name: ..*.164:50010 (datanode05.*********)
Hostname: datanode05.*********
Decommission Status : Normal
Configured Capacity: 2865344176128 (2.61 TB)
DFS Used: 206684559310 (192.49 GB)
Non DFS Used: 0 (0 B)
DFS Remaining: 2658659616818 (2.42 TB)
DFS Used%: 7.21%
DFS Remaining%: 92.79%
Last contact: Thu Sep 19 22:49:08 JST 2013

Name: ..*.161:50010 (datanode02.*********)
Hostname: datanode02.*********
Decommission Status : Normal
Configured Capacity: 2865344176128 (2.61 TB)
DFS Used: 215679590704 (200.87 GB)
Non DFS Used: 0 (0 B)
DFS Remaining: 2649664585424 (2.41 TB)
DFS Used%: 7.53%
DFS Remaining%: 92.47%
Last contact: Thu Sep 19 22:49:08 JST 2013

Dead datanodes:
Name: ..*.162:50010 (datanode03.*********)
Hostname: datanode03.*********
Decommission Status : Normal
Configured Capacity: 0 (0 B)
DFS Used: 0 (0 B)
Non DFS Used: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used%: 100.00%
DFS Remaining%: 0.00%
Last contact: Thu Sep 19 19:28:30 JST 2013

Name: ..*.163:50010 (datanode04.*********)
Hostname: datanode04.*********
Decommission Status : Normal
Configured Capacity: 0 (0 B)
DFS Used: 0 (0 B)
Non DFS Used: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used%: 100.00%
DFS Remaining%: 0.00%
Last contact: Thu Sep 19 12:59:03 JST 2013

Results of the hadoop fsck Command

When we ran the sudo -u hdfs hadoop fsck / command on sn03 and the other nodes that had gone dead, exceptions like the following were being thrown.

java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
2013-09-19 13:31:59,444 ERROR org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: Exception during DirectoryScanner execution - will continue next cycle
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
2013-09-19 15:43:03,223 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-344499712-...150-1364229738423:blk_-4174186284952057055_4319048
java.io.IOException: Premature EOF from inputStream
2013-09-19 15:43:03,224 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-...150-1364229738423:blk_-4174186284952057055_4319048 received exception java.io.IOException: Premature EOF from inputStream
java.io.IOException: Premature EOF from inputStream
2013-09-20 21:33:47,138 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-...150-1364229738423:blk_4860043378865753444_4381869 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_4860043378865753444_4381869 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_4860043378865753444_4381869 already exists in state FINALIZED and thus cannot be created.
2013-09-20 21:33:50,137 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-...150-1364229738423:blk_5963737320053202074_4381874 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_5963737320053202074_4381874 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_5963737320053202074_4381874 already exists in state FINALIZED and thus cannot be created.
2013-09-20 21:33:50,137 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-...150-1364229738423:blk_-5837002631844387540_4381875 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_-5837002631844387540_4381875 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_-5837002631844387540_4381875 already exists in state FINALIZED and thus cannot be created.
2013-09-21 17:41:18,892 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-...150-1364229738423:blk_3113932042876444558_4421687 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_3113932042876444558_4421687 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_3113932042876444558_4421687 already exists in state FINALIZED and thus cannot be created.
2013-09-22 11:06:26,073 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-...150-1364229738423:blk_6261188228073565169_4453553 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_6261188228073565169_4453553 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_6261188228073565169_4453553 already exists in state FINALIZED and thus cannot be created.
2013-09-23 17:47:17,835 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-...150-1364229738423:blk_-1008577895813901266_4514387 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_-1008577895813901266_4514387 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_-1008577895813901266_4514387 already exists in state FINALIZED and thus cannot be created.
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
2013-09-23 18:08:25,066 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception: 
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
2013-09-23 18:08:25,066 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Not checking disk as checkDiskError was called on a network related exception
2013-09-25 17:50:06,184 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-...150-1364229738423:blk_-4971668669529423648_4610204 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_-4971668669529423648_4610204 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_-4971668669529423648_4610204 already exists in state FINALIZED and thus cannot be created.
2013-09-26 05:30:16,170 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-...150-1364229738423:blk_7229145188033767785_4631196 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_7229145188033767785_4631196 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-...150-1364229738423:blk_7229145188033767785_4631196 already exists in state FINALIZED and thus cannot be created.
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
2013-09-26 05:45:03,391 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception: 
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
2013-09-26 05:45:03,391 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Not checking disk as checkDiskError was called on a network related exception
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
2013-09-26 15:07:22,357 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception: 
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
2013-09-26 15:07:22,357 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Not checking disk as checkDiskError was called on a network related exception
java.io.IOException: Error in deleting blocks.

OutOfMemory Errors

We confirmed that OutOfMemory errors had occurred on both datanode03 and datanode04. However, because only part of the logs could be collected, the full message for each occurrence and the exact time it happened are unknown.

・Errors on datanode03
OutOfMemory errors (all of them GC overhead limit exceeded) occurred on the following lines.
Judging from the surrounding log entries, they are estimated to have occurred roughly between 2013-09-19 13:31:59,444 and 2013-09-19 15:43:03,223.

java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
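GC overhead limit exceeded generally means the JVM is spending nearly all of its time in garbage collection while reclaiming almost nothing, which points to an undersized heap. As a hedged sketch only: on CDH4-era deployments, the DataNode heap is commonly enlarged via HADOOP_DATANODE_OPTS in hadoop-env.sh. The path and the 4 GB figure below are illustrative assumptions, not values taken from this cluster.

```shell
# /etc/hadoop/conf/hadoop-env.sh  -- illustrative sketch, tune for your cluster.
# Sets the DataNode JVM's initial and maximum heap to 4 GB so the collector
# has more headroom before hitting the GC overhead limit.
export HADOOP_DATANODE_OPTS="-Xms4g -Xmx4g ${HADOOP_DATANODE_OPTS}"
```

The DataNode process must be restarted for the new heap settings to take effect.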

Remediation

For now, the problem was temporarily resolved by increasing the DataNodes' memory. However, we still don't know what triggered it in the first place. The affected version is Cloudera CDH 4.3.0. For the time being, we plan to keep watching the cluster by adding DataNode liveness monitoring based on the hadoop dfsadmin -report command.
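As a starting point for that monitoring, something like the following sketch could extract the dead-node count from the report's summary line and raise an alert when it is non-zero. The sed pattern and the alert hook are assumptions; the embedded report line is a stub standing in for the live output of sudo -u hdfs hdfs dfsadmin -report.

```shell
# Stub for the report output; in production, replace with:
#   report=$(sudo -u hdfs hdfs dfsadmin -report)
report=$(cat <<'EOF'
Datanodes available: 3 (5 total, 2 dead)
EOF
)

# Pull the dead-node count out of the "(N total, M dead)" summary line.
dead=$(printf '%s\n' "$report" | sed -n 's/.*([0-9]* total, \([0-9]*\) dead).*/\1/p')
echo "dead datanodes: $dead"

if [ "${dead:-0}" -gt 0 ]; then
  # Alert hook is a placeholder: swap in mail, Nagios, Zabbix, etc.
  echo "ALERT: $dead DataNode(s) are dead"
fi
```

Run from cron, this gives a crude but serviceable liveness check until a proper monitoring integration is in place.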