Hadoopでデータノードが突然ダウンする障害

Takayoshi Saito · 2015-08-07


A Hadoop datanode went down at a client site

In a client's Hadoop cluster, datanodes went down without warning. We identified the dead datanode servers with the following command.

# sudo -u hdfs hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Configured Capacity: 8596032528384 (7.82 TB)
Present Capacity: 8596032528384 (7.82 TB)
DFS Remaining: 8012100371990 (7.29 TB)
DFS Used: 583932156394 (543.83 GB)
DFS Used%: 6.79%
Under replicated blocks: 885214
Blocks with corrupt replicas: 0
Missing blocks: 0

---

Datanodes available: 3 (5 total, 2 dead)

Live datanodes:
Name: *.*.*.160:50010 (datanode01.*********)
Hostname: datanode01.*********
Decommission Status : Normal
Configured Capacity: 2865344176128 (2.61 TB)
DFS Used: 161568006380 (150.47 GB)
Non DFS Used: 0 (0 B)
DFS Remaining: 2703776169748 (2.46 TB)
DFS Used%: 5.64%
DFS Remaining%: 94.36%
Last contact: Thu Sep 19 22:49:08 JST 2013

Name: *.*.*.164:50010 (datanode05.*********)
Hostname: datanode05.*********
Decommission Status : Normal
Configured Capacity: 2865344176128 (2.61 TB)
DFS Used: 206684559310 (192.49 GB)
Non DFS Used: 0 (0 B)
DFS Remaining: 2658659616818 (2.42 TB)
DFS Used%: 7.21%
DFS Remaining%: 92.79%
Last contact: Thu Sep 19 22:49:08 JST 2013

Name: *.*.*.161:50010 (datanode02.*********)
Hostname: datanode02.*********
Decommission Status : Normal
Configured Capacity: 2865344176128 (2.61 TB)
DFS Used: 215679590704 (200.87 GB)
Non DFS Used: 0 (0 B)
DFS Remaining: 2649664585424 (2.41 TB)
DFS Used%: 7.53%
DFS Remaining%: 92.47%
Last contact: Thu Sep 19 22:49:08 JST 2013

Dead datanodes:
Name: *.*.*.162:50010 (datanode03.*********)
Hostname: datanode03.*********
Decommission Status : Normal
Configured Capacity: 0 (0 B)
DFS Used: 0 (0 B)
Non DFS Used: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used%: 100.00%
DFS Remaining%: 0.00%
Last contact: Thu Sep 19 19:28:30 JST 2013

Name: *.*.*.163:50010 (datanode04.*********)
Hostname: datanode04.*********
Decommission Status : Normal
Configured Capacity: 0 (0 B)
DFS Used: 0 (0 B)
Non DFS Used: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used%: 100.00%
DFS Remaining%: 0.00%
Last contact: Thu Sep 19 12:59:03 JST 2013 

Results of the hadoop fsck command

Running the `sudo -u hdfs hadoop fsck /` command against datanode03 and the other dead node showed that exceptions like the following had been thrown.

java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
2013-09-19 13:31:59,444 ERROR org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: Exception during DirectoryScanner execution - will continue next cycle
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
2013-09-19 15:43:03,223 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-344499712-*.*.*.150-1364229738423:blk_-4174186284952057055_4319048
java.io.IOException: Premature EOF from inputStream
2013-09-19 15:43:03,224 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-*.*.*.150-1364229738423:blk_-4174186284952057055_4319048 received exception java.io.IOException: Premature EOF from inputStream
java.io.IOException: Premature EOF from inputStream
2013-09-20 21:33:47,138 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-*.*.*.150-1364229738423:blk_4860043378865753444_4381869 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_4860043378865753444_4381869 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_4860043378865753444_4381869 already exists in state FINALIZED and thus cannot be created.
2013-09-20 21:33:50,137 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-*.*.*.150-1364229738423:blk_5963737320053202074_4381874 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_5963737320053202074_4381874 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_5963737320053202074_4381874 already exists in state FINALIZED and thus cannot be created.
2013-09-20 21:33:50,137 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-*.*.*.150-1364229738423:blk_-5837002631844387540_4381875 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_-5837002631844387540_4381875 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_-5837002631844387540_4381875 already exists in state FINALIZED and thus cannot be created.
2013-09-21 17:41:18,892 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-*.*.*.150-1364229738423:blk_3113932042876444558_4421687 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_3113932042876444558_4421687 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_3113932042876444558_4421687 already exists in state FINALIZED and thus cannot be created.
2013-09-22 11:06:26,073 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-*.*.*.150-1364229738423:blk_6261188228073565169_4453553 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_6261188228073565169_4453553 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_6261188228073565169_4453553 already exists in state FINALIZED and thus cannot be created.
2013-09-23 17:47:17,835 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-*.*.*.150-1364229738423:blk_-1008577895813901266_4514387 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_-1008577895813901266_4514387 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_-1008577895813901266_4514387 already exists in state FINALIZED and thus cannot be created.
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
2013-09-23 18:08:25,066 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception: 
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
2013-09-23 18:08:25,066 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Not checking disk as checkDiskError was called on a network related exception
2013-09-25 17:50:06,184 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-*.*.*.150-1364229738423:blk_-4971668669529423648_4610204 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_-4971668669529423648_4610204 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_-4971668669529423648_4610204 already exists in state FINALIZED and thus cannot be created.
2013-09-26 05:30:16,170 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-344499712-*.*.*.150-1364229738423:blk_7229145188033767785_4631196 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_7229145188033767785_4631196 already exists in state FINALIZED and thus cannot be created.
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-344499712-*.*.*.150-1364229738423:blk_7229145188033767785_4631196 already exists in state FINALIZED and thus cannot be created.
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
2013-09-26 05:45:03,391 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception: 
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
2013-09-26 05:45:03,391 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Not checking disk as checkDiskError was called on a network related exception
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
2013-09-26 15:07:22,357 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception: 
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
2013-09-26 15:07:22,357 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Not checking disk as checkDiskError was called on a network related exception
java.io.IOException: Error in deleting blocks.

OutOfMemory errors

We confirmed that OutOfMemory errors occurred on both datanode03 and datanode04. However, because only part of the logs was captured, the full message of each occurrence and the exact time each event happened are unknown.

Errors on datanode03: OutOfMemory errors (all "GC overhead limit exceeded") occurred on the lines below. Judging from the surrounding log entries, they happened roughly between 2013-09-19 13:31:59,444 and 2013-09-19 15:43:03,223.

java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
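The time-window estimate above had to be made by hand because OOM stack-trace lines carry no timestamp of their own. That bracketing step can be sketched mechanically; this is a minimal illustration assuming the log4j timestamp format seen in these logs, and `bracket_oom_events` is our own name.

```python
# Minimal sketch: bracket untimestamped OutOfMemoryError lines between the
# nearest timestamped log entries, mirroring the manual estimate above.
import re

# log4j timestamp prefix as it appears in these DataNode logs
TS = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")

def bracket_oom_events(log_lines):
    """Return (oom_line, last_ts_before, first_ts_after) tuples."""
    results, pending, last_ts = [], [], None
    for line in log_lines:
        m = TS.match(line)
        if m:
            ts = m.group(0)
            # a fresh timestamp closes the window for any pending OOM lines
            for oom in pending:
                results.append((oom, last_ts, ts))
            pending, last_ts = [], ts
        elif "OutOfMemoryError" in line:
            pending.append(line.strip())
    # OOM lines with no later timestamp get an open-ended window
    results.extend((oom, last_ts, None) for oom in pending)
    return results
```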

Remediation

For now, increasing the DataNode's memory allocation resolved the problem, but why it happened in the first place remains unknown. The affected version is Cloudera CDH 4.3.0. For the time being, we plan to keep monitoring DataNode liveness using the hadoop dfsadmin -report command.
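For reference, the heap increase can be done along these lines. The file path and the 2 GB value below are assumptions for illustration (a CDH4 package layout is assumed); the exact location of `hadoop-env.sh` and the right heap size depend on the install.

```shell
# /etc/hadoop/conf/hadoop-env.sh  (path and -Xmx value are assumptions;
# adjust to your install and hardware)
# Raise the DataNode JVM heap to relieve "GC overhead limit exceeded" pressure.
export HADOOP_DATANODE_OPTS="-Xmx2048m $HADOOP_DATANODE_OPTS"
```

The DataNode service must be restarted for the new heap setting to take effect.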
