The total size is 572GB across 36 gzipped FASTQ files.
Now we want to put this dataset into HDFS with a replication factor of 3. How much space will it actually use?
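Before copying anything we can confirm the cluster's default replication factor (a quick sanity check, assuming the client is configured against this cluster):
# print the effective value of dfs.replication; it should be 3 here
$hdfs getconf -confKey dfs.replication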
#before:
$hdfs dfsadmin -report | less
Configured Capacity: 39378725642240 (35.81 TB)
Present Capacity: 36885873225728 (33.55 TB)
DFS Remaining: 28098049122304 (25.56 TB)
DFS Used: 8787824103424 (7.99 TB)
DFS Used%: 23.82%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
$time hdfs dfs -put NA12878 /fastq/
real 113m45.058s
user 32m0.252s
sys 23m34.897s
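Once the copy completes, a couple of standard commands can verify the result; /fastq/NA12878 below is simply the target path used above, and the exact columns printed by -du depend on the Hadoop release:
# logical size of the uploaded files (newer releases also print the raw space consumed across all replicas)
$hdfs dfs -du -s -h /fastq/NA12878
# block-level health check; each block should be listed with its replication count (repl=3)
$hdfs fsck /fastq/NA12878 -files -blocks | less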
#after:
$hdfs dfsadmin -report | less
Present Capacity: 36885873225728 (33.55 TB)
DFS Remaining: 26241290846208 (23.87 TB)
DFS Used: 10644582379520 (9.68 TB)
DFS Used%: 28.86%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Comparing before and after, we can see the 572GB dataset actually uses 1.69TB (9.68TB - 7.99TB) of HDFS space. That is roughly 3x the raw size, matching the "dfs.replication" value of 3 in "hdfs-site.xml".
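For reference, this is the property that drives the 3x multiplier; a minimal excerpt of hdfs-site.xml (the real file will contain many other settings):
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
The arithmetic is consistent: 572GB × 3 = 1,716GB, or about 1.68TB, which is in line with the observed 1.69TB increase in DFS Used.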