Friday, April 10, 2015

How much space does a file use in HDFS?

I have downloaded NA12878 from http://www.illumina.com/platinumgenomes/.

The total size is 572GB, spread across 36 gzipped FASTQ files.

Now we want to put this dataset into HDFS with a replication factor of 3. How much space will it use?
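As a quick sanity check before copying, we can read the effective replication factor back from the client configuration. This is a minimal check using the standard hdfs getconf command; the value comes from the dfs.replication property (3 is also the HDFS default):

$ hdfs getconf -confKey dfs.replication
3

The same setting lives in hdfs-site.xml as:

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>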

#before:
$ hdfs dfsadmin -report | less

Configured Capacity: 39378725642240 (35.81 TB)
Present Capacity: 36885873225728 (33.55 TB)
DFS Remaining: 28098049122304 (25.56 TB)
DFS Used: 8787824103424 (7.99 TB)
DFS Used%: 23.82%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0


$ time hdfs dfs -put NA12878 /fastq/

real    113m45.058s
user    32m0.252s
sys     23m34.897s
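To confirm the files really landed with three replicas each, the per-file replication count can be read back. A small check assuming the destination path used above (the %r format of hdfs dfs -stat prints the replication factor, so every file should report 3):

$ hdfs dfs -stat %r /fastq/NA12878/*
3
3
...

Note that hdfs dfs -du -s -h /fastq/NA12878 would still report the logical size of roughly 572GB, since -du counts each file once rather than per replica (newer Hadoop releases add a second column with the replicated disk usage).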


#after:
$ hdfs dfsadmin -report | less

Configured Capacity: 39378725642240 (35.81 TB)
Present Capacity: 36885873225728 (33.55 TB)
DFS Remaining: 26241290846208 (23.87 TB)
DFS Used: 10644582379520 (9.68 TB)
DFS Used%: 28.86%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0


Comparing before and after, we can see the 572GB dataset actually uses 1.69TB (9.68TB - 7.99TB) of HDFS space. That is 3x the original size, exactly the dfs.replication factor set in hdfs-site.xml.
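The raw byte counts in the two reports let us verify this with plain shell arithmetic (the numbers are the DFS Used values above):

$ echo $((10644582379520 - 8787824103424))
1856758276096

That is the 1.69TB delta; dividing by 3 gives about 576GB per copy, in line with the ~572GB of source files. The small excess is plausibly the per-block checksum files HDFS stores alongside the data (4 bytes of CRC per 512 bytes by default, roughly 0.8% overhead), which are counted in DFS Used.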


