hadoop-common-user mailing list archives

From Aaron Turner <synfina...@gmail.com>
Subject Hadoop archives (.har) are really really slow
Date Mon, 15 Aug 2016 17:00:35 GMT
Basically I want to list all the files in a .har file and compare the
file list/sizes to an existing directory in HDFS.  The problem is that
running commands like: hdfs dfs -ls -R <path to har file> is orders of
magnitude slower than running the same command against a live HDFS
file system.

How much slower?  I estimate it will take ~19 days to list all
the files in 250TB worth of content spread across 2 .har files.

Is this normal?  Can I do this faster (e.g. by writing a map/reduce job)?
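
For reference, the comparison step itself could look roughly like the
sketch below once both listings have been captured to text. It does not
make the .har listing any faster. It assumes the usual 8-column output of
hdfs dfs -ls -R (permissions, replication, owner, group, size, date, time,
path); the function names and the root prefixes passed in are hypothetical,
chosen only for illustration.

```python
def parse_ls_line(line):
    """Return (path, size) for a file entry, or None for anything else.

    Assumes the 8-column `hdfs dfs -ls -R` layout:
    perms, replication, owner, group, size, date, time, path.
    """
    fields = line.strip().split(None, 7)
    # Skip directories ("d..."), "Found N items" headers, and blank lines.
    if len(fields) < 8 or not fields[0].startswith("-"):
        return None
    return fields[7], int(fields[4])


def load_listing(lines, root):
    """Map each file path (relative to `root`) to its size in bytes."""
    prefix = root.rstrip("/") + "/"
    sizes = {}
    for line in lines:
        entry = parse_ls_line(line)
        if entry is None:
            continue
        path, size = entry
        if path.startswith(prefix):
            sizes[path[len(prefix):]] = size
    return sizes


def compare_listings(har_sizes, live_sizes):
    """Return (missing_from_har, extra_in_har, size_mismatches)."""
    missing = sorted(set(live_sizes) - set(har_sizes))
    extra = sorted(set(har_sizes) - set(live_sizes))
    mismatched = sorted(
        p for p in har_sizes.keys() & live_sizes.keys()
        if har_sizes[p] != live_sizes[p]
    )
    return missing, extra, mismatched
```

Each listing would be read line by line from a saved file and passed to
load_listing along with the prefix to strip (the .har path for one, the
live directory for the other), so that the two sides are keyed by the same
relative paths before compare_listings is called.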

--
Aaron Turner
https://synfin.net/         Twitter: @synfinatic
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
