hadoop-user mailing list archives

From Aaron Turner <synfina...@gmail.com>
Subject Re: Hadoop archives (.har) are really really slow
Date Mon, 15 Aug 2016 19:38:35 GMT
I can list all the files out of HDFS in a few hours, not a day. Listing the files in a single
directory in the har takes ~50 min.  Honestly I'd be happy with only a 10x performance hit.
I'm seeing closer to 100-150x. 
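
For the comparison itself (file list and sizes in the har vs. the live directory),
one option is to dump each listing to a text file once with hdfs dfs -ls -R and diff
them offline. This does not make the har listing any faster, it only avoids having
to run it more than once. Below is a rough sketch in Python, assuming the usual
8-column -ls output (permissions, replication, owner, group, size, date, time, path).
The function names, sample paths, owner and sizes are made up for illustration.

```python
from typing import Dict, List, Tuple


def parse_ls_r(text: str, root: str) -> Dict[str, int]:
    """Map each file path (relative to root) to its size in bytes."""
    sizes: Dict[str, int] = {}
    prefix = root.rstrip("/") + "/"
    for line in text.splitlines():
        # Assumed columns: perms, replication, owner, group, size, date, time, path.
        # maxsplit=7 keeps paths that contain spaces in one piece.
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # skip "Found N items", blank lines and directories
        path = fields[7]
        if path.startswith(prefix):
            sizes[path[len(prefix):]] = int(fields[4])
    return sizes


def compare(a: Dict[str, int], b: Dict[str, int]) -> Tuple[List[str], List[str], List[str]]:
    """Return (paths only in a, paths only in b, paths whose sizes differ)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differ = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_a, only_b, differ


# Hypothetical sample listings; real ones would come from files written by
# "hdfs dfs -ls -R <path> > listing.txt".
LIVE_SAMPLE = """Found 1 items
drwxr-xr-x   - user hadoop          0 2016-08-01 10:00 /data/logs
-rw-r--r--   3 user hadoop       1024 2016-08-01 10:00 /data/logs/a.log
-rw-r--r--   3 user hadoop       2048 2016-08-01 10:01 /data/logs/b.log
"""

HAR_SAMPLE = """drwxr-xr-x   - user hadoop          0 2016-08-01 10:00 har:///archive/data.har/logs
-rw-r--r--   3 user hadoop       1024 2016-08-01 10:00 har:///archive/data.har/logs/a.log
-rw-r--r--   3 user hadoop       2000 2016-08-01 10:01 har:///archive/data.har/logs/b.log
-rw-r--r--   3 user hadoop       4096 2016-08-01 10:02 har:///archive/data.har/logs/c.log
"""

if __name__ == "__main__":
    live = parse_ls_r(LIVE_SAMPLE, "/data")
    har = parse_ls_r(HAR_SAMPLE, "har:///archive/data.har")
    only_live, only_har, differ = compare(live, har)
    print("only in live:", only_live)
    print("only in har:", only_har)
    print("size mismatch:", differ)
```

Keying on the path relative to each root is what lets a har:// listing line up
against a plain HDFS one.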

-Aaron


> On Aug 15, 2016, at 12:33 PM, Tsz Wo Sze <szetszwo@yahoo.com> wrote:
> 
> ls over files in har:// may be 10 times slower than ls over regular files.  It does not
> sound normal unless it would take ~1 day to list out all 250TB of files when they are stored
> as regular files.
> Tsz-Wo
> 
> 
> On Monday, August 15, 2016 10:01 AM, Aaron Turner <synfinatic@gmail.com> wrote:
> 
> 
> Basically I want to list all the files in a .har file and compare the
> file list/sizes to an existing directory in HDFS.  The problem is that
> running commands like: hdfs dfs -ls -R <path to har file> is orders of
> magnitude slower than running the same command against a live HDFS
> file system.
> 
> How much slower?  I've calculated it will take ~19 days to list all
> the files in 250TB worth of content spread between 2 .har files.
> 
> Is this normal?  Can I do this faster (write a map/reduce job, etc.)?
> 
> --
> Aaron Turner
> https://synfin.net/         Twitter: @synfinatic
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
>     -- Benjamin Franklin
> 
