hadoop-common-user mailing list archives

From Raghu Angadi <rang...@yahoo-inc.com>
Subject Re: HDFS Random Access
Date Sat, 27 Jun 2009 16:41:37 GMT

Yes, FSDataInputStream allows random access. There are two ways to read x 
bytes at a position p:
1) in.seek(p); in.read(buf, 0, x);
2) in.read(p, buf, 0, x);
These two have slightly different semantics. The second one is preferred 
and is easier for HDFS to optimize further.
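
To illustrate how the two styles differ, here is a minimal sketch on a local file. It is an analogy only: it uses the JDK's FileChannel as a stand-in for FSDataInputStream, and the class and method names (PositionedReadSketch, seekThenRead, positionedRead, demo) are made up for this example. The point it shows is that seek-then-read moves the shared position of the open stream, while a positioned read leaves it alone, which is what Hadoop's positioned read is documented to do as well.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Local-file analogy only: FileChannel stands in for FSDataInputStream.
final class PositionedReadSketch {

    // Style 1: move the channel's shared position to p, then read from it.
    // Returns the channel position after the read.
    static long seekThenRead(FileChannel ch, long p, ByteBuffer buf) {
        try {
            ch.position(p);
            while (buf.hasRemaining()) {
                if (ch.read(buf) < 0) break; // end of file
            }
            return ch.position();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Style 2: positioned read. The offset is passed with each call and
    // the channel's own position is never touched.
    // Returns the channel position after the read.
    static long positionedRead(FileChannel ch, long p, ByteBuffer buf) {
        try {
            long pos = p;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) break; // end of file
                pos += n;
            }
            return ch.position();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads 16 bytes at offset 100 of a 1024-byte temp file with each style
    // and reports where the channel position ended up:
    // { after positioned read, after seek-then-read }.
    static long[] demo() {
        try {
            Path tmp = Files.createTempFile("random-access", ".bin");
            try {
                Files.write(tmp, new byte[1024]);
                try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                    long afterPositioned = positionedRead(ch, 100, ByteBuffer.allocate(16));
                    long afterSeek = seekThenRead(ch, 100, ByteBuffer.allocate(16));
                    return new long[] { afterPositioned, afterSeek };
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the positioned style carries no shared state between calls, several threads can use one open stream without coordinating seeks, which is one reason it tends to suit random-access workloads.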

Random access performance should be pretty good with HDFS, and it is 
steadily gaining users and thus importance. HBase is one of those users.

Just yesterday I attached a benchmark and comparisons with random access 
on a native filesystem to https://issues.apache.org/jira/browse/HDFS-236 .

As of now, the overhead is about 2 ms on average, on top of the 9-10 ms 
that a native read takes. There are a few fairly simple fixes possible to 
reduce this gap.

I think getFileStatus() is the way to find the length, though there 
might have been a call added to FSDataInputStream recently. I am not sure.
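
For reference, a minimal sketch of the getFileStatus() approach, assuming hadoop-common is on the classpath. The class and method names (TailReader, readTail) are hypothetical and only for illustration; the Hadoop calls used are FileSystem.getFileStatus(), FileStatus.getLen(), FileSystem.open() and the positioned readFully() on FSDataInputStream. It looks up the file length and then reads the last x bytes with a single positioned read.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not part of Hadoop.
public class TailReader {

    // Returns the last x bytes of the file (or the whole file if it is shorter).
    static byte[] readTail(FileSystem fs, Path file, int x) throws IOException {
        // The length comes from the file's metadata, not from the stream.
        long len = fs.getFileStatus(file).getLen();
        int n = (int) Math.min((long) x, len);
        byte[] buf = new byte[n];
        try (FSDataInputStream in = fs.open(file)) {
            // Positioned read: fills buf from offset (len - n) without
            // changing the stream's current position.
            in.readFully(len - n, buf, 0, n);
        }
        return buf;
    }
}
```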

tsuraan wrote:
> All the documentation for HDFS says that it's for large streaming
> jobs, but I couldn't find an explicit answer to this, so I'll try
> asking here.  How is HDFS's random seek performance within an
> FSDataInputStream?  I use lucene with a lot of indices (potentially
> thousands), so I was thinking of putting them into HDFS and
> reimplementing my search as a Hadoop map-reduce.  I've noticed that
> lucene tends to do a bit of random seeking when searching though; I
> don't believe that it guarantees that all seeks be to increasing file
> positions either.
> Would HDFS be a bad fit for an access pattern that involves seeks to
> random positions within a stream?
> Also, is getFileStatus the typical way of getting the length of a file
> in HDFS, or is there some method on FSDataInputStream that I'm not
> seeing?
> Please cc: me on any reply; I'm not on the hadoop list.  Thanks!
