hadoop-common-dev mailing list archives

From "Segel, Mike" <mse...@navteq.com>
Subject RE: Hadoop use direct I/O in Linux?
Date Tue, 04 Jan 2011 15:12:30 GMT
While this is an interesting topic for debate, I think it's a moot point.
A lot of DBAs (especially Informix DBAs) don't agree with Linus. (I'm referring to an earlier
post in this thread that quoted Linus Torvalds.) Direct I/O is a good thing. But
if Linus is removing it from Linux...

But with respect to Hadoop, disk I/O shouldn't be a major concern. If it were, why
isn't anyone pushing the use of SSDs? Or, if those are too expensive for your budget, why
not SAS drives that spin at 15K RPM?
OK, those questions are rhetorical. The simple answer is that if you're I/O bound, you add
more nodes with more disks to further distribute the load, right?

Also, I may be wrong, but do all of the operating systems that Hadoop runs on support direct
I/O? And do they handle it in a common way? If not, won't you end up with machine/OS-specific classes?

IMHO there are other features that don't yet exist in Hadoop/HBase that would yield a better return on the effort.

Ok, so I may be way off base, so I'll shut up now... ;-P


-----Original Message-----
From: Christopher Smith [mailto:cbsmith@gmail.com] 
Sent: Tuesday, January 04, 2011 8:56 AM
To: common-dev@hadoop.apache.org
Subject: Re: Hadoop use direct I/O in Linux?

On Mon, Jan 3, 2011 at 7:15 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:

> The I/O pattern isn't truly random.  To convert from physicist terms to CS
> terms, the application is iterating through the rows of a column-oriented
> store, reading out somewhere between 1 and 10% of the columns.  The twist is
> that the columns are compressed, meaning the size of a set of rows on disk
> is variable.

We're getting pretty far off topic here, but this is an interesting problem.
It *sounds* to me like a "compressed bitmap index" problem, possibly with
Bloom filters for joins (basically what HBase/Cassandra/Hypertable get into,
or, in a less distributed case, MonetDB). Is that on the money?
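For readers who haven't run into them, the Bloom filters mentioned above can be sketched in a few lines. This is an illustrative toy, not HBase's implementation; the two-hash probe scheme and the sizing are assumptions chosen to keep the example short.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash probes into a bit array.
// May report false positives, but never false negatives -- which is
// why it works as a cheap pre-filter for joins.
public class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int k;

    public BloomSketch(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive k probe positions from two base hashes (a common trick;
    // real systems use tuned hash families).
    private int probe(Object key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(Object key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    public boolean mightContain(Object key) {
        for (int i = 0; i < k; i++)
            if (!bits.get(probe(key, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomSketch f = new BloomSketch(1 << 16, 3);
        f.add("row-42");
        System.out.println(f.mightContain("row-42")); // true (never a false negative)
        System.out.println(f.mightContain("row-43")); // almost certainly false
    }
}
```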

>  This prevents any sort of OS page cache stride detection from helping -
> the OS sees everything as random.

It seems, though, that if you organized the data a certain way, the OS page
cache could help.

>   However, the application also has an index of where each row is located,
> meaning if it knows the active set of columns, it can predict the reads the
> client will perform and do a read-ahead.

Yes, this is the kind of access pattern where O_DIRECT might help, although I'd
hope that in this kind of circumstance the OS buffer cache would mostly give up
anyway and hand as much of the available RAM as possible to the app. In
that case, memory-mapped files with a thread doing a bit of read-ahead would
seem not much slower than using O_DIRECT.
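The mmap-plus-read-ahead-thread idea can be sketched as below. This is a minimal illustration, not Hadoop code: the scratch file and the `offsets` array are stand-ins for the application's real data file and its row-location index.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: a prefetch thread touches predicted offsets in a memory-mapped
// file so the kernel faults those pages in before the reader needs them.
public class ReadAheadSketch {
    static int run() throws Exception {
        // Scratch file so the sketch is self-contained (zeros stand in
        // for compressed column data).
        File tmp = File.createTempFile("columns", ".dat");
        tmp.deleteOnExit();
        try (RandomAccessFile f = new RandomAccessFile(tmp, "rw")) {
            f.setLength(1 << 20); // 1 MiB
        }

        long[] offsets = {0, 4096, 131072}; // hypothetical row-location index

        try (RandomAccessFile f = new RandomAccessFile(tmp, "r");
             FileChannel ch = f.getChannel()) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

            // Prefetcher gets its own view (independent position/limit)
            // and touches one byte per predicted row.
            ByteBuffer view = map.duplicate();
            Thread prefetcher = new Thread(() -> {
                for (long off : offsets) view.get((int) off);
            });
            prefetcher.start();

            // Reader consumes the same rows; the pages are likely warm.
            int sum = 0;
            for (long off : offsets) sum += map.get((int) off);
            prefetcher.join();
            return sum; // 0 here, since the scratch file is all zeros
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("checksum: " + run());
    }
}
```

In a real deployment the prefetcher would run a few rows ahead of the reader rather than all at once, but the division of labor is the same.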

That said, I have to wonder how often this problem devolves into a straightforward
column scan. I mean, with a 1-10% hit rate, you need SSD seek times
for it to make sense to seek to specific records vs. just scanning through
the whole column; or, to put it another way: "disk is the new tape". ;-)
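The back-of-the-envelope math behind "disk is the new tape" is easy to run. All the numbers below are assumptions (10 ms average seek, 100 MB/s sequential throughput, 1 KiB records, a ~10 GB column), not measurements from the system under discussion:

```java
// Break-even selectivity: below this hit rate, seeking to individual
// records is faster than scanning the whole column sequentially.
public class SeekVsScan {
    static double breakEvenHitRate(double seekSec, double scanBytesPerSec,
                                   double recordBytes, long totalRecords) {
        double scanSec = totalRecords * recordBytes / scanBytesPerSec; // full scan time
        // hitRate * totalRecords seeks == one full scan, solved for hitRate
        // (transfer time per record is ignored, which favors seeking).
        return scanSec / (totalRecords * seekSec);
    }

    public static void main(String[] args) {
        // Assumed numbers: 10 ms seek, 100 MB/s sequential, 1 KiB records, 10M rows.
        double r = breakEvenHitRate(0.010, 100e6, 1024, 10_000_000L);
        System.out.printf("break-even hit rate: %.4f%%%n", r * 100); // ~0.10%
    }
}
```

With these assumptions the break-even is around 0.1%, so at a 1-10% hit rate on spinning disk, the sequential scan wins by an order of magnitude or more.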

> Some days, it does feel like "building a better TCP using UDP".  However,
> we got a 3x performance improvement by building it (and multiplying by
> 10-15k cores for just our LHC experiment, that's real money!), so it's a
> particular monstrosity we are stuck with.

It sure sounds like a problem better suited to C++ than Java, though. What
benefits do you get from doing all this on a JVM?


