From: Christopher Smith
Date: Tue, 4 Jan 2011 06:55:40 -0800 (PST)
Subject: Re: Hadoop use direct I/O in Linux?
To: common-dev@hadoop.apache.org

On Mon, Jan 3, 2011 at 7:15 PM, Brian Bockelman wrote:

> The I/O pattern isn't truly random. To convert from physicist terms to CS
> terms, the application is iterating through the rows of a column-oriented
> store, reading out somewhere between 1 and 10% of the columns. The twist
> is that the columns are compressed, meaning the size of a set of rows on
> disk is variable.

We're getting pretty far off topic here, but this is an interesting
problem. It *sounds* to me like a "compressed bitmap index" problem,
possibly with Bloom filters for joins (basically what
HBase/Cassandra/Hypertable get into, or, in a less distributed case,
MonetDB). Is that on the money?

> This prevents any sort of OS page cache stride detection from helping -
> the OS sees everything as random.

It seems, though, that if you organized the data a certain way, the OS
page cache could help.
> However, the application also has an index of where each row is located,
> meaning if it knows the active set of columns, it can predict the reads
> the client will perform and do a read-ahead.

Yes, this is the kind of situation where O_DIRECT might help, although I'd
hope that under these circumstances the OS buffer cache would mostly give
up anyway and hand as much of the available RAM as possible to the app. In
that case, memory-mapped files with a thread doing a bit of read-ahead
wouldn't seem much slower than using O_DIRECT.

That said, I have to wonder how often this problem devolves into a
straightforward column scan. I mean, with a 1-10% hit rate, you need SSD
seek times for it to make sense to seek to specific records vs. just
scanning through the whole column. Or, to put it another way: "disk is the
new tape". ;-)

> Some days, it does feel like "building a better TCP using UDP". However,
> we got a 3x performance improvement by building it (and multiplying by
> 10-15k cores for just our LHC experiment, that's real money!), so it's a
> particular monstrosity we are stuck with.

It sure sounds like a problem better suited to C++ than Java, though. What
benefits do you gain from doing all this with a JVM?

--
Chris