From: Christopher Smith
Date: Tue, 4 Jan 2011 06:55:40 -0800 (PST)
Subject: Re: Hadoop use direct I/O in Linux?
To: common-dev@hadoop.apache.org

On Mon, Jan 3, 2011 at 7:15 PM, Brian Bockelman wrote:

> The I/O pattern isn't truly random. To convert from physicist terms to CS
> terms, the application is iterating through the rows of a column-oriented
> store, reading out somewhere between 1 and 10% of the columns. The twist
> is that the columns are compressed, meaning the size of a set of rows on
> disk is variable.

We're getting pretty far off topic here, but this is an interesting
problem. It *sounds* to me like a "compressed bitmap index" problem,
possibly with Bloom filters for joins (basically what
HBase/Cassandra/Hypertable get into, or, in a less distributed case,
MonetDB). Is that on the money?

> This prevents any sort of OS page cache stride detection from helping -
> the OS sees everything as random.

It seems, though, that if you organized the data a certain way, the OS
page cache could help.
> However, the application also has an index of where each row is located,
> meaning if it knows the active set of columns, it can predict the reads
> the client will perform and do a read-ahead.

Yes, this is the kind of situation where O_DIRECT might help, although I'd
hope that under these circumstances the OS buffer cache would mostly give
up anyway and hand as much of the available RAM as possible to the app. In
that case, memory-mapped files with a thread doing a bit of read-ahead
wouldn't seem much slower than using O_DIRECT.

That said, I have to wonder how often this problem devolves into a
straightforward column scan. I mean, with a 1-10% hit rate, you need SSD
seek times for it to make sense to seek to specific records vs. just
scanning through the whole column. Or, to put it another way: "disk is the
new tape". ;-)

> Some days, it does feel like "building a better TCP using UDP". However,
> we got a 3x performance improvement by building it (and multiplying by
> 10-15k cores for just our LHC experiment, that's real money!), so it's a
> particular monstrosity we are stuck with.

It sure sounds like a problem better suited to C++ than Java, though. What
benefits do you gain from doing all this with a JVM?

--
Chris