From: stack
Date: Tue, 11 Mar 2008 09:04:55 -0700
To: core-user@hadoop.apache.org
Subject: Re: File Per Column in Hadoop
Message-ID: <47D6ADA7.4070300@duboce.net>

Richard K. Turner wrote:
> To get the data in separate files, I would need to put each column in
> its own column family, using a row id as the key. Each column family
> will end up as a separate file. HBase will sort each column family
> independently, so if I had 100 columns I would be doing 100 times more
> sorting than I need to do. I believe all of this sorting would make
> insert rates really low.
>
> HBase supports an arbitrary number of columns per row in a column
> family. To do this, each row value holds column-name = value pairs.
> For my case this is unnecessary overhead, as I would only have one
> column name per column family.

For sure there is a cost to keeping columns in a manner that facilitates
row-based accesses -- fatter keys and sort/compactions -- and that allows
on-the-fly cell-level updates. If your access pattern is purely columnar
and your data is static, then for sure there is little sense paying that
overhead.

> It seems that when a map-reduce job is run against HBase, it reads its
> input through the HBase server. I suspect reading gzip files off local
> disk is much faster, but I am not sure.

Yes. We're doing our best to minimize the tax you pay accessing via
hbase, but we have some ways to go yet.
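To make the one-family-per-column layout discussed above concrete, here
is a rough sketch of the table setup. It uses a later HBase admin API
for legibility (the client API of this thread's era differed in class
names and key types), and the table and family names are made up, so
treat it as illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class OneFamilyPerColumn {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical table for a three-column CSV file; the row id
        // is the key, and every CSV column gets its own family.
        HTableDescriptor desc =
            new HTableDescriptor(TableName.valueOf("csv_file1"));
        desc.addFamily(new HColumnDescriptor("col1"));
        desc.addFamily(new HColumnDescriptor("col2"));
        desc.addFamily(new HColumnDescriptor("col3"));
        admin.createTable(desc);
        admin.close();

        // Each family is stored, flushed, and compacted independently,
        // which is where the "100 columns means 100 sorts" concern
        // above comes from.
      }
    }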
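As for the direct-from-files path Richard suspects is faster: Hadoop's
TextInputFormat decompresses .gz inputs transparently, so a plain
map-reduce job can read gzipped CSV without going through HBase at all.
A minimal sketch against the old org.apache.hadoop.mapred API (the
column index, and the assumption that the first CSV field is the row
id, are made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emits (row id, one column value) per line of gzipped CSV input.
    // TextInputFormat hands us decompressed lines, one map() call each.
    public class ColumnMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private static final int COLUMN = 2; // hypothetical column of interest

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String[] fields = line.toString().split(",");
        if (fields.length > COLUMN) {
          // First field is assumed to be the row id.
          out.collect(new Text(fields[0]), new Text(fields[COLUMN]));
        }
      }
    }

One caveat with this route: gzipped files are not splittable, so each
file is processed by a single map task.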
You can get some sense of that overhead from the table at the end of
this page, http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation,
where we compare accesses that go direct against (non-compressed)
mapfiles and then the same accesses made via hbase.

Yours,
St.Ack

> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Mon 3/10/2008 2:57 PM
> To: core-user@hadoop.apache.org
> Subject: Re: File Per Column in Hadoop
>
> Have you looked at HBase? It looks like you are trying to reimplement
> a bunch of it.
>
> On 3/10/08 11:01 AM, "Richard K. Turner" wrote:
>
>> ... [storing data in columns is nice] ... I would also do the same
>> for dir csv_file2. Does anyone know how to do this in Hadoop?
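Returning to the mapfile comparison St.Ack points at above: for
contrast with the via-hbase path, a minimal sketch of a direct lookup
against a MapFile on HDFS using Hadoop's MapFile.Reader (the path and
key here are hypothetical). The hbase path routes the same lookup
through an HTable client and a region server, which is the tax the
PerformanceEvaluation table measures:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    // Look a row up directly in a MapFile on HDFS -- no intermediary
    // servers involved, which is the baseline in the comparison.
    public class DirectMapFileRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader =
            new MapFile.Reader(fs, "/data/csv_file1.mapfile", conf);
        Text value = new Text();
        reader.get(new Text("row-00042"), value); // hypothetical key
        System.out.println(value);
        reader.close();
      }
    }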