Subject: Re: How to design a data warehouse in HBase?
From: "Kevin O'dell" <kevin.odell@cloudera.com>
To: user@hbase.apache.org
Date: Thu, 13 Dec 2012 11:42:08 -0500

Correct, Impala relies on the Hive Metastore.

On Thu, Dec 13, 2012 at 11:38 AM, Manoj Babu wrote:

> Kevin,
>
> Impala requires Hive, right? So to get the advantages of Impala, do we
> need to go with Hive?
>
> Cheers!
> Manoj.
>
> On Thu, Dec 13, 2012 at 9:03 PM, Mohammad Tariq wrote:
>
> > Thank you so much for the clarification, Kevin.
> >
> > Regards,
> > Mohammad Tariq
> >
> > On Thu, Dec 13, 2012 at 9:00 PM, Kevin O'dell wrote:
> >
> > > Mohammad,
> > >
> > > I am not sure you are thinking about Impala correctly. It still uses
> > > HDFS, so your data increasing over time is fine. You are not going to
> > > need to tune for special CPU, storage, or network hardware. Typically
> > > with Impala you will be bound at the disks, as it works off of data
> > > locality. You can also use Snappy, GZip, or BZip2 compression to help
> > > with the amount of data you are storing. You will not need to upgrade
> > > your hardware frequently.
> > >
> > > On Thu, Dec 13, 2012 at 10:06 AM, Mohammad Tariq wrote:
> > >
> > > > Oh yes, Impala. Good point by Kevin.
> > > > Kevin: Would it be appropriate to say that I should go for Impala
> > > > if my data is not going to increase dramatically over time, or if I
> > > > have to work on only a subset of my big data? Since Impala uses MPP,
> > > > it may require specialized hardware tuned for CPU, storage, and
> > > > network performance for better results, which could become a problem
> > > > if I have to upgrade the hardware frequently because of the growing
> > > > data.
> > > >
> > > > Regards,
> > > > Mohammad Tariq
> > > >
> > > > On Thu, Dec 13, 2012 at 8:17 PM, Kevin O'dell wrote:
> > > >
> > > > > To Mohammad's point: you can use HBase for quick scans of the
> > > > > data, Hive for your longer-running jobs, and Impala over the two
> > > > > for quick ad hoc searches.
> > > > >
> > > > > On Thu, Dec 13, 2012 at 9:44 AM, Mohammad Tariq wrote:
> > > > >
> > > > > > I am not saying HBase is not good. My point was to consider
> > > > > > Hive as well. Think about the approach keeping both tools in
> > > > > > mind, and then decide. I just offered an option given the
> > > > > > built-in warehousing features Hive provides. I would like to
> > > > > > add one more point here: you can map your HBase tables to Hive.
> > > > > >
> > > > > > Regards,
> > > > > > Mohammad Tariq
> > > > > >
> > > > > > On Thu, Dec 13, 2012 at 7:58 PM, bigdata wrote:
> > > > > >
> > > > > > > Hi, Tariq
> > > > > > > Thanks for your feedback. Actually, we now have two ways to
> > > > > > > reach the target: by Hive and by HBase. Could you tell me why
> > > > > > > HBase is not good for my requirements, or what the problem is
> > > > > > > with my solution?
> > > > > > > Thanks.
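The table mapping Mohammad mentions (exposing an HBase table as a Hive table) essentially projects HBase cells, a rowkey plus family:qualifier/value pairs, into flat relational rows. Below is a minimal in-memory Python sketch of that projection only; the table data and column names are invented, and the real mechanism is Hive's HBase storage handler with its hbase.columns.mapping property, not this code.

```python
# Each HBase row is a rowkey mapped to {"family:qualifier": value} cells.
# The device-login data here is hypothetical, loosely following the thread.
hbase_rows = {
    "abcdef": {"login:last_time": "2012-12-12 19:12:12"},
    "defdaf": {"login:last_time": "2012-12-13 10:10:10"},
}

# A Hive-over-HBase mapping declares which HBase column feeds which Hive
# column; ":key" stands for the rowkey itself, as in hbase.columns.mapping.
mapping = {":key": "device_id", "login:last_time": "last_login"}

def to_hive_rows(rows, mapping):
    """Project HBase cells into flat rows, one dict per rowkey."""
    out = []
    for rowkey, cells in sorted(rows.items()):
        rec = {}
        for hbase_col, hive_col in mapping.items():
            rec[hive_col] = rowkey if hbase_col == ":key" else cells.get(hbase_col)
        out.append(rec)
    return out

flat = to_hive_rows(hbase_rows, mapping)
```

Once mapped this way, Hive queries (counts, group-bys, joins) can run over the flat rows while the data itself stays in HBase.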
> > > > > > > > From: dontariq@gmail.com
> > > > > > > > Date: Thu, 13 Dec 2012 15:43:25 +0530
> > > > > > > > Subject: Re: How to design a data warehouse in HBase?
> > > > > > > > To: user@hbase.apache.org
> > > > > > > >
> > > > > > > > The two have different purposes. People normally say that
> > > > > > > > Hive is slow just because it uses MapReduce under the hood.
> > > > > > > > And I am sure that if the data stored in HBase were very
> > > > > > > > large, nobody would write sequential programs around Get or
> > > > > > > > Scan; instead they would write MapReduce jobs or something
> > > > > > > > similar.
> > > > > > > >
> > > > > > > > My point is that nothing can be 100% real time. Is that what
> > > > > > > > you want? If so, I would never suggest Hadoop in the first
> > > > > > > > place, as it is a batch processing system and cannot be used
> > > > > > > > like an OLTP system unless you build some additional
> > > > > > > > machinery. Since you are talking about a warehouse, I assume
> > > > > > > > you are going to store and process gigantic amounts of data.
> > > > > > > > That is the only reason I suggested Hive.
> > > > > > > >
> > > > > > > > The whole point is that no single tool is a solution for
> > > > > > > > everything; one size does not fit all. First, we need to
> > > > > > > > analyze our particular use case. The person who says Hive is
> > > > > > > > slow might be correct, but only for his scenario.
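Mohammad's point about writing MapReduce jobs rather than one sequential Scan can be sketched as follows: each mapper counts rows in its own region in parallel, and a reduce step merges the partial counts. This is plain Python illustrating the pattern only, not the Hadoop API, and the region contents are invented.

```python
from functools import reduce

# Invented stand-ins for HBase regions; each would live on a different
# region server and be scanned by its own mapper in parallel.
regions = [
    ["row1", "row2", "row3"],
    ["row4", "row5"],
]

def map_phase(region_rows):
    # A mapper scans only its local region and emits a partial count.
    return len(region_rows)

def reduce_phase(partial_counts):
    # The reducer merges the partial counts into the final answer.
    return reduce(lambda a, b: a + b, partial_counts, 0)

total = reduce_phase([map_phase(r) for r in regions])
```

The win over a single sequential Scan is that each mapper reads only local data, so the job's wall-clock time scales with the largest region rather than the whole table.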
> > > > > > > >
> > > > > > > > HTH
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Mohammad Tariq
> > > > > > > >
> > > > > > > > On Thu, Dec 13, 2012 at 3:17 PM, bigdata wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > > I have heard that Hive's performance is too low: it reads
> > > > > > > > > HDFS files and scans all the data to find a single record.
> > > > > > > > > Is that true? And is HBase much faster than Hive?
> > > > > > > > >
> > > > > > > > > > From: dontariq@gmail.com
> > > > > > > > > > Date: Thu, 13 Dec 2012 15:12:25 +0530
> > > > > > > > > > Subject: Re: How to design a data warehouse in HBase?
> > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > >
> > > > > > > > > > Hi there,
> > > > > > > > > >
> > > > > > > > > > If you are really planning a warehousing solution, then
> > > > > > > > > > I would suggest you have a look at Apache Hive. It
> > > > > > > > > > provides warehousing capabilities on top of a Hadoop
> > > > > > > > > > cluster, along with an SQL-like interface to the data
> > > > > > > > > > stored in your warehouse, which will be very helpful if
> > > > > > > > > > you are coming from an SQL background.
> > > > > > > > > >
> > > > > > > > > > HTH
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Mohammad Tariq
> > > > > > > > > >
> > > > > > > > > > On Thu, Dec 13, 2012 at 2:43 PM, bigdata wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks. I think a real example is better for me to
> > > > > > > > > > > understand your suggestions.
> > > > > > > > > > > Now I have a relational table:
> > > > > > > > > > >
> > > > > > > > > > >   ID  LoginTime            DeviceID
> > > > > > > > > > >   1   2012-12-12 12:12:12  abcdef
> > > > > > > > > > >   2   2012-12-12 19:12:12  abcdef
> > > > > > > > > > >   3   2012-12-13 10:10:10  defdaf
> > > > > > > > > > >
> > > > > > > > > > > There are several requirements against this table:
> > > > > > > > > > > 1. How many devices logged in on each day?
> > > > > > > > > > > 2. For one day, how many new devices logged in
> > > > > > > > > > >    (never logged in before)?
> > > > > > > > > > > 3. For one day, how many accumulated devices have
> > > > > > > > > > >    logged in?
> > > > > > > > > > > How can I design HBase tables to calculate these
> > > > > > > > > > > numbers? My current solution is:
> > > > > > > > > > >
> > > > > > > > > > > table A:
> > > > > > > > > > >   rowkey: date-deviceid
> > > > > > > > > > >   column family: login
> > > > > > > > > > >   column qualifiers: 2012-12-12 12:12:12 /
> > > > > > > > > > >   2012-12-12 19:12:12 / ...
> > > > > > > > > > > table B:
> > > > > > > > > > >   rowkey: deviceid
> > > > > > > > > > >   column family: null or any value
> > > > > > > > > > >
> > > > > > > > > > > For req #1, I can scan table A with a PrefixFilter on
> > > > > > > > > > > the rowkey to match one particular date and count the
> > > > > > > > > > > returned records. For req #2, I check table B for each
> > > > > > > > > > > deviceid and count the result. For req #3, I count
> > > > > > > > > > > table A with a prefix filter as in #1.
> > > > > > > > > > > Is this OK? Or is there a better solution?
> > > > > > > > > > > Thanks!!
> > > > > > > > > > >
> > > > > > > > > > > > CC: user@hbase.apache.org
> > > > > > > > > > > > From: michael_segel@hotmail.com
> > > > > > > > > > > > Subject: Re: How to design a data warehouse in HBase?
> > > > > > > > > > > > Date: Thu, 13 Dec 2012 08:43:31 +0000
> > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > >
> > > > > > > > > > > > You need to spend a bit of time on schema design.
> > > > > > > > > > > > You need to flatten your schema, and implement some
> > > > > > > > > > > > secondary indexing to improve join performance.
> > > > > > > > > > > >
> > > > > > > > > > > > It depends on what you want to do. There are other
> > > > > > > > > > > > options too.
> > > > > > > > > > > >
> > > > > > > > > > > > Sent from a remote device. Please excuse any typos...
> > > > > > > > > > > >
> > > > > > > > > > > > Mike Segel
> > > > > > > > > > > >
> > > > > > > > > > > > On Dec 13, 2012, at 7:09 AM, lars hofhansl wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > For OLAP-type queries you will generally be better
> > > > > > > > > > > > > off with a truly column-oriented database. You can
> > > > > > > > > > > > > probably shoehorn HBase into this, but it was not
> > > > > > > > > > > > > really designed with raw scan performance along
> > > > > > > > > > > > > single columns in mind.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ________________________________
> > > > > > > > > > > > > From: bigdata
> > > > > > > > > > > > > To: "user@hbase.apache.org"
> > > > > > > > > > > > > Sent: Wednesday, December 12, 2012 9:57 PM
> > > > > > > > > > > > > Subject: How to design a data warehouse in HBase?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Dear all,
> > > > > > > > > > > > > We have a traditional star-model data warehouse in
> > > > > > > > > > > > > an RDBMS, and now we want to move it to HBase.
> > > > > > > > > > > > > After studying HBase, I have learned that HBase is
> > > > > > > > > > > > > normally queried by rowkey:
> > > > > > > > > > > > > 1. full rowkey (fastest)
> > > > > > > > > > > > > 2. rowkey filter (fast)
> > > > > > > > > > > > > 3. column family/qualifier filter (slow)
> > > > > > > > > > > > > How can I design the HBase tables to implement
> > > > > > > > > > > > > warehouse functions such as:
> > > > > > > > > > > > > 1. Query by DimensionA
> > > > > > > > > > > > > 2. Query by DimensionA and DimensionB
> > > > > > > > > > > > > 3. Sum, count, distinct ...
> > > > > > > > > > > > > In my opinion, I would have to create several
> > > > > > > > > > > > > HBase tables with the different combinations of
> > > > > > > > > > > > > dimensions as the rowkeys. That solution would
> > > > > > > > > > > > > lead to huge data duplication. Are there any good
> > > > > > > > > > > > > suggestions for solving this?
> > > > > > > > > > > > > Thanks a lot!

-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera
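The date-deviceid rowkey design discussed in this thread can be illustrated with a sorted in-memory list standing in for HBase's lexicographically ordered rowkeys, where prefix_scan plays the role of a Scan with a PrefixFilter. This is a conceptual sketch, not the HBase client API; the helper names are invented, and the sample rows come from the example table upthread.

```python
from bisect import bisect_left, insort

table_a = []        # stand-in for table A: sorted "date-deviceid" rowkeys
table_b = set()     # stand-in for table B: every deviceid ever seen
first_logins = {}   # deviceid -> date of first login (supports req #2)

def record_login(date, device_id):
    """Insert one login event into both stand-in tables."""
    insort(table_a, date + "-" + device_id)
    if device_id not in table_b:
        table_b.add(device_id)
        first_logins[device_id] = date

def prefix_scan(prefix):
    """Emulate Scan + PrefixFilter: all rowkeys starting with prefix."""
    i = bisect_left(table_a, prefix)
    rows = []
    while i < len(table_a) and table_a[i].startswith(prefix):
        rows.append(table_a[i])
        i += 1
    return rows

def devices_on(date):
    """Req #1: distinct devices that logged in on the given day."""
    return len({row.rsplit("-", 1)[1] for row in prefix_scan(date)})

def new_devices_on(date):
    """Req #2: devices whose first-ever login was on the given day."""
    return sum(1 for d in first_logins.values() if d == date)

# Load the sample rows from the thread's example table.
record_login("2012-12-12", "abcdef")
record_login("2012-12-12", "abcdef")
record_login("2012-12-13", "defdaf")
```

Because the date is the leading part of the key, each daily count is a short contiguous scan rather than a full-table pass, and req #3 (accumulated distinct devices) is simply the row count of table B.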