Subject: Re: How to design a data warehouse in HBase?
From: "Kevin O'dell" <kevin.odell@cloudera.com>
To: user@hbase.apache.org
Date: Thu, 13 Dec 2012 11:42:08 -0500

Correct, Impala relies on the Hive Metastore.

On Thu, Dec 13, 2012 at 11:38 AM, Manoj Babu wrote:

> Kevin,
>
> Impala requires Hive, right? So to get the advantages of Impala, do we
> need to go with Hive?
>
> Cheers!
> Manoj.
>
> On Thu, Dec 13, 2012 at 9:03 PM, Mohammad Tariq wrote:
>
> > Thank you so much for the clarification, Kevin.
> >
> > Regards,
> > Mohammad Tariq
> >
> > On Thu, Dec 13, 2012 at 9:00 PM, Kevin O'dell wrote:
> >
> > > Mohammad,
> > >
> > > I am not sure you are thinking about Impala correctly. It still uses
> > > HDFS, so your data increasing over time is fine. You are not going to
> > > need to tune for special CPU, storage, or network hardware. Typically
> > > with Impala you will be bound at the disks, as it works off of data
> > > locality. You can also use Snappy, GZip, or BZip2 compression to help
> > > with the amount of data you are storing. You will not need to upgrade
> > > your hardware frequently.
> > >
> > > On Thu, Dec 13, 2012 at 10:06 AM, Mohammad Tariq wrote:
> > >
> > > > Oh yes, Impala. Good point by Kevin.
> > > > Kevin: Would it be appropriate to say that I should go for Impala
> > > > if my data is not going to increase dramatically over time, or if I
> > > > have to work on only a subset of my big data? Since Impala uses MPP,
> > > > it may require specialized hardware tuned for CPU, storage, and
> > > > network performance for better results, which could become a problem
> > > > if I have to upgrade the hardware frequently because of the growing
> > > > data.
> > > >
> > > > Regards,
> > > > Mohammad Tariq
> > > >
> > > > On Thu, Dec 13, 2012 at 8:17 PM, Kevin O'dell wrote:
> > > >
> > > > > To Mohammad's point: you can use HBase for quick scans of the
> > > > > data, Hive for your longer-running jobs, and Impala over the two
> > > > > for quick ad hoc searches.
> > > > >
> > > > > On Thu, Dec 13, 2012 at 9:44 AM, Mohammad Tariq wrote:
> > > > >
> > > > > > I am not saying HBase is not good. My point was to consider
> > > > > > Hive as well. Think about the approach keeping both tools in
> > > > > > mind, and then decide. I just offered an option given the
> > > > > > built-in warehousing features Hive provides. I would like to
> > > > > > add one more point here: you can map your HBase tables to Hive.
> > > > > >
> > > > > > Regards,
> > > > > > Mohammad Tariq
> > > > > >
> > > > > > On Thu, Dec 13, 2012 at 7:58 PM, bigdata wrote:
> > > > > >
> > > > > > > Hi, Tariq
> > > > > > > Thanks for your feedback. Actually, we now have two ways to
> > > > > > > reach the target: by Hive and by HBase. Could you tell me why
> > > > > > > HBase is not good for my requirements, or what the problem is
> > > > > > > with my solution?
> > > > > > > Thanks.
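The table mapping Mohammad mentions (exposing an HBase table as a Hive table) essentially projects HBase cells, a rowkey plus family:qualifier/value pairs, into flat relational rows. Below is a minimal in-memory Python sketch of that projection only; the table data and column names are invented, and the real mechanism is Hive's HBase storage handler with its hbase.columns.mapping property, not this code.

```python
# Each HBase row is a rowkey mapped to {"family:qualifier": value} cells.
# The device-login data here is hypothetical, loosely following the thread.
hbase_rows = {
    "abcdef": {"login:last_time": "2012-12-12 19:12:12"},
    "defdaf": {"login:last_time": "2012-12-13 10:10:10"},
}

# A Hive-over-HBase mapping declares which HBase column feeds which Hive
# column; ":key" stands for the rowkey itself, as in hbase.columns.mapping.
mapping = {":key": "device_id", "login:last_time": "last_login"}

def to_hive_rows(rows, mapping):
    """Project HBase cells into flat rows, one dict per rowkey."""
    out = []
    for rowkey, cells in sorted(rows.items()):
        rec = {}
        for hbase_col, hive_col in mapping.items():
            rec[hive_col] = rowkey if hbase_col == ":key" else cells.get(hbase_col)
        out.append(rec)
    return out

flat = to_hive_rows(hbase_rows, mapping)
```

Once mapped this way, Hive queries (counts, group-bys, joins) can run over the flat rows while the data itself stays in HBase.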
> > > > > > > > From: dontariq@gmail.com
> > > > > > > > Date: Thu, 13 Dec 2012 15:43:25 +0530
> > > > > > > > Subject: Re: How to design a data warehouse in HBase?
> > > > > > > > To: user@hbase.apache.org
> > > > > > > >
> > > > > > > > The two have different purposes. People normally say that
> > > > > > > > Hive is slow just because it uses MapReduce under the hood.
> > > > > > > > And I am sure that if the data stored in HBase were very
> > > > > > > > large, nobody would write sequential programs around Get or
> > > > > > > > Scan; instead they would write MapReduce jobs or something
> > > > > > > > similar.
> > > > > > > >
> > > > > > > > My point is that nothing can be 100% real time. Is that what
> > > > > > > > you want? If so, I would never suggest Hadoop in the first
> > > > > > > > place, as it is a batch processing system and cannot be used
> > > > > > > > like an OLTP system unless you build some additional
> > > > > > > > machinery. Since you are talking about a warehouse, I assume
> > > > > > > > you are going to store and process gigantic amounts of data.
> > > > > > > > That is the only reason I suggested Hive.
> > > > > > > >
> > > > > > > > The whole point is that no single tool is a solution for
> > > > > > > > everything; one size does not fit all. First, we need to
> > > > > > > > analyze our particular use case. The person who says Hive is
> > > > > > > > slow might be correct, but only for his scenario.
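Mohammad's point about writing MapReduce jobs rather than one sequential Scan can be sketched as follows: each mapper counts rows in its own region in parallel, and a reduce step merges the partial counts. This is plain Python illustrating the pattern only, not the Hadoop API, and the region contents are invented.

```python
from functools import reduce

# Invented stand-ins for HBase regions; each would live on a different
# region server and be scanned by its own mapper in parallel.
regions = [
    ["row1", "row2", "row3"],
    ["row4", "row5"],
]

def map_phase(region_rows):
    # A mapper scans only its local region and emits a partial count.
    return len(region_rows)

def reduce_phase(partial_counts):
    # The reducer merges the partial counts into the final answer.
    return reduce(lambda a, b: a + b, partial_counts, 0)

total = reduce_phase([map_phase(r) for r in regions])
```

The win over a single sequential Scan is that each mapper reads only local data, so the job's wall-clock time scales with the largest region rather than the whole table.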
> > > > > > > >
> > > > > > > > HTH
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Mohammad Tariq
> > > > > > > >
> > > > > > > > On Thu, Dec 13, 2012 at 3:17 PM, bigdata wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > > I have heard that Hive's performance is too low: it reads
> > > > > > > > > HDFS files and scans all the data to find a single record.
> > > > > > > > > Is that true? And is HBase much faster than Hive?
> > > > > > > > >
> > > > > > > > > > From: dontariq@gmail.com
> > > > > > > > > > Date: Thu, 13 Dec 2012 15:12:25 +0530
> > > > > > > > > > Subject: Re: How to design a data warehouse in HBase?
> > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > >
> > > > > > > > > > Hi there,
> > > > > > > > > >
> > > > > > > > > > If you are really planning a warehousing solution, then
> > > > > > > > > > I would suggest you have a look at Apache Hive. It
> > > > > > > > > > provides warehousing capabilities on top of a Hadoop
> > > > > > > > > > cluster, along with an SQL-like interface to the data
> > > > > > > > > > stored in your warehouse, which will be very helpful if
> > > > > > > > > > you are coming from an SQL background.
> > > > > > > > > >
> > > > > > > > > > HTH
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Mohammad Tariq
> > > > > > > > > >
> > > > > > > > > > On Thu, Dec 13, 2012 at 2:43 PM, bigdata wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks. I think a real example is better for me to
> > > > > > > > > > > understand your suggestions.
> > > > > > > > > > > Now I have a relational table:
> > > > > > > > > > >
> > > > > > > > > > >   ID  LoginTime            DeviceID
> > > > > > > > > > >   1   2012-12-12 12:12:12  abcdef
> > > > > > > > > > >   2   2012-12-12 19:12:12  abcdef
> > > > > > > > > > >   3   2012-12-13 10:10:10  defdaf
> > > > > > > > > > >
> > > > > > > > > > > There are several requirements against this table:
> > > > > > > > > > > 1. How many devices logged in on each day?
> > > > > > > > > > > 2. For one day, how many new devices logged in
> > > > > > > > > > >    (never logged in before)?
> > > > > > > > > > > 3. For one day, how many accumulated devices have
> > > > > > > > > > >    logged in?
> > > > > > > > > > > How can I design HBase tables to calculate these
> > > > > > > > > > > numbers? My current solution is:
> > > > > > > > > > >
> > > > > > > > > > > table A:
> > > > > > > > > > >   rowkey: date-deviceid
> > > > > > > > > > >   column family: login
> > > > > > > > > > >   column qualifiers: 2012-12-12 12:12:12 /
> > > > > > > > > > >   2012-12-12 19:12:12 / ...
> > > > > > > > > > > table B:
> > > > > > > > > > >   rowkey: deviceid
> > > > > > > > > > >   column family: null or any value
> > > > > > > > > > >
> > > > > > > > > > > For req #1, I can scan table A with a PrefixFilter on
> > > > > > > > > > > the rowkey to match one particular date and count the
> > > > > > > > > > > returned records. For req #2, I check table B for each
> > > > > > > > > > > deviceid and count the result. For req #3, I count
> > > > > > > > > > > table A with a prefix filter as in #1.
> > > > > > > > > > > Is this OK? Or is there a better solution?
> > > > > > > > > > > Thanks!!
> > > > > > > > > > >
> > > > > > > > > > > > CC: user@hbase.apache.org
> > > > > > > > > > > > From: michael_segel@hotmail.com
> > > > > > > > > > > > Subject: Re: How to design a data warehouse in HBase?
> > > > > > > > > > > > Date: Thu, 13 Dec 2012 08:43:31 +0000
> > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > >
> > > > > > > > > > > > You need to spend a bit of time on schema design.
> > > > > > > > > > > > You need to flatten your schema, and implement some
> > > > > > > > > > > > secondary indexing to improve join performance.
> > > > > > > > > > > >
> > > > > > > > > > > > It depends on what you want to do. There are other
> > > > > > > > > > > > options too.
> > > > > > > > > > > >
> > > > > > > > > > > > Sent from a remote device. Please excuse any typos...
> > > > > > > > > > > >
> > > > > > > > > > > > Mike Segel
> > > > > > > > > > > >
> > > > > > > > > > > > On Dec 13, 2012, at 7:09 AM, lars hofhansl wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > For OLAP-type queries you will generally be better
> > > > > > > > > > > > > off with a truly column-oriented database. You can
> > > > > > > > > > > > > probably shoehorn HBase into this, but it was not
> > > > > > > > > > > > > really designed with raw scan performance along
> > > > > > > > > > > > > single columns in mind.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ________________________________
> > > > > > > > > > > > > From: bigdata
> > > > > > > > > > > > > To: "user@hbase.apache.org"
> > > > > > > > > > > > > Sent: Wednesday, December 12, 2012 9:57 PM
> > > > > > > > > > > > > Subject: How to design a data warehouse in HBase?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Dear all,
> > > > > > > > > > > > > We have a traditional star-model data warehouse in
> > > > > > > > > > > > > an RDBMS, and now we want to move it to HBase.
> > > > > > > > > > > > > After studying HBase, I have learned that HBase is
> > > > > > > > > > > > > normally queried by rowkey:
> > > > > > > > > > > > > 1. full rowkey (fastest)
> > > > > > > > > > > > > 2. rowkey filter (fast)
> > > > > > > > > > > > > 3. column family/qualifier filter (slow)
> > > > > > > > > > > > > How can I design the HBase tables to implement
> > > > > > > > > > > > > warehouse functions such as:
> > > > > > > > > > > > > 1. Query by DimensionA
> > > > > > > > > > > > > 2. Query by DimensionA and DimensionB
> > > > > > > > > > > > > 3. Sum, count, distinct ...
> > > > > > > > > > > > > In my opinion, I would have to create several
> > > > > > > > > > > > > HBase tables with the different combinations of
> > > > > > > > > > > > > dimensions as the rowkeys. That solution would
> > > > > > > > > > > > > lead to huge data duplication. Are there any good
> > > > > > > > > > > > > suggestions for solving this?
> > > > > > > > > > > > > Thanks a lot!

-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera
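The date-deviceid rowkey design discussed in this thread can be illustrated with a sorted in-memory list standing in for HBase's lexicographically ordered rowkeys, where prefix_scan plays the role of a Scan with a PrefixFilter. This is a conceptual sketch, not the HBase client API; the helper names are invented, and the sample rows come from the example table upthread.

```python
from bisect import bisect_left, insort

table_a = []        # stand-in for table A: sorted "date-deviceid" rowkeys
table_b = set()     # stand-in for table B: every deviceid ever seen
first_logins = {}   # deviceid -> date of first login (supports req #2)

def record_login(date, device_id):
    """Insert one login event into both stand-in tables."""
    insort(table_a, date + "-" + device_id)
    if device_id not in table_b:
        table_b.add(device_id)
        first_logins[device_id] = date

def prefix_scan(prefix):
    """Emulate Scan + PrefixFilter: all rowkeys starting with prefix."""
    i = bisect_left(table_a, prefix)
    rows = []
    while i < len(table_a) and table_a[i].startswith(prefix):
        rows.append(table_a[i])
        i += 1
    return rows

def devices_on(date):
    """Req #1: distinct devices that logged in on the given day."""
    return len({row.rsplit("-", 1)[1] for row in prefix_scan(date)})

def new_devices_on(date):
    """Req #2: devices whose first-ever login was on the given day."""
    return sum(1 for d in first_logins.values() if d == date)

# Load the sample rows from the thread's example table.
record_login("2012-12-12", "abcdef")
record_login("2012-12-12", "abcdef")
record_login("2012-12-13", "defdaf")
```

Because the date is the leading part of the key, each daily count is a short contiguous scan rather than a full-table pass, and req #3 (accumulated distinct devices) is simply the row count of table B.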