From: Andrew Purtell
To: "user@hbase.apache.org"
Cc: "hbase-user@hadoop.apache.org"
Date: Thu, 23 Feb 2012 11:30:14 -0800 (PST)
Subject: Re: Solr & HBase - Re: How is Data Indexed in HBase?
Message-ID: <1330025414.28483.YahooMailNeo@web164506.mail.gq1.yahoo.com>
In-Reply-To: <1330024921.56445.YahooMailNeo@web164501.mail.gq1.yahoo.com>
References: <2CD9179D-41C8-4FAC-897D-B94E20D6AEE9@salesforce.com> <1330024921.56445.YahooMailNeo@web164501.mail.gq1.yahoo.com>

To beat on this analogy further:

"But it would be like using assembler instead of Java or Ruby to build
the server side of some website"

... or if you are Facebook and you get really big but have a pile of PHP for a code base, you make HipHop to convert that code to assembler :-) (in effect)

In HBase land, someone hasn't had a scale itch for search big enough to make our "HipHop". Or might that some day be Solbase?

Best regards,

    - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)


----- Original Message -----
> From: Andrew Purtell
> To: "user@hbase.apache.org"
> Cc: "hbase-user@hadoop.apache.org"
> Sent: Thursday, February 23, 2012 11:22 AM
> Subject: Re: Solr & HBase - Re: How is Data Indexed in HBase?
>
> I'd also make a comment on this:
>
>> On Feb 22, 2012, at 12:12 PM, Jacques wrote:
>
>> The key to keyword retrieval is the construction of the data. Among other
>> things, this is one of the key things that Solr is very good at: creating a
>> very efficient organization of the data so that you can retrieve quickly.
>> At their core, Solr, ElasticSearch, Lily and Katta all use Lucene to
>> construct this data. HBase is bad at this.
>
> I can build an inverted index on top of HBase for some form of full text search.
> But it would be like using assembler instead of Java or Ruby to build the server
> side of some website. Unless scale forces hyper-optimization for the use case,
> ES or Solr is a better choice because then one doesn't have to do all of the
> heavy lifting.
>
> Also, it doesn't have to be an either-or choice. Projects like Lily and
> Solbase are interesting hybrids.
>
>
> Best regards,
>
>
>     - Andy
>
> Problems worthy of attack prove their worth by hitting back.
> - Piet Hein (via Tom White)
>
>
> ----- Original Message -----
>> From: Ian Varley
>> To: "user@hbase.apache.org"
>> Cc: "hbase-user@hadoop.apache.org"
>> Sent: Wednesday, February 22, 2012 10:18 AM
>> Subject: Re: Solr & HBase - Re: How is Data Indexed in HBase?
>>
>> One minor clarification:
>>
>>> HBase is primarily built for retrieving a single row at a time based on a
>>> predetermined and known location (the key).
>>
>> Substitute that with: "HBase is primarily built for retrieving sets of
>> contiguous sorted rows based on a predetermined and known location (the
>> start key)". Scans are fundamentally just as efficient in HBase as gets,
>> because row keys are sorted. In fact, Get is just implemented as a 1-row Scan!
>>
>> This is one of the nice design features that sets HBase (and similar stores)
>> apart from straight key/value stores; you can do range scans of rows.
>>
>> Ian
>>
>> On Feb 22, 2012, at 12:12 PM, Jacques wrote:
>>
>>> Solr does not provide a complex enough support to rank.
>> I believe Solr has a bunch of plug-ability to write your own custom ranking
>> approach. If you think you can't do your desired ranking with Solr, you're
>> probably wrong and need to ask for help from the Solr community.
>>
>>> retrieving data by keyword is one of them.
>>> I think Solr is a proper choice
>> The key to keyword retrieval is the construction of the data. Among other
>> things, this is one of the key things that Solr is very good at: creating a
>> very efficient organization of the data so that you can retrieve quickly.
>> At their core, Solr, ElasticSearch, Lily and Katta all use Lucene to
>> construct this data. HBase is bad at this.
>>
>>> how HBase support high performance when it needs to keep consistency in
>>> a large scale distributed system
>> HBase is primarily built for retrieving a single row at a time based on a
>> predetermined and known location (the key). It is also very efficient at
>> splitting massive datasets across multiple machines and allowing sequential
>> batch analyses of these datasets. HBase can maintain high performance in
>> this way because consistency only ever exists at the row level. This is
>> what HBase is good at.
>>
>> You need to focus what you're doing and then write it out. Figure out how
>> you think the pieces should work together. Read the documentation. Then,
>> ask specific questions where you feel like the documentation is unclear or
>> you feel confused. Your general questions are very difficult to answer in
>> any kind of really helpful way.
>>
>> thanks,
>> Jacques
>>
>>
>> On Wed, Feb 22, 2012 at 9:51 AM, Bing Li wrote:
>>
>> Mr Gupta,
>>
>> Thanks so much for your reply!
>>
>> In my use cases, retrieving data by keyword is one of them. I think Solr
>> is a proper choice.
>>
>> However, Solr does not provide a complex enough support to rank. And,
>> frequent updating is also not suitable in Solr. So it is difficult to
>> retrieve data randomly based on the values other than keyword frequency in
>> text.
>> In this case, I attempt to use HBase.
>>
>> But I don't know how HBase support high performance when it needs to keep
>> consistency in a large scale distributed system.
>>
>> Now both of them are used in my system.
>>
>> I will check out ElasticSearch.
>>
>> Best regards,
>> Bing
>>
>>
>> On Thu, Feb 23, 2012 at 1:35 AM, T Vinod Gupta wrote:
>>
>> Bing,
>> It's a classic battle on whether to use solr or hbase or a combination of
>> both. both systems are very different but there is some overlap in the
>> utility. they also differ vastly when it comes to computation power,
>> storage needs, etc. so in the end, it all boils down to your use case. you
>> need to pick the technology that is best suited to your needs.
>> I'm still not clear on your use case though.
>>
>> btw, if you haven't started using solr yet - then you might want to
>> check out ElasticSearch. I spent over a week researching between solr and ES
>> and eventually chose ES due to its cool merits.
>>
>> thanks
>>
>>
>> On Wed, Feb 22, 2012 at 9:31 AM, Ted Yu wrote:
>>
>> There is no secondary index support in HBase at the moment.
>>
>> It's on our road map.
>>
>> FYI
>>
>> On Wed, Feb 22, 2012 at 9:28 AM, Bing Li wrote:
>>
>> Jacques,
>>
>> Yes. But I still have questions about that.
>>
>> In my system, when users search with a keyword arbitrarily, the query is
>> forwarded to Solr. No updating operations other than appending new indexes
>> exist in Solr managed data.
>>
>> When I need to retrieve data based on ranking values, HBase is used. And,
>> the ranking values need to be updated all the time.
>>
>> Is that correct?
>>
>> My question is that the performance must be low if keeping consistency in a
>> large scale distributed environment.
>> How does HBase handle this issue?
>>
>> Thanks so much!
>>
>> Bing
>>
>>
>> On Thu, Feb 23, 2012 at 1:17 AM, Jacques wrote:
>>
>> It is highly unlikely that you could replace Solr with HBase. They're
>> really apples and oranges.
>>
>>
>> On Wed, Feb 22, 2012 at 1:09 AM, Bing Li wrote:
>>
>> Dear all,
>>
>> I wonder how data in HBase is indexed? Now Solr is used in my system
>> because data is managed in an inverted index. Such an index is suitable for
>> retrieving huge amounts of unstructured data. How does HBase deal with the
>> issue? May I replace Solr with HBase?
>>
>> Thanks so much!
>>
>> Best regards,
>> Bing
>>
>
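
An illustrative sketch of two points made in this thread, written in plain Python and not against HBase itself. It shows Ian's remark that, because row keys are sorted, a Get is just a one-row Scan, and Andy's remark that an inverted index can be hand-built on top of such a store. The names SortedTable, index_document and search, and the "term/doc_id" row-key layout, are hypothetical choices made for this example. It assumes terms never contain "/", and it does no ranking, real tokenization or update handling, which is the heavy lifting Solr or ES would otherwise do.

```python
import bisect


class SortedTable:
    """Toy in-memory stand-in for a store that keeps row keys sorted."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start, stop=None, limit=None):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        i = bisect.bisect_left(self._keys, start)
        count = 0
        while i < len(self._keys):
            key = self._keys[i]
            if stop is not None and key >= stop:
                break
            if limit is not None and count >= limit:
                break
            yield key, self._rows[key]
            count += 1
            i += 1

    def get(self, key):
        """A get expressed as a one-row scan that starts at the key."""
        for found_key, value in self.scan(key, limit=1):
            if found_key == key:
                return value
        return None


def index_document(table, doc_id, text):
    """Write one row per (term, document) pair: a hand-rolled inverted index."""
    for term in set(text.lower().split()):
        table.put(f"{term}/{doc_id}", b"")


def search(table, term):
    """Range-scan every row whose key starts with 'term/' and return doc ids."""
    prefix = f"{term.lower()}/"
    stop = prefix[:-1] + chr(ord("/") + 1)  # smallest key sorting after the prefix
    return [key[len(prefix):] for key, _ in table.scan(prefix, stop)]


table = SortedTable()
index_document(table, "doc1", "HBase stores sorted rows")
index_document(table, "doc2", "Solr builds an inverted index")
print(search(table, "sorted"))  # ids of documents containing "sorted"
```

Both the lookup and the keyword search reduce to a seek into the sorted keys followed by a short sequential read, which is the access pattern the thread says HBase is good at. Everything else a search engine provides would still have to be written by hand.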