Message-ID: <4CF81384.505@tis.bz.it>
Date: Thu, 02 Dec 2010 22:45:40 +0100
From: Claudio Martella
To: user@hbase.apache.org
Subject: Re: incremental counters and a global String->Long Dictionary
References: <4CF3C2E0.2070804@tis.bz.it> <2D6136772A13B84E95DF6DA79E85A9F00130CAD30930@NSPEXMBX-A.the-lab.llnl.gov> <4CF50521.2000905@tis.bz.it> <4CF7BE5E.5040300@tis.bz.it>

Ok, I read it as the non-existence of the column, not the whole key. My bad.

On 12/2/10 10:43 PM, Stack wrote:
> I think it does already Claudio:
>
> http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/client/HTable.html#checkAndPut(byte[],
> byte[], byte[], byte[], org.apache.hadoop.hbase.client.Put)
>
> St.Ack
>
> On Thu, Dec 2, 2010 at 7:42 AM, Claudio Martella
> wrote:
>> Hi Ryan,
>>
>> yes that would help for sure. Shouldn't this feature be documented?
>>
>> Thanks
>>
>>
>> On 12/1/10 4:03 AM, Ryan Rawson wrote:
>>> CheckAndPut interprets a 'null' value argument as a check for
>>> existence. That is, if you set the expected value to null it will only
>>> succeed if the value does not exist.
>>>
>>> Would that help?
>>>
>>> -ryan
>>>
>>> On Tue, Nov 30, 2010 at 6:07 AM, Claudio Martella
>>> wrote:
>>>> Hi Dave,
>>>>
>>>> thanks for your idea. I also considered this possibility. Although the
>>>> possibility of a collision is very small, what scares me is the fact
>>>> that I don't think the corruption can be corrected.
>>>> I can for sure detect it afterwards in O(NlogN) time by scanning the
>>>> table, but correcting my long-based corpus is impossible. Once the
>>>> database is converted, the information is lost.
>>>>
>>>>
>>>> On 11/30/10 1:43 AM, Buttler, David wrote:
>>>>> A while back I had a strange idea to bypass this problem: create a 64-bit hash code for the word. Your word space should be significantly smaller than 64 bits, so a good hash algorithm (the top 64 bits of sha1 say) should make collisions extremely rare. And you can always check your dictionary later for collisions if this feels wrong.
>>>>> This should be a good deal simpler than trying to keep around an order-dependent integer mapping for your dictionary. And, it is somewhat recoverable if you ever lose your dictionary for some reason.
>>>>>
>>>>> Dave
>>>>>
>>>>> -----Original Message-----
>>>>> From: Claudio Martella [mailto:claudio.martella@tis.bz.it]
>>>>> Sent: Monday, November 29, 2010 7:13 AM
>>>>> To: user@hbase.apache.org
>>>>> Subject: incremental counters and a global String->Long Dictionary
>>>>>
>>>>> Hello list,
>>>>>
>>>>> I'm kind of new to HBase, so I'll post this email with a request for
>>>>> comment.
>>>>> Very briefly, I do a lot of text processing with mapreduce, so it's very
>>>>> useful for me to convert strings to longs, so I can make my computations
>>>>> faster.
>>>>>
>>>>> My corpus keeps on growing and I want this String->Long mapping to be
>>>>> persistent and dynamic (I want to add new mappings when I find new words).
>>>>> At the moment I'm tackling the problem this way (pseudo-code):
>>>>>
>>>>> longvalue = convert(word) # gets from hbase
>>>>> if longvalue == -1:
>>>>>     longvalue = insert(word) # puts in hbase
>>>>>
>>>>> longvalue now contains the new mapped value. This approach requires a
>>>>> global counter that saves the latest mapped long and increments at every
>>>>> insert. I can easily do this in two ways: a special row in hbase "_counter"
>>>>> that I increment through IncrementColumnValue, or creating a sequential
>>>>> non-ephemeral znode in zookeeper and using the version as my counter. The
>>>>> first one is of course faster. So the solution would be:
>>>>>
>>>>> insert(word):
>>>>>     longvalue = hbase.incrementColumnValue("_counter", "v")
>>>>>     hbase.put(word, longvalue)
>>>>>     return longvalue
>>>>>
>>>>> The problem is that between the time I realize there's no mapping for my
>>>>> word and the time I insert the new longvalue, somebody else might have
>>>>> done the same for me, so I have a corrupted dictionary.
>>>>>
>>>>> One possible solution would be to acquire a lock on the "_counter" row,
>>>>> recheck for the presence of the mapping and then insert my new value:
>>>>>
>>>>> safe_insert(word):
>>>>>     lock("_counter")
>>>>>     longvalue = convert(word)
>>>>>     if longvalue == -1: # nobody inserted the mapping in the meantime
>>>>>         longvalue = insert(word)
>>>>>     unlock("_counter")
>>>>>     return longvalue
>>>>>
>>>>> This way the counter row, with its lock, would behave as a global lock.
>>>>> This would solve my problems but would create a bottleneck (although
>>>>> with time my inserts tend to get very rare as the dictionary grows). A
>>>>> solution to this problem would be to have locks on zookeeper based on words.
>>>>>
>>>>> ZKsafe_insert(word):
>>>>>     ZKlock("/words/" + word)
>>>>>     longvalue = convert(word)
>>>>>     if longvalue == -1: # nobody inserted the mapping in the meantime
>>>>>         longvalue = insert(word)
>>>>>     ZKunlock("/words/" + word)
>>>>>     return longvalue
>>>>>
>>>>> This of course would allow me to have more fine-grained locks and better
>>>>> scalability, but I'd rely on a system with higher latency (ZK).
>>>>>
>>>>> Does anybody have a better solution with hbase? I guess using
>>>>> hbase_transational would also be a possibility, but again, what about
>>>>> speed and the actual issues with the package (like recovering in the
>>>>> face of hregion failure).
>>>>>
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Claudio
>>>>>
>>>> --
>>>> Claudio Martella
>>>> Digital Technologies
>>>> Unit Research & Development - Analyst
>>>>
>>>> TIS innovation park
>>>> Via Siemens 19 | Siemensstr. 19
>>>> 39100 Bolzano | 39100 Bozen
>>>> Tel. +39 0471 068 123
>>>> Fax +39 0471 068 129
>>>> claudio.martella@tis.bz.it http://www.tis.bz.it
>>>>
>>>> Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to privacy@tis.bz.it in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.
>>>>

--
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it
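
The approach the thread converges on (an atomic counter plus a checkAndPut whose expected value is null, so the put only succeeds when no mapping exists yet) can be sketched roughly as follows. This is only an illustration under stated assumptions: InMemoryTable, check_and_put, increment and get_or_create_id are hypothetical names, the table is an in-process stand-in and not the HBase client API, and a single mutex emulates the per-row atomicity that HBase would provide.

```python
import threading


class InMemoryTable:
    """Hypothetical stand-in for an HBase table; NOT the real client API."""

    def __init__(self):
        self._cells = {}
        # One mutex emulates the atomicity of single-row operations.
        self._mutex = threading.Lock()

    def get(self, row):
        with self._mutex:
            return self._cells.get(row)

    def increment(self, row, amount=1):
        # Atomic counter, in the spirit of incrementColumnValue.
        with self._mutex:
            value = self._cells.get(row, 0) + amount
            self._cells[row] = value
            return value

    def check_and_put(self, row, expected, new_value):
        # With expected=None the put succeeds only if the cell is absent,
        # mirroring the null semantics described for checkAndPut.
        with self._mutex:
            if self._cells.get(row) != expected:
                return False
            self._cells[row] = new_value
            return True


COUNTER_ROW = "_counter"  # special row name taken from the original post


def get_or_create_id(table, word):
    """Return the long mapped to word, creating the mapping if needed."""
    existing = table.get(word)
    if existing is not None:
        return existing
    candidate = table.increment(COUNTER_ROW)
    if table.check_and_put(word, None, candidate):
        return candidate
    # Lost the race: another writer stored a mapping first, so use theirs.
    return table.get(word)
```

No lock is held across the read and the write; a writer that loses the race simply discards its candidate value and reads the winner's. The cost is that discarded candidates leave gaps in the sequence of assigned longs, which should be acceptable as long as the values only need to be unique rather than dense.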