hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: incremental counters and a global String->Long Dictionary
Date Mon, 29 Nov 2010 18:35:12 GMT
You might try HTable.checkAndPut:
http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/client/HTable.html#checkAndPut(byte[], byte[], byte[], byte[], org.apache.hadoop.hbase.client.Put)
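
To make the idea concrete, here is a minimal sketch of a put-if-absent
insert built on a check-and-put primitive. It is written in the same
Python-like style as the pseudo-code quoted below and runs against a
hypothetical in-memory stand-in, not the real HBase client: the names
InMemoryTable, increment and check_and_put are illustrative assumptions,
and the sketch assumes that an expected value of "absent" makes the check
succeed only when the word has no mapping yet.

```python
import threading


class InMemoryTable:
    """Hypothetical stand-in for an HBase table (NOT the real client API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}      # word -> long value
        self._counter = 0    # plays the role of the "_counter" row

    def get(self, row):
        # Returns None when the row has no value.
        with self._lock:
            return self._rows.get(row)

    def increment(self):
        # Stands in for an atomic incrementColumnValue on "_counter".
        with self._lock:
            self._counter += 1
            return self._counter

    def check_and_put(self, row, expected, value):
        # Atomic compare-and-set: write only if the current value equals
        # `expected`. expected=None is assumed to mean "row is absent".
        with self._lock:
            if self._rows.get(row) != expected:
                return False
            self._rows[row] = value
            return True


def safe_insert(table, word):
    """Return the long mapped to `word`, creating the mapping if needed."""
    existing = table.get(word)
    if existing is not None:
        return existing
    candidate = table.increment()
    if table.check_and_put(word, None, candidate):
        return candidate
    # Another client won the race; our candidate is wasted and we return
    # the winner's value instead.
    return table.get(word)
```

With this shape no lock is held across calls: a client that loses the race
simply reads the winner's value. The cost is that each lost race burns one
counter value, so the assigned longs are unique but not necessarily
contiguous.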

St.Ack

On Mon, Nov 29, 2010 at 10:03 AM, Claudio Martella
<claudio.martella@tis.bz.it> wrote:
> Hi Lars,
>
> Thanks for your answer. Yes, I read the Percolator paper, but I'd like to
> solve my problem with existing software, and I like HBase.
> The ephemeral node is, I think, the last solution I proposed, the one I
> called ZKsafe_insert(). Is that right?
>
> On 11/29/10 6:35 PM, Lars George wrote:
>> Hi Claudio,
>>
>> Did you have a look at Google's Percolator paper? I think a mechanism like
>> this may work. Another option often used to implement distributed
>> transactions is ZooKeeper: you could create an ephemeral node for the new
>> word, and the host that succeeds in creating it adds the mapping and then
>> releases the lock. Or some such.
>>
>> Lars
>>
>> On Nov 29, 2010, at 16:12, Claudio Martella <claudio.martella@tis.bz.it> wrote:
>>
>>> Hello list,
>>>
>>> I'm kind of new to HBase, so I'll post this email with a request for
>>> comment.
>>> Very briefly, I do a lot of text processing with MapReduce, so it's very
>>> useful for me to convert strings to longs so I can make my computations
>>> faster.
>>>
>>> My corpus keeps growing and I want this String->Long mapping to be
>>> persistent and dynamic (I want to add new mappings when I find new words).
>>> At the moment I'm tackling the problem this way (pseudo-code):
>>>
>>> longvalue = convert(word) # gets from hbase
>>> if longvalue == -1:
>>>    longvalue = insert(word) # puts in hbase
>>>
>>> longvalue now contains the new mapped value. This approach requires a
>>> global counter that saves the latest mapped long and increments at every
>>> insert. I can easily do this in two ways: with a special row in HBase,
>>> "_counter", that I increment through incrementColumnValue, or by creating
>>> a sequential non-ephemeral znode in ZooKeeper and using its version as my
>>> counter. The first one is of course faster. So the solution would be:
>>>
>>> insert(word):
>>>    longvalue = hbase.incrementColumnValue("_counter", "v")
>>>    hbase.put(word, longvalue)
>>>    return longvalue
>>>
>>> The problem is that between the time I realize there's no mapping for my
>>> word and the time I insert the new longvalue, somebody else might have
>>> inserted the same word, so I end up with a corrupted dictionary.
>>>
>>> One possible solution would be to acquire a lock on the "_counter" row,
>>> recheck for the presence of the mapping and then insert my new value:
>>>
>>> safe_insert(word):
>>>    lock("_counter")
>>>    longvalue = convert(word)
>>>    if longvalue == -1: #nobody inserted the mapping in the meantime
>>>        longvalue = insert(word)
>>>    unlock("_counter")
>>>    return longvalue
>>>
>>> This way the counter row, with its lock, would behave as a global lock.
>>> This would solve my problem but would create a bottleneck (although
>>> with time my inserts tend to get very rare as the dictionary grows). A
>>> solution to this problem would be to have per-word locks in ZooKeeper.
>>>
>>> ZKsafe_insert(word):
>>>    ZKlock("/words/"+ word)
>>>    longvalue = convert(word)
>>>    if longvalue == -1: #nobody inserted the mapping in the meantime
>>>        longvalue = insert(word)
>>>    ZKunlock("/words/"+word)
>>>    return longvalue
>>>
>>> This of course would allow me to have more fine-grained locks and better
>>> scalability, but I'd rely on a system with higher latency (ZK).
>>>
>>> Does anybody have a better solution with HBase? I guess using
>>> hbase_transactional would also be a possibility, but again, what about
>>> speed and the actual issues with the package (like recovering in the
>>> face of HRegion failure)?
>>>
>>>
>>> Thank you,
>>>
>>> Claudio
>>>
>>> --
>>> Claudio Martella
>>> Digital Technologies
>>> Unit Research & Development - Analyst
>>>
>>> TIS innovation park
>>> Via Siemens 19 | Siemensstr. 19
>>> 39100 Bolzano | 39100 Bozen
>>> Tel. +39 0471 068 123
>>> Fax  +39 0471 068 129
>>> claudio.martella@tis.bz.it http://www.tis.bz.it
>>>
>>> Short information regarding use of personal data. According to Section 13
>>> of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that
>>> we process your personal data in order to fulfil contractual and fiscal
>>> obligations and also to send you information regarding our services and
>>> events. Your personal data are processed with and without electronic
>>> means and by respecting data subjects' rights, fundamental freedoms and
>>> dignity, particularly with regard to confidentiality, personal identity
>>> and the right to personal data protection. At any time and without
>>> formalities you can write an e-mail to privacy@tis.bz.it in order to
>>> object the processing of your personal data for the purpose of sending
>>> advertising materials and also to exercise the right to access personal
>>> data and other rights referred to in Section 7 of Decree 196/2003. The
>>> data controller is TIS Techno Innovation Alto Adige, Siemens Street
>>> n. 19, Bolzano. You can find the complete information on the web site
>>> www.tis.bz.it.
>>>
>>>
