hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Hash indexing of HFiles
Date Fri, 15 Jul 2011 16:24:52 GMT
How do you figure the N below, Claudio?  And is the hash a
function that respects the sort order?  Hadoop MapFile has something
similar, where an index entry is made every M entries (irrespective of
size).  Any chance of you trying out your suggestion in HFile?  IIRC, we
have a performance evaluation for various file types (you might be
interested in the recent posting by Mikhail Bautin about an HFile v2).

St.Ack
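
To make the "index entry every M entries" idea concrete, here is a minimal sketch of a sparse index over sorted keys. It is not Hadoop's MapFile API: the function names, the in-memory list standing in for the file, and the use of list positions in place of byte offsets are all assumptions made for illustration.

```python
import bisect


def build_sparse_index(sorted_keys, every_m):
    """Keep every M-th key with its position (a stand-in for a file offset)."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), every_m)]


def lookup(sorted_keys, index, key, every_m):
    """Binary-search the sparse index, then scan at most M entries."""
    index_keys = [k for k, _ in index]
    # Last index entry whose key is <= the key we are looking for.
    slot = bisect.bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first entry in the file
    start = index[slot][1]
    end = min(start + every_m, len(sorted_keys))
    for pos in range(start, end):
        if sorted_keys[pos] == key:
            return pos
    return None


keys = [f"row{i:03d}" for i in range(10)]
index = build_sparse_index(keys, every_m=4)
position = lookup(keys, index, "row006", every_m=4)
```

Because the index relies on the sort order, it also supports range scans, which a plain hash index does not.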

On Fri, Jul 15, 2011 at 7:58 AM, Claudio Martella
<claudio.martella@tis.bz.it> wrote:
> Hi Michael,
>
>
> what I was talking about is more of a vector-of-offsets kind of approach
> instead of the Btree created by the "block starting with key x"
> approach which is used right now. Imagine that after the Records segment
> you have a vector of N longs (instead of the block records we have
> right now), where N = the number of key/value pairs in the file. You get
> the right item inside the vector by computing hash(key) % N, and read the
> exact position of the record inside the file (which you can use for a
> direct seek). This is naive, of course, because it doesn't handle
> collisions, but it should make the idea simple to understand. For example,
> to handle collisions the offset could point to a bucket (a linked list)
> stored after the vector. I've implemented this approach here:
>
> https://github.com/claudiomartella/sketches
>
> and it has very good random read performance (faster than leveldb, in my
> preliminary micro-benchmarks).
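
The vector-of-offsets scheme described above can be sketched roughly as follows. This is not the code from the linked repository: the record layout (two 4-byte length fields followed by key and value), the use of CRC32 as the hash, and the Python lists standing in for the on-disk vector and its collision buckets are all assumptions made for illustration.

```python
import struct
import zlib


def build(records):
    """Serialize (key, value) byte pairs and index their offsets by hash.

    Returns (data, buckets): data is the records segment, and buckets[i]
    holds the offsets of all records whose key hashes to slot i.
    """
    data = bytearray()
    offsets = []
    for key, value in records:
        offsets.append((key, len(data)))
        data += struct.pack(">II", len(key), len(value)) + key + value
    n = max(len(records), 1)  # N = number of key/value pairs
    buckets = [[] for _ in range(n)]
    for key, off in offsets:
        buckets[zlib.crc32(key) % n].append(off)
    return bytes(data), buckets


def read_record(data, off):
    """Decode the record that starts at byte offset off."""
    klen, vlen = struct.unpack_from(">II", data, off)
    start = off + 8
    return data[start:start + klen], data[start + klen:start + klen + vlen]


def get(data, buckets, key):
    """Hash to a slot, then walk that slot's bucket comparing keys."""
    for off in buckets[zlib.crc32(key) % len(buckets)]:
        found_key, value = read_record(data, off)
        if found_key == key:
            return value
    return None


data, buckets = build([(b"a", b"1"), (b"b", b"2"), (b"c", b"3")])
value = get(data, buckets, b"b")
```

A lookup costs one hash plus one seek per entry in the bucket, but the layout gives up key ordering, so it suits workloads without range queries.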
>
>
> On 7/15/11 4:48 PM, Michael Segel wrote:
>> Claudio,
>>
>> I'm not sure on how to answer this...
>>
>> Yes, we've got a prototype of a Lucene on HBase w Spatial that we're
>> starting to test.
>>
>> With respect to hashing...
>> In one project we just hashed the key using the SHA-1 hash already in
>> Java. This gave us the randomness without having to try to build a
>> separate index.
>> But we're still using the base key for the row. It's not like we're
>> creating a secondary index on a column value.
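
One possible reading of the key hashing described above is to prefix the row key with a few bytes of its SHA-1 digest, so rows spread evenly while the base key stays recoverable. The message does not say how the hash was combined with the key, so the prefix layout, the 4-byte prefix length, and the function names below are assumptions; the sketch uses Python's `hashlib` rather than Java.

```python
import hashlib


def hashed_row_key(base_key: bytes, prefix_len: int = 4) -> bytes:
    """Prefix the base key with the first bytes of its SHA-1 digest."""
    digest = hashlib.sha1(base_key).digest()
    return digest[:prefix_len] + base_key


def base_key_of(row_key: bytes, prefix_len: int = 4) -> bytes:
    """Strip the hash prefix to recover the original key."""
    return row_key[prefix_len:]


row_key = hashed_row_key(b"user42")
original = base_key_of(row_key)
```

The prefix is derived from the key itself, so a point lookup can recompute it without any separate index, although adjacent base keys no longer sort next to each other.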
>>
>> There are a couple of other projects out there on GitHub, so you may
>> want to check them out.
>>
>> HTH
>>
>> -Mike
>>
>>
>>> Date: Fri, 15 Jul 2011 14:32:50 +0200
>>> From: claudio.martella@tis.bz.it
>>> To: user@hbase.apache.org
>>> Subject: Hash indexing of HFiles
>>>
>>> Hello list,
>>>
>>> at SIGMOD this year I saw a proliferation of different storage file
>>> formats for HBase, using different techniques. My scenario and usage
>>> don't really require range queries, so I thought I'd take advantage of
>>> even faster random I/O from hash indexing of the data in each sequence
>>> file.
>>>
>>> Does anybody know whether anyone has developed indexing techniques for
>>> sequence files other than Btrees?
>>>
>>>
>>> Thanks!
>>>
>>> --
>>> Claudio Martella
>>> Free Software & Open Technologies
>>> Analyst
>>>
>>> TIS innovation park
>>> Via Siemens 19 | Siemensstr. 19
>>> 39100 Bolzano | 39100 Bozen
>>> Tel. +39 0471 068 123
>>> Fax  +39 0471 068 129
>>> claudio.martella@tis.bz.it http://www.tis.bz.it
>>>
>>> Short information regarding use of personal data. According to Section 13 of
>>> Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal
>>> data in order to fulfil contractual and fiscal obligations and also to send you information
>>> regarding our services and events. Your personal data are processed with and without electronic
>>> means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly
>>> with regard to confidentiality, personal identity and the right to personal data protection.
>>> At any time and without formalities you can write an e-mail to privacy@tis.bz.it in order
>>> to object the processing of your personal data for the purpose of sending advertising materials
>>> and also to exercise the right to access personal data and other rights referred to in Section
>>> 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street
>>> n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.
>>>
>>>
>>>
>>>
>>
>
>
