hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Fastest way to find is a row exist?
Date Sat, 05 Jan 2013 13:29:19 GMT
Hmm, very interesting!

Now, what's the best option? An array of Gets, which will retrieve more
information than needed? Or multiple HTable.exists calls, one by one?

The best would have been to pass an array of Gets to exists... I will
see how much work it is to add that...
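
For illustration, here is a minimal sketch of what a batched existence
check could look like from the caller's side. BatchExists, RowExistence
and missingKeys are hypothetical names, not part of the HBase client
API; the interface only stands in for whatever call ends up answering
"do these rows exist?" for a list of keys. The chunking mirrors the
"about 500 entries at a time" use case quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BatchExists {
    /** Hypothetical stand-in for a batched "do these rows exist?" call. */
    public interface RowExistence {
        boolean[] exists(List<String> rowKeys);
    }

    /** Returns the keys that are absent, asking in chunks of at most chunkSize keys. */
    public static List<String> missingKeys(RowExistence table, List<String> keys, int chunkSize) {
        List<String> missing = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, keys.size());
            List<String> chunk = keys.subList(start, end);
            boolean[] found = table.exists(chunk); // one call per chunk
            for (int i = 0; i < chunk.size(); i++) {
                if (!found[i]) {
                    missing.add(chunk.get(i));
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // In-memory stub standing in for the table; assumption for the demo only.
        Set<String> stored = new HashSet<>(Arrays.asList("a", "c"));
        RowExistence table = chunk -> {
            boolean[] found = new boolean[chunk.size()];
            for (int i = 0; i < chunk.size(); i++) {
                found[i] = stored.contains(chunk.get(i));
            }
            return found;
        };
        System.out.println(missingKeys(table, Arrays.asList("a", "b", "c", "d", "e"), 2));
    }
}
```

The point of the shape is that the caller only gets booleans back, so
nothing about the row content has to travel over the wire.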

JM

2013/1/4, Mohamed Ibrahim <m0brhm@gmail.com>:
> What about HTable.exists ??
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#exists(org.apache.hadoop.hbase.client.Get)
>
> I think that should work if the Get has only the row key.
>
> Mohamed
>
>
> On Fri, Jan 4, 2013 at 3:17 PM, Adrien Mogenet
> <adrien.mogenet@gmail.com> wrote:
>
>> On every Get, the Bloom filter acts as a filter (!) on top of each HFile
>> and lets HBase check whether a key is absent from that HFile. So yes, you
>> will benefit from these filters.
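
To make the "absent from the HFile" point concrete, here is a toy Bloom
filter in plain Java. This is not HBase's implementation; the class
name, the sizes and the two hash functions are assumptions chosen for
illustration. It only shows the one-sided guarantee: a "no" is
definite, while a "yes" only means "maybe, go read the file".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

public class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public ToyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a over the key bytes, used as the second hash.
    private static int fnv1a(byte[] key) {
        int h = 0x811c9dc5;
        for (byte b : key) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th bit position is (h1 + i * h2) mod size.
    private int index(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = fnv1a(key) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(byte[] key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    /** false: the key was definitely never added. true: the key may have been added. */
    public boolean mightContain(byte[] key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ToyBloomFilter filter = new ToyBloomFilter(1024, 3);
        byte[] key = "row-1".getBytes(StandardCharsets.UTF_8);
        filter.add(key);
        System.out.println(filter.mightContain(key)); // an added key is never reported absent
    }
}
```

A key that was never added usually comes back false, but it can come
back true when its bits collide with those of other keys, which is why
a positive answer still requires the actual lookup.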
>>
>>
>> On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari
>> <jean-marc@spaggiari.org> wrote:
>>
>> > Is KeyOnlyFilter using the BloomFilters too?
>> >
>> > Here is, in more detail, what I'm doing.
>> >
>> > A few questions.
>> > - Can I create one single KeyOnlyFilter and give the same filter to
>> > all the gets?
>> > - Will Bloom filters help in such a scenario? My keys are small, let's
>> > say 128 bytes on average.
>> >
>> > The goal here is to check about 500 entries at a time to validate
>> > whether they already exist.
>> >
>> > In my MR job, I start when I have more than 100K lines to handle, and
>> > each line can have up to 1K entries. So it can result in up to 100M
>> > gets... The job initially took 500 minutes to complete. I have added a
>> > few pretty good nodes and it's now taking less than 300 minutes. But I
>> > would like to get under 100 minutes if I can...
>> >
>> > Thanks,
>> >
>> > JM
>> >
>> >         Vector<Get> gets_entry_exist = new Vector<Get>();
>> >         for (Entry entry : entries.getEntries())
>> >         {
>> >                 Get entry_exist = new Get(entry.toKey());
>> >                 entry_exist.setFilter(new KeyOnlyFilter());
>> >                 gets_entry_exist.add(entry_exist);
>> >         }
>> >
>> >         Result[] result_entry_exist = table_entry.get(gets_entry_exist);
>> >
>> >         int index = 0;
>> >         for (Entry entry : entries.getEntries())
>> >         {
>> >                 boolean isEmpty = result_entry_exist[index++].isEmpty();
>> >                 if (isEmpty)
>> >                 {
>> >                         // Process here
>> >                 }
>> >         }
>> >
>> >
>> > 2013/1/4, Damien Hardy <dhardy@viadeoteam.com>:
>> > > Hello Jean-Marc,
>> > >
>> > > Bloom filters are designed for just that.
>> > >
>> > > But they can only tell you that a row doesn't exist, based on a hash
>> > > of the key (not the opposite: 2 rowkeys could have the same hash
>> > > result).
>> > >
>> > > If you want to be sure the rowkey exists, you have to search for it
>> > > in the HFile (the whole mechanism is transparent with get()).
>> > >
>> > > There is also a KeyOnlyFilter
>> > > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html
>> > > which prevents the whole columns of the existing key from being
>> > > returned (which could be heavy).
>> > >
>> > > Cheers,
>> > >
>> > > --
>> > > Damien
>> > >
>> > >
>> > > 2013/1/4 Jean-Marc Spaggiari <jean-marc@spaggiari.org>
>> > >
>> > >> Hi,
>> > >>
>> > >> What's the fastest way to know if a row exists?
>> > >>
>> > >> Today I'm doing this:
>> > >>
>> > >> Get get_entry_exist = new Get(key).addColumn(CF_DATA, C_DATA);
>> > >> Result entry_exist = table_entry.get(get_entry_exist);
>> > >>
>> > >> But should this be faster?
>> > >> Get get_entry_exist = new Get(key);
>> > >> Result entry_exist = table_entry.get(get_entry_exist);
>> > >>
>> > >> There is only one CF and one C on my table.
>> > >>
>> > >> Or is there an even faster way?
>> > >>
>> > >> Also, is there a way to make that even faster? I think BloomFilters
>> > >> can help, right?
>> > >>
>> > >> Thanks,
>> > >>
>> > >> JM
>> > >>
>> > >
>> >
>>
>>
>>
>> --
>> Adrien Mogenet
>> 06.59.16.64.22
>> http://www.mogenet.me
>>
>
