hbase-user mailing list archives

From Bryan Beaudreault <bbeaudrea...@hubspot.com>
Subject Re: Fastest way to find is a row exist?
Date Fri, 04 Jan 2013 20:45:35 GMT
Why do you want to remove the bloom filter?  I think you should keep the
bloom filter but also use the KeyOnlyFilter to cut down on data transferred
over the wire.
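
As background for the advice above, here is a minimal sketch of why a Bloom filter suits existence checks: it can say "definitely absent" without reading the data, but it can only say "possibly present". This is an illustration in plain Java, not HBase's implementation; the class name `BloomFilterSketch`, the FNV-style hash, and the sizes used are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter (hypothetical class, not HBase code). */
final class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Seeded FNV-1a style hash mapped to a bit position (an assumed hash choice). */
    private int position(byte[] key, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : key) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, numBits);
    }

    /** Records a key by setting one bit per hash function. */
    void add(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    /**
     * Returns false only if the key was never added (no false negatives).
     * A true result may be a false positive, so the real data must still be checked.
     */
    boolean mightContain(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1 << 16, 3);
        byte[] present = "row-1".getBytes(StandardCharsets.UTF_8);
        byte[] other = "row-404".getBytes(StandardCharsets.UTF_8);

        filter.add(present);
        System.out.println("row-1 might exist:   " + filter.mightContain(present));
        // Usually false for a key that was never added, but a false positive is possible.
        System.out.println("row-404 might exist: " + filter.mightContain(other));
    }
}
```

A negative answer lets a store file be skipped entirely, which is where the saving on lookups for missing rows comes from. A positive answer still requires reading the file.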


On Fri, Jan 4, 2013 at 3:28 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:

> Ok. I have activated them on 2 of my main tables and I will re-run the
> job and see.
>
> 2 other questions then ;)
>
> 1) I activated them this way: alter 'work_proposed', NAME => '@',
> BLOOMFILTER => 'ROW'. How can I remove them?
> 2) Should I major_compact to make sure all the hashes are stored?
>
> Thanks,
>
> JM
>
> 2013/1/4, Adrien Mogenet <adrien.mogenet@gmail.com>:
> > On every Get, the BloomFilter acts as a filter (!) on top of each HFile
> > and lets HBase check whether a key is absent from that HFile. So yes, you
> > will benefit from these filters.
> >
> >
> > On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari
> > <jean-marc@spaggiari.org> wrote:
> >
> >> Is KeyOnlyFilter using the BloomFilters too?
> >>
> >> Here is, with more details, what I'm doing.
> >>
> >> A few questions.
> >> - Can I create one single KeyOnlyFilter and give the same filter to
> >> all the gets?
> >> - Will bloom filters help in such a scenario? My key is small. Let's
> >> say 128 bytes on average.
> >>
> >> The goal here is to check about 500 entries at a time to validate whether
> >> they already exist or not.
> >>
> >> In my MR, I'm starting when I have more than 100K lines to handle, and
> >> each line can have up to 1K entries. So it can result in up to 100M
> >> gets... The job initially took 500 minutes to complete. I have added a few
> >> pretty good nodes and it's now taking less than 300 minutes. But I
> >> would like to get under 100 minutes if I can...
> >>
> >> Thanks,
> >>
> >> JM
> >>
> >>         Vector<Get> gets_entry_exist = new Vector<Get>();
> >>         for (Entry entry : entries.getEntries())
> >>         {
> >>                 Get entry_exist = new Get(entry.toKey());
> >>                 entry_exist.setFilter(new KeyOnlyFilter());
> >>                 gets_entry_exist.add(entry_exist);
> >>         }
> >>
> >>         Result[] result_entry_exist = table_entry.get(gets_entry_exist);
> >>
> >>         int index = 0;
> >>         for (Entry entry : entries.getEntries())
> >>         {
> >>                 boolean isEmpty = result_entry_exist[index++].isEmpty();
> >>                 if (isEmpty)
> >>                 {
> >>                         // Process here
> >>                 }
> >>         }
> >>
> >>
> >> 2013/1/4, Damien Hardy <dhardy@viadeoteam.com>:
> >> > Hello Jean-Marc,
> >> >
> >> > BloomFilters are designed for just that.
> >> >
> >> > But they can only tell you that a row doesn't exist, using a hash of the
> >> > key (not the opposite: 2 rowkeys could have the same hash result).
> >> >
> >> > If you want to be sure the rowkey exists, you have to search for it in
> >> > the HFile (the whole mechanism is transparent with the get()).
> >> >
> >> > There is also a KeyOnlyFilter
> >> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html
> >> > which avoids returning the values of all columns of the existing key
> >> > (which could be heavy).
> >> >
> >> > Cheers,
> >> >
> >> > --
> >> > Damien
> >> >
> >> >
> >> > 2013/1/4 Jean-Marc Spaggiari <jean-marc@spaggiari.org>
> >> >
> >> >> Hi,
> >> >>
> >> >> What's the fastest way to know if a row exists?
> >> >>
> >> >> Today I'm doing that:
> >> >>
> >> >> Get get_entry_exist = new Get(key).addColumn(CF_DATA, C_DATA);
> >> >> Result entry_exist = table_entry.get(get_entry_exist);
> >> >>
> >> >> But should this be faster?
> >> >> Get get_entry_exist = new Get(key);
> >> >> Result entry_exist = table_entry.get(get_entry_exist);
> >> >>
> >> >> There is only one CF and one C on my table.
> >> >>
> >> >> Or is there an even faster way?
> >> >>
> >> >> Also, is there a way to make that even faster? I think BloomFilters
> >> >> can help, right?
> >> >>
> >> >> Thanks,
> >> >>
> >> >> JM
> >> >>
> >> >
> >>
> >
> >
> >
> > --
> > Adrien Mogenet
> > 06.59.16.64.22
> > http://www.mogenet.me
> >
>
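
The batching pattern from the thread (check a few hundred keys per call, then match results back to keys by position) can be sketched in plain Java as follows. This is only an illustration under assumptions: `BatchedExistenceCheck`, `findMissing`, and the `lookup` function are hypothetical names, and `lookup` stands in for a batched client call that returns one result per key in request order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical helper: splits keys into batches and collects the missing ones. */
final class BatchedExistenceCheck {

    /**
     * @param keys      keys to check
     * @param batchSize maximum number of keys per lookup call (e.g. 500)
     * @param lookup    stand-in for a batched read; must return one flag per key, in order
     * @return the keys whose flag was false, in their original order
     */
    static <K> List<K> findMissing(List<K> keys, int batchSize,
                                   Function<List<K>, boolean[]> lookup) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<K> missing = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            List<K> batch = keys.subList(start, end);
            boolean[] exists = lookup.apply(batch);
            if (exists.length != batch.size()) {
                throw new IllegalStateException("lookup must return one result per key");
            }
            // Results are matched to keys by position, as in the quoted code.
            for (int i = 0; i < batch.size(); i++) {
                if (!exists[i]) {
                    missing.add(batch.get(i));
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // In-memory set standing in for the table contents.
        Set<String> stored = new HashSet<>(Arrays.asList("a", "c"));
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e");

        List<String> missing = findMissing(keys, 2, batch -> {
            boolean[] result = new boolean[batch.size()];
            for (int i = 0; i < result.length; i++) {
                result[i] = stored.contains(batch.get(i));
            }
            return result;
        });
        System.out.println(missing); // prints [b, d, e]
    }
}
```

Keeping the lookup behind a function makes the batching logic testable without a cluster. In a real job the lambda would wrap the batched get, and the batch size would be tuned against the region servers.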
