hbase-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: PrefixFilter performance question.
Date Thu, 10 Dec 2009 21:06:44 GMT
On Tue, Dec 8, 2009 at 11:43 PM, stack <stack@duboce.net> wrote:
> Try using this filter instead:
>
>      scan.setFilter(FirstKeyOnlyFilter.new())
>
> It will only return row keys, if that's the effect you are looking for.
>
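> In Java that is roughly (a minimal untested sketch):
>
>     Scan scan = new Scan();
>     // Return only the first KeyValue of each row; little more than
>     // the row key comes back to the client.
>     scan.setFilter(new FirstKeyOnlyFilter());
>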
> St.Ack
>
>
> On Tue, Dec 8, 2009 at 3:30 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>
>> On Tue, Dec 8, 2009 at 6:00 PM, Andrew Purtell <apurtell@apache.org> wrote:
>> > I added an entry to the troubleshooting page up on the wiki:
>> >
>> >    http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A16
>> >
>> >  - Andy
>> >
>> >
>> >
>> >
>> >
>> > ________________________________
>> > From: Ryan Rawson <ryanobjc@gmail.com>
>> > To: hbase-user@hadoop.apache.org
>> > Sent: Tue, December 8, 2009 5:21:25 PM
>> > Subject: Re: PrefixFilter performance question.
>> >
>> > You want:
>> >
>> >
>> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/client/HTable.html#scannerCaching
>> >
>> > The default is low because if a job takes too long processing, a
>> > scanner can time out, which causes unhappy jobs/people/emails.
>> >
>> > BTW I can read small rows out of a 19 node cluster at 7 million
>> > rows/sec using a map-reduce program.  Any individual process is doing
>> > 40k+ rows/sec or so
>> >
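>> > For example (untested sketch; the same knob can also be set
>> > cluster-wide via the hbase.client.scanner.caching property):
>> >
>> >     HTable table = new HTable(new HBaseConfiguration(), "webdata");
>> >     // Ship 100 rows per RPC instead of the default. Bigger batches
>> >     // mean fewer round trips, but more client memory and a higher
>> >     // chance of a scanner lease timeout if processing is slow.
>> >     table.setScannerCaching(100);
>> >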
>> > -ryan
>> >
>> > On Tue, Dec 8, 2009 at 12:25 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>> >> Hey all,
>> >>
>> >> I have been doing some performance evaluation with mysql vs hbase.
>> >>
>> >> I have a table 'webdata':
>> >>
>> >> {NAME => 'webdata', FAMILIES => [
>> >>   {NAME => 'anchor',   COMPRESSION => 'NONE', VERSIONS => '3',
>> >>    TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
>> >>    BLOCKCACHE => 'true'},
>> >>   {NAME => 'image',    COMPRESSION => 'NONE', VERSIONS => '3',
>> >>    TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
>> >>    BLOCKCACHE => 'true'},
>> >>   {NAME => 'raw_data', COMPRESSION => 'NONE', VERSIONS => '3',
>> >>    TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
>> >>    BLOCKCACHE => 'true'}]}
>> >>
>> >> I have a normalized version in MySQL. I currently have loaded:
>> >>
>> >> nyhadoopdev6:60030  1260289750689  requests=4, regions=3, usedHeap=99,  maxHeap=997
>> >> nyhadoopdev7:60030  1260289862481  requests=0, regions=2, usedHeap=181, maxHeap=997
>> >> nyhadoopdev8:60030  1260289909059  requests=0, regions=2, usedHeap=395, maxHeap=997
>> >>
>> >> Here is a snippet:
>> >>
>> >> if (mysql) {
>> >>   try {
>> >>     PreparedStatement ps = conn.prepareStatement(
>> >>         "SELECT * FROM page WHERE page LIKE (?)");
>> >>     ps.setString(1, "http://www.s%");
>> >>     ResultSet rs = ps.executeQuery();
>> >>     while (rs.next()) {
>> >>       sPageCount++;
>> >>     }
>> >>     rs.close();
>> >>     ps.close();
>> >>   } catch (SQLException ex) {
>> >>     System.out.println(ex);
>> >>     System.exit(1);
>> >>   }
>> >> }
>> >>
>> >>      if (hbase) {
>> >>        Scan s = new Scan();
>> >>        //s.setCacheBlocks(true);
>> >>        s.setFilter( new PrefixFilter(Bytes.toBytes("http://www.s") )
);
>> >>        ResultScanner scanner = table.getScanner(s);
>> >>        try {
>> >>          for (Result rr:scanner){
>> >>            sPageCount++;
>> >>          }
>> >>       } finally {
>> >>         scanner.close();
>> >>       }
>> >>
>> >>      }
>> >>
>> >> I am seeing about 0.3 ms from MySQL and about 20 seconds from
>> >> HBase. I have read some tuning docs, but most seem geared toward
>> >> insertion speed, not search speed. I would think this would be a
>> >> bread-and-butter search for HBase, since the row keys are naturally
>> >> sorted lexicographically. I am not running a giant setup here (3
>> >> nodes, 2x replication), but I would think that is almost a
>> >> non-factor since this data is fairly small. Hints?
>> >>
>> >
>> >
>> >
>> >
>>
>> I raised this from 1 to 30 -> 18 sec
>> I raised this to 100 -> 17 sec
>> I raised this to 1000 -> OOM
>>
>> The OOM pointed me in the direction that this comparison is not apples
>> to apples. In MySQL the page table is normalized, but in HBase it is
>> not. I see lots of data moving across the wire.
>>
>> I tried a filter to move just the row key across the wire, but I do
>> not think I have it right...
>>
>>  List<Filter> filters = new ArrayList<Filter>();
>>  filters.add(new PrefixFilter(Bytes.toBytes("http://www.s")));
>>  filters.add(new QualifierFilter(CompareOp.EQUAL,
>>      new BinaryComparator(Bytes.toBytes("ROW"))));
>>  Filter f = new FilterList(Operator.MUST_PASS_ALL, filters);
>>  s.setFilter(f);
>>  ResultScanner scanner = table.getScanner(s);
>>
>
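For the record, I think what I was reaching for above is
FirstKeyOnlyFilter in place of the QualifierFilter, something like
this (untested sketch, same FilterList otherwise):

  List<Filter> filters = new ArrayList<Filter>();
  // Match only rows whose key starts with the prefix...
  filters.add(new PrefixFilter(Bytes.toBytes("http://www.s")));
  // ...and return only the first KeyValue of each row, so little
  // more than the row key crosses the wire.
  filters.add(new FirstKeyOnlyFilter());
  s.setFilter(new FilterList(Operator.MUST_PASS_ALL, filters));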

In the meantime, I have restricted the scan to the smallest family I
have:

  s.addFamily(Bytes.toBytes("anchor"));

This drops the search to spage_time: 2266 ms, and a second consecutive
search takes ~1000 ms.

This is more reasonable; the remaining discrepancy could be explained
by each entry having 5-10 random anchors associated with it.

I am using the CE HBase 0.20.0 RPM, and guess what I do not have?
FirstKeyOnlyFilter :) I really like the layout and init scripts this
RPM provides, but I can't seem to find the src.rpm for it anywhere. If
I do not find it in a few days, I might just move to the latest
release or trunk. (Side note: does anyone have the source RPM?)
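
One more thing I want to try: as far as I can tell, PrefixFilter on
its own does not make the scan seek; the scan still starts at the
first row of the table and the filter just drops non-matching rows
server-side. Bounding the scan with a start and stop row should let it
jump straight to the prefix (untested sketch; bumping the last byte to
build the stop row assumes it is not 0xFF):

  byte[] prefix = Bytes.toBytes("http://www.s");
  Scan s = new Scan();
  s.setStartRow(prefix);
  // Stop just past the prefix: copy it and increment the last byte.
  byte[] stop = Arrays.copyOf(prefix, prefix.length);
  stop[stop.length - 1]++;
  s.setStopRow(stop);
  // Keep the prefix filter as a safety net inside the bounded range.
  s.setFilter(new PrefixFilter(prefix));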
