hbase-user mailing list archives

From Dhaval Shah <prince_mithi...@yahoo.co.in>
Subject Re: RE: Add Columnsize Filter for Scan Operation
Date Fri, 25 Oct 2013 14:23:02 GMT
John, an important point to note here is that even though a row gets split over multiple
calls to scanner.next(), all batches of a single row will always reach the same mapper.
Another important point is that these batches will appear in consecutive calls to mapper.map().

What this means is that you don't need to send your data to the reducer (which is more
efficient: no writing to disk, no shuffle/sort phases and so on). You can just keep the state
in memory for the row currently being processed (effectively a running count of the number of
columns) and make the final decision when the row ends (that is, when you encounter a
different row, or when all rows are exhausted and you reach the cleanup function).

The way I would do it is a map-only MR job which keeps the state in memory as described above
and uses the KeyOnlyFilter to reduce the amount of data flowing to the mapper.
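
The running count described above could be sketched roughly as follows. This is a
minimal illustration in plain Java with no HBase dependencies, not the actual job:
the class WideRowTracker, its method names and the threshold parameter are
hypothetical, and row keys are simplified to String. In a real mapper, accept() would
presumably be called from map() with the row key and the number of cells in the
batch (for example Result.size()), and finish() from cleanup().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: counts columns per row across consecutive batches. */
final class WideRowTracker {
    private final int threshold;                 // assumed cutoff, e.g. 2500 columns
    private final List<String> wideRows = new ArrayList<>();
    private String currentRow;                   // row key of the most recent batches
    private long currentCount;                   // columns counted so far for currentRow

    WideRowTracker(int threshold) {
        this.threshold = threshold;
    }

    /** Feed one batch; batches of the same row must arrive consecutively. */
    void accept(String rowKey, int columnsInBatch) {
        if (currentRow != null && !currentRow.equals(rowKey)) {
            flush();                             // a different row started: previous row is complete
        }
        currentRow = rowKey;
        currentCount += columnsInBatch;
    }

    /** Call once after the last batch (the cleanup step) to decide on the final row. */
    List<String> finish() {
        if (currentRow != null) {
            flush();
        }
        return wideRows;
    }

    /** Make the final decision for the completed row and reset the state. */
    private void flush() {
        if (currentCount > threshold) {
            wideRows.add(currentRow);
        }
        currentRow = null;
        currentCount = 0;
    }

    public static void main(String[] args) {
        WideRowTracker tracker = new WideRowTracker(2500);
        // row1 has 3500 columns and arrives in four batches of at most 1000
        tracker.accept("row1", 1000);
        tracker.accept("row1", 1000);
        tracker.accept("row1", 1000);
        tracker.accept("row1", 500);
        tracker.accept("row2", 800);             // a narrow row
        System.out.println(tracker.finish());    // [row1]
    }
}
```

Only one row key and one counter are held in memory at a time, which relies on the
guarantee that batches of a row arrive in consecutive map() calls.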

 From: John <johnnyenglish739@gmail.com>
To: user@hbase.apache.org; lars hofhansl <larsh@apache.org> 
Sent: Friday, 25 October 2013 8:02 AM
Subject: Re: RE: Add Columnsize Filter for Scan Operation

One thing I could do is drop every batch-row whose column count is
smaller than the batch size. Something like if(rowsize < batchsize-1) drop
row. The problem with this version is that the last batch of a big row is
also dropped. Here is a little example:
There is one row:
row1: 3500 columns

If I set the batch to 1000, the map function gets the following for this row:

1. Iteration: map function got 1000 columns -> write to disk for the reducer
2. Iteration: map function got 1000 columns -> write to disk for the reducer
3. Iteration: map function got 1000 columns -> write to disk for the reducer
4. Iteration: map function got 500 columns -> drop, because it's smaller
than the batch size

Is there a way to count the columns over different map-functions?
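
The arithmetic of the example above can be checked with a small sketch. It only
simulates the batch sizes in plain Java and does not touch HBase; the class
BatchDropDemo and its method names are hypothetical, and the batch size is assumed
to be positive.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of how a wide row is split into batches. */
final class BatchDropDemo {
    /** Sizes of the batches that a row with `columns` cells is split into. */
    static List<Integer> batchSizes(int columns, int batch) {
        List<Integer> sizes = new ArrayList<>();
        for (int remaining = columns; remaining > 0; remaining -= batch) {
            sizes.add(Math.min(batch, remaining));
        }
        return sizes;
    }

    /** Columns that survive the rule "if (rowsize < batchsize - 1) drop". */
    static int keptColumns(int columns, int batch) {
        int kept = 0;
        for (int size : batchSizes(columns, batch)) {
            if (size >= batch - 1) {
                kept += size;                    // batch is full enough to keep
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(batchSizes(3500, 1000));   // [1000, 1000, 1000, 500]
        System.out.println(keptColumns(3500, 1000));  // 3000
    }
}
```

The last 500 columns of row1 are lost under the drop rule, which is the problem
described above.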


2013/10/25 John <johnnyenglish739@gmail.com>

> I tried to build an MR job, but in my case that doesn't work. Say I
> set the batch to 1000 and there are 5000 columns in a row. Now I
> want to generate something for rows where the column count is bigger
> than 2500. BUT since the map function is executed for every batch-row, I
> can't tell whether the row has more than 2500 columns.
> Any ideas?
> 2013/10/25 lars hofhansl <larsh@apache.org>
>> We need to finish up HBASE-8369
>> ________________________________
>>  From: Dhaval Shah <prince_mithibai@yahoo.co.in>
>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>> Sent: Thursday, October 24, 2013 4:38 PM
>> Subject: Re: RE: Add Columnsize Filter for Scan Operation
>> Well that depends on your use case ;)
>> There are many nuances/code complexities to keep in mind:
>> - merging results of various HFiles (each region can have more than one)
>> - merging results of WAL
>> - applying delete markers
>> - how about data which is only in memory of region servers and nowhere
>> else
>> - applying bloom filters for efficiency
>> - what about hbase filters?
>> At some point you would basically start rewriting an hbase region server
>> in your map reduce job, which is not ideal for maintainability.
>> Do we ever read MySQL data files directly or issue a SQL query? Kind of
>> goes back to the same argument ;)