hbase-user mailing list archives

From John <johnnyenglish...@gmail.com>
Subject Re: RE: Add Columnsize Filter for Scan Operation
Date Fri, 25 Oct 2013 23:17:06 GMT
@Dhaval: Thanks! I didn't know that. I've now created a field in the Mapper
class that stores state from the previous map() call. That works fine for
me.
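
In code, that pattern looks roughly like this (a minimal sketch; the class
name, threshold, and output types are illustrative, not from the thread):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;

public class ColumnCountMapper
    extends TableMapper<ImmutableBytesWritable, IntWritable> {

  private static final int THRESHOLD = 2500; // illustrative cutoff

  private byte[] currentRow = null; // row whose columns we are counting
  private int columnCount = 0;      // running count across batches

  @Override
  protected void map(ImmutableBytesWritable key, Result value,
      Context context) throws IOException, InterruptedException {
    byte[] row = key.copyBytes();
    if (currentRow == null || !Bytes.equals(currentRow, row)) {
      emitIfBigEnough(context); // previous row is complete
      currentRow = row;
      columnCount = 0;
    }
    columnCount += value.size(); // number of columns in this batch
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    emitIfBigEnough(context); // don't lose the last row
  }

  private void emitIfBigEnough(Context context)
      throws IOException, InterruptedException {
    if (currentRow != null && columnCount > THRESHOLD) {
      context.write(new ImmutableBytesWritable(currentRow),
          new IntWritable(columnCount));
    }
  }
}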

regards,
john


2013/10/25 Dhaval Shah <prince_mithibai@yahoo.co.in>

> John, an important point to note here is that even though rows will get
> split over multiple calls to scanner.next(), all batches of a given row
> will always reach the same mapper. Another important point is that these
> batches will appear in consecutive calls to mapper.map().
>
> What this means is that you don't need to send your data to the reducer
> (and you gain efficiency: no writing to disk, no shuffle/sort phase, and
> so on). You can just keep the state in memory for the row currently being
> processed (effectively a running count of the number of columns) and make
> the final decision when the row ends (that is, when you encounter a
> different row, or when all rows are exhausted and you reach the cleanup
> function).
>
> The way I would do it is a map-only MR job which keeps the state in
> memory as described above and uses the KeyOnlyFilter to reduce the amount
> of data flowing to the mapper.
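>
> The setup would be something like this (just a sketch; the table name,
> the mapper class, and its output types are placeholders for whatever
> your counting mapper uses):
>
> Configuration conf = HBaseConfiguration.create();
> Scan scan = new Scan();
> scan.setBatch(1000);                  // split wide rows into batches
> scan.setCaching(500);                 // rows fetched per RPC; tune it
> scan.setFilter(new KeyOnlyFilter());  // keys alone suffice for counting
>
> Job job = Job.getInstance(conf, "column-count");
> job.setJarByClass(ColumnCountMapper.class);
> TableMapReduceUtil.initTableMapperJob("mytable", scan,
>     ColumnCountMapper.class, ImmutableBytesWritable.class,
>     IntWritable.class, job);
> job.setNumReduceTasks(0);             // map-only, no shuffle/sort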
>
> Regards,
> Dhaval
>
>
> ________________________________
>  From: John <johnnyenglish739@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <larsh@apache.org>
> Sent: Friday, 25 October 2013 8:02 AM
> Subject: Re: RE: Add Columnsize Filter for Scan Operation
>
>
> One thing I could do is drop every batch-row whose column count is
> smaller than the batch size, something like: if (rowsize < batchsize)
> drop row. The problem with this version is that the last batch of a big
> row is also dropped. Here's a little example:
> There is one row:
> row1: 3500 columns
>
> If I set the batch size to 1000, the map function receives, for that row:
>
> Iteration 1: map() receives 1000 columns -> write to disk for the
> reducer
> Iteration 2: map() receives 1000 columns -> write to disk for the reducer
> Iteration 3: map() receives 1000 columns -> write to disk for the reducer
> Iteration 4: map() receives 500 columns -> dropped, because it's smaller
> than the batch size (see the sketch below)
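>
> The check I have in mind looks something like this (a fragment inside a
> TableMapper subclass; BATCH_SIZE is illustrative), and it is exactly the
> version that misfires in iteration 4:
>
> private static final int BATCH_SIZE = 1000; // must match scan.setBatch()
>
> @Override
> protected void map(ImmutableBytesWritable key, Result value,
>     Context context) throws IOException, InterruptedException {
>   // value.size() is the number of columns in this batch, at most 1000
>   if (value.size() < BATCH_SIZE) {
>     return; // drops small rows, but also drops the 500-column tail
>             // of row1, which belongs to a 3500-column row
>   }
>   context.write(key, new IntWritable(value.size()));
> }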
>
> Is there a way to count the columns across different map() calls?
>
> regards
>
>
>
> 2013/10/25 John <johnnyenglish739@gmail.com>
>
> > I tried to build an MR job, but in my case that doesn't work, because
> > if I set, for example, the batch to 1000 and there are 5000 columns in
> > a row, I want to generate something only for rows whose column count is
> > bigger than 2500. BUT since the map function is executed for every
> > batch-row, I can't tell whether the row has more than 2500 columns.
> >
> > any ideas?
> >
> >
> > 2013/10/25 lars hofhansl <larsh@apache.org>
> >
> >> We need to finish up HBASE-8369
> >>
> >>
> >>
> >> ________________________________
> >>  From: Dhaval Shah <prince_mithibai@yahoo.co.in>
> >> To: "user@hbase.apache.org" <user@hbase.apache.org>
> >> Sent: Thursday, October 24, 2013 4:38 PM
> >> Subject: Re: RE: Add Columnsize Filter for Scan Operation
> >>
> >>
> >> Well that depends on your use case ;)
> >>
> >> There are many nuances/code complexities to keep in mind:
> >> - merging results of various HFiles (each region can have more than one)
> >> - merging results of WAL
> >> - applying delete markers
> >> - handling data which is only in the memory of region servers and
> >> nowhere else
> >> - applying bloom filters for efficiency
> >> - what about HBase filters?
> >>
> >> At some point you would basically be rewriting an HBase region server
> >> inside your MapReduce job, which is not ideal for maintainability.
> >>
> >> Do we ever read MySQL data files directly or issue a SQL query? Kind of
> >> goes back to the same argument ;)
> >>
> >> Sent from Yahoo Mail on Android
> >>
> >
> >
>
