hbase-user mailing list archives

From Slava Gorelik <slava.gore...@gmail.com>
Subject Re: MR Job question
Date Wed, 04 Mar 2009 12:45:23 GMT
How can you tell it that? There is no interface in the MR job definition that
allows it. Every sample MR job in HBase works like this (this is the map from
RowCounter):

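// Called once for each row that is handed to this map task.
public void map(ImmutableBytesWritable row, RowResult value,
    OutputCollector<ImmutableBytesWritable, RowResult> output,
    @SuppressWarnings("unused") Reporter reporter)
  throws IOException {
    // Look for at least one non-empty cell in this row.
    boolean content = false;
    for (Map.Entry<byte [], Cell> e: value.entrySet()) {
      Cell cell = e.getValue();
      if (cell != null && cell.getValue().length > 0) {
        content = true;
        break;
      }
    }
    // Skip rows that have no content.
    if (!content) {
      return;
    }
    // ... (rest of the method omitted from this excerpt)
  }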

You can't say which rows you want to get.

Best Regards.
Slava.
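
For illustration only: the map() signature above indeed has no place to name rows. In the approach Schubert describes in the quoted message below, the restriction would apply earlier, when the job decides which regions become input splits. The following is a minimal sketch of that idea in plain Java, not HBase code. The class RowRangeSplits, its Region type, and the sample region boundaries are all hypothetical, and keys are modeled as Strings, whereas HBase row keys are byte arrays compared lexicographically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowRangeSplits {
    /** Hypothetical region descriptor covering [startKey, endKey); an empty endKey means "to the end of the table". */
    static final class Region {
        final String name;
        final String startKey;
        final String endKey;

        Region(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** Made-up table layout with four regions, used only for this example. */
    static List<Region> sampleRegions() {
        return Arrays.asList(
            new Region("r1", "", "g"),
            new Region("r2", "g", "n"),
            new Region("r3", "n", "t"),
            new Region("r4", "t", ""));
    }

    /** Returns the names of the regions whose key range overlaps [startRow, endRow). */
    static List<String> regionsInRange(List<Region> regions, String startRow, String endRow) {
        List<String> selected = new ArrayList<>();
        for (Region r : regions) {
            // The region must begin before the requested range ends ...
            boolean startsBeforeRangeEnds = r.startKey.compareTo(endRow) < 0;
            // ... and must end after the requested range begins.
            boolean endsAfterRangeStarts = r.endKey.isEmpty() || r.endKey.compareTo(startRow) > 0;
            if (startsBeforeRangeEnds && endsAfterRangeStarts) {
                selected.add(r.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Only regions overlapping [h, p) would get a map task; the others are never read.
        System.out.println(regionsInRange(sampleRegions(), "h", "p"));
    }
}
```

Under these assumptions, the regions that fall outside the range never produce a split, so their rows are not passed to map() at all. A filter, by contrast, is applied to rows that are already being scanned.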


On Wed, Mar 4, 2009 at 1:31 PM, schubert zhang <zsongbo@gmail.com> wrote:

> In my job, I can tell the MR job the startRow and endRow, i.e. a row
> range. Then my MR job can scan only the region(s) in that range, and should
> not scan from the beginning of the table or tablet/region to the end.
>
> So, Slava, you should modify the code of your MR job to do what you want.
>
> Schubert
>
> On Wed, Mar 4, 2009 at 4:58 PM, Slava Gorelik <slava.gorelik@gmail.com> wrote:
>
> > Hi. I'm a little confused.
> >
> > Please correct me if I'm wrong, but the MR job itself "scans" all rows
> > in the table. The job is spread across the region servers, in
> > multiple threads. Each thread gets some of the rows that are stored on a
> > particular region server. So the MR job is finished when all
> > threads have passed over all rows. Filtering will only help the MR job
> > filter out non-relevant rows; those rows will still be checked (passed
> > to the filter), so this does not help much: the job still passes over all
> > rows in the table. Calling a scanner inside the MR job will not
> > prevent the job from passing over all rows; it will simply make the job
> > heavier (as I understand it). Is that correct, Michael?
> >
> > So my question is: how can I tell the MR job to pass over some rows and
> > not all rows?
> >
> > Thank You and Best Regards.
> > Slava.
> >
> >
> > On Wed, Mar 4, 2009 at 8:57 AM, stack <stack@duboce.net> wrote:
> >
> > > On Tue, Mar 3, 2009 at 6:17 PM, schubert zhang <zsongbo@gmail.com> wrote:
> > >
> > > > Yes, we can tell the HBase API to only scan rows starting with a key.
> > > >
> > >
> > > Would the filtering feature help here?
> > >
> > > http://hadoop.apache.org/hbase/docs/r0.19.0/api/org/apache/hadoop/hbase/filter/package-summary.html#package_description
> > >
> > > Scanners can be passed a filter (read the description section at the
> > > above URL).
> > >
> > >
> > > > Can any expert share ideas about:
> > > > 1. If the rowkey is not chronological, how can I process only the newly
> > > > added/updated rows?
> > >
> > >
> > > We don't have a means of asking for versions newer than a timestamp, only
> > > older ones. (Can you add a timestamp to your row key if you need this?)
> > >
> > >
> > > > 2. How can I remove the old rows that were inserted three months ago?
> > > >
> > >
> > > See above.
> > >
> > > St.Ack
> > >
> >
>
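
As a rough illustration of stack's suggestion above to add a timestamp to the row key: if the key begins with a fixed-width timestamp, rows sort by insertion time, so "newly added" and "older than three months" each become one contiguous key range. This sketch uses a java.util.TreeMap as a stand-in for a table sorted by row key. The class TimestampRowKey, its key layout (13-digit zero-padded milliseconds, a dash, then an id), and the method names are assumptions made for the example, not HBase API. It also assumes non-negative timestamps.

```java
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

public class TimestampRowKey {
    /** Builds a key whose zero-padded timestamp prefix makes lexicographic order match time order. */
    static String rowKey(long timestampMillis, String id) {
        return String.format(Locale.ROOT, "%013d-%s", timestampMillis, id);
    }

    /** Rows written at or after sinceMillis: the tail of the sorted key space. */
    static SortedMap<String, String> newerThan(TreeMap<String, String> table, long sinceMillis) {
        return table.tailMap(rowKey(sinceMillis, ""));
    }

    /** Rows written strictly before cutoffMillis: the head of the sorted key space. */
    static SortedMap<String, String> olderThan(TreeMap<String, String> table, long cutoffMillis) {
        return table.headMap(rowKey(cutoffMillis, ""));
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey(500L, "a"), "old row");
        table.put(rowKey(1000L, "b"), "row at the cutoff");
        table.put(rowKey(2000L, "c"), "new row");

        // Candidates for incremental processing.
        System.out.println(newerThan(table, 1000L).keySet());
        // Candidates for deletion.
        System.out.println(olderThan(table, 1000L).keySet());
    }
}
```

One trade-off of this layout is that all new writes land at the same end of the key space, and a row that is updated later keeps its original key unless it is rewritten under a new one.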
