hbase-user mailing list archives

From schubert zhang <zson...@gmail.com>
Subject Re: MR Job question
Date Wed, 04 Mar 2009 11:31:01 GMT
In my job, I can give the MR job a startRow and an endRow, i.e. a row
range. The MR job then scans only the region(s) in that range, and should
not need to scan from the beginning of the table or tablet/region to the end.

So, Slava, you should modify the code of your MR job to do what you want.
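
To illustrate the idea of limiting a job to the regions in a row range, here is a minimal sketch in plain Java. It is not HBase code: the Region class, the regionsInRange method, and the sample keys are hypothetical names made up for this example, and it assumes regions are contiguous [startKey, endKey) ranges where an empty key means "unbounded". It compares keys as Java strings, whereas HBase compares row keys as bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of region pruning by row range; not the HBase API.
final class RegionRangeSketch {

    /** One region covering row keys in [startKey, endKey). "" means unbounded. */
    static final class Region {
        final String startKey;
        final String endKey;

        Region(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }

        @Override
        public String toString() {
            return "[" + startKey + ", " + endKey + ")";
        }
    }

    /**
     * Returns the regions that overlap the row range [startRow, endRow).
     * An empty endRow means "to the end of the table".
     */
    static List<Region> regionsInRange(List<Region> regions, String startRow, String endRow) {
        List<Region> hits = new ArrayList<>();
        for (Region r : regions) {
            // The region must end after the range starts...
            boolean endsAfterStart = r.endKey.isEmpty() || r.endKey.compareTo(startRow) > 0;
            // ...and must start before the range ends.
            boolean startsBeforeEnd = endRow.isEmpty() || r.startKey.compareTo(endRow) < 0;
            if (endsAfterStart && startsBeforeEnd) {
                hits.add(r);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Region> table = Arrays.asList(
                new Region("", "g"),
                new Region("g", "n"),
                new Region("n", "t"),
                new Region("t", ""));
        // Only the two middle regions overlap [h, p).
        System.out.println(regionsInRange(table, "h", "p")); // prints [[g, n), [n, t)]
    }
}
```

In a real job, each selected region would become one input split, and the scanner for that split would be opened on the overlap of the region with the requested row range, so rows outside the range are never read.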

Schubert

On Wed, Mar 4, 2009 at 4:58 PM, Slava Gorelik <slava.gorelik@gmail.com> wrote:

> Hi. I'm a little confused.
>
> Please correct me if I'm wrong, but the MR job itself is "scanning" all rows
> in the table. The job is spread across the region servers, in multiple
> threads. Each thread gets some part of the rows that are placed on a
> particular region server. So the MR job is finished when all threads have
> passed over all rows. Filtering will help the MR job only to filter out
> non-relevant rows, but those rows will still be checked (passed to the
> filter) anyway. This does not help a lot; the job still passes over all rows
> in the table. Calling a scanner inside the MR job will not prevent the job
> from passing over all rows; it will simply make the job heavier (as I
> understand it). Is that correct, Michael?
>
> So, my question is: how can I tell the MR job to pass over only some rows
> and not all rows?
>
> Thank You and Best Regards.
> Slava.
>
>
> On Wed, Mar 4, 2009 at 8:57 AM, stack <stack@duboce.net> wrote:
>
> > On Tue, Mar 3, 2009 at 6:17 PM, schubert zhang <zsongbo@gmail.com> wrote:
> >
> > > Yes, we can tell the HBase API to scan only rows that start with a given key.
> > >
> >
> > Would the filtering feature help here?
> >
> >
> >
> > http://hadoop.apache.org/hbase/docs/r0.19.0/api/org/apache/hadoop/hbase/filter/package-summary.html#package_description
> >
> > Scanners can be passed a filter (read the description section at the
> > above URL).
> >
> >
> > > Can any expert share their ideas about:
> > > 1. If the rowkey is not chronological, how can I only process the newly
> > > added/updated rows?
> >
> >
> > We don't have a means of asking for versions before a timestamp, only
> > older. (Can you add a timestamp to your row key if you need this?)
> >
> >
> > > 2. How can I remove the old rows that were inserted three months ago?
> > >
> >
> > See above.
> >
> > St.Ack
> >
>
