hbase-user mailing list archives

From Slava Gorelik <slava.gore...@gmail.com>
Subject Re: MR Job question
Date Wed, 04 Mar 2009 08:58:42 GMT
Hi. I'm a little confused.

Please correct me if I'm wrong, but the MR job itself is "scanning" all rows
in the table. The job is spread across the region servers and runs in
multiple threads. Each thread gets some of the rows stored on a
particular region server, so the MR job finishes once all
threads have passed over all rows. Filtering only helps the MR job
filter out non-relevant rows; those rows are still checked (passed
to the filter), so it doesn't help much: the job still passes over all rows
in the table. Calling a scanner inside the MR job will not
prevent the job from passing over all rows; it will simply make the job
heavier (as I understand it). Is that correct, Michael?

So, my question is: how can I tell the MR job to pass over only some rows
and not all rows?

Thank You and Best Regards.
Slava.


On Wed, Mar 4, 2009 at 8:57 AM, stack <stack@duboce.net> wrote:

> On Tue, Mar 3, 2009 at 6:17 PM, schubert zhang <zsongbo@gmail.com> wrote:
>
> > Yes, we can tell the HBase API to scan only rows that start with a given key.
> >
>
> Would the filtering feature help here?
>
>
> http://hadoop.apache.org/hbase/docs/r0.19.0/api/org/apache/hadoop/hbase/filter/package-summary.html#package_description
>
> Scanners can be passed a filter (Read the description section on the above
> url).
>
>
> > Can any expert share your ideas about:
> > 1. If the rowkey is not chronological, how can I only process the newly
> > added/updated rows?
>
>
> We don't have a means of asking for versions newer than a timestamp, only older.
> (Can you add a timestamp to your row key if you need this?)
>
>
> > 2. How can I remove the old rows which are inserted three months ago?
> >
>
> See above.
>
> St.Ack
>
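
The suggestion above to put a timestamp into the row key could look roughly like the following Python sketch. It is only an illustration under assumptions: the key layout (a 13-digit zero-padded millisecond timestamp, a dash, then an id) and the helper names make_row_key, rows_since and rows_before are hypothetical, and the sorted list stands in for a table ordered by row key. The point is that with fixed-width timestamps, key order equals time order, so "new rows" and "rows older than a cutoff" each become one contiguous key range.

```python
import bisect


def make_row_key(timestamp_ms, item_id):
    # Zero-pad to a fixed width so that string order equals numeric order.
    return f"{timestamp_ms:013d}-{item_id}"


def rows_since(sorted_keys, cutoff_ms):
    """Keys whose timestamp is >= cutoff_ms (for example, newly added rows)."""
    start = bisect.bisect_left(sorted_keys, f"{cutoff_ms:013d}")
    return sorted_keys[start:]


def rows_before(sorted_keys, cutoff_ms):
    """Keys whose timestamp is < cutoff_ms (for example, rows to purge)."""
    end = bisect.bisect_left(sorted_keys, f"{cutoff_ms:013d}")
    return sorted_keys[:end]


# Hypothetical rows written at three different times.
table = sorted(
    make_row_key(ts, name)
    for ts, name in [(1000, "a"), (2000, "b"), (3000, "c")]
)

recent = rows_since(table, 2000)
old = rows_before(table, 2000)
```

One trade-off of this layout is that all new writes land at the end of the key space, which may or may not be acceptable depending on the write load.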
