hbase-user mailing list archives

From Ninad Raut <hbase.user.ni...@gmail.com>
Subject Re: Crawling Using HBase as a back end --Issue
Date Mon, 20 Apr 2009 16:37:13 GMT
NUTCH-650 looks good. I'll test it. Thanks for the direction. ...

On Mon, Apr 20, 2009 at 9:48 PM, stack <stack@duboce.net> wrote:

> Ninad:
>
> Are you using Nutch for crawling?  If not, out of interest, why not?  Have you
> seen NUTCH-650?  It works, I believe (jdcryans?).
>
> Is your PermalinkTable small, with only a few rows?  Maybe lower the size at
> which this table splits by changing the flush size and maximum file size; see
> hbase-default.xml.
>
> St.Ack
>
> On Mon, Apr 20, 2009 at 4:14 AM, Jean-Daniel Cryans <jdcryans@apache.org>
> wrote:
>
> > Ninad,
> >
> > Regarding the timeouts, I recently gave a tip in the thread "Tip when
> > scanning and spending a lot of time on each row" which should solve
> > your problem.
> >
> > Regarding your table, you should split it. In the shell, type the
> > command "tools" to see how to use the "split" command. Issue a couple
> > of them, waiting a bit between each call.
> >
> > J-D
> >
> > On Mon, Apr 20, 2009 at 5:49 AM, Ninad Raut <hbase.user.ninad@gmail.com>
> > wrote:
> > > Hi,
> > >
> > > I have been trying to crawl data using MapReduce on HBase. Here is the
> > > scenario:
> > >
> > > 1) I have a fetch list containing all the permalinks to be fetched.
> > > They are stored in a PermalinkTable.
> > >
> > > 2) A MapReduce job scans over each permalink, tries to fetch its
> > > data, and dumps it into ContentTable.
> > >
> > > Here are the issues I face:
> > >
> > > The permalink table is not split, so I have just one map running on a
> > > single machine. This nullifies the benefit of MapReduce.
> > >
> > > The MapReduce job keeps giving scanner timeout exceptions, causing task
> > > failures and further delays.
> > >
> > >
> > > If anyone can give me tips for this use case, it would really help me.
> > >
> >
>
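
St.Ack's suggestion to lower the split size can be illustrated with a rough back-of-the-envelope sketch. It assumes, as a simplification and not as HBase's actual split logic, that a table ends up with about one region per `hbase.hregion.max.filesize` bytes of data, and that each region feeds roughly one map task. The class name, the 64 MB table size, and the 8 MB setting are hypothetical; 256 MB is assumed here to be the default maximum file size.

```java
public class RegionEstimate {

    /**
     * Rough estimate of how many regions a table occupies.
     * Simplifying assumption (not HBase's real split algorithm):
     * one region per maxFileSizeBytes of data, rounded up, and
     * never fewer than one region.
     */
    public static long estimateRegions(long tableBytes, long maxFileSizeBytes) {
        if (maxFileSizeBytes <= 0) {
            throw new IllegalArgumentException("maxFileSizeBytes must be positive");
        }
        if (tableBytes <= 0) {
            return 1; // even an empty table has a single region
        }
        // Ceiling division without floating point.
        return Math.max(1, (tableBytes + maxFileSizeBytes - 1) / maxFileSizeBytes);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long tableSize = 64 * mb; // hypothetical size of a small PermalinkTable

        // With the assumed 256 MB default, the table never reaches the split size.
        System.out.println(estimateRegions(tableSize, 256 * mb)); // prints 1

        // With a hypothetical 8 MB setting, the same data spreads over more regions.
        System.out.println(estimateRegions(tableSize, 8 * mb));   // prints 8
    }
}
```

Under these assumptions a small table stays in one region, and therefore gets one map task, until the split threshold drops below the table's size. Manual splits from the shell, as J-D describes, reach the same goal without changing the configuration.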
