nutch-user mailing list archives

From AJ Chen <ajc...@web2express.org>
Subject Re: Seeking Insight into Nutch Configurations
Date Mon, 02 Aug 2010 22:08:22 GMT
Does anyone have an EC2 image that runs smoothly for >3000 domains?  If a
sample of complete nutch & hadoop configurations for distributed crawling on
EC2 were made available to the community, it would help anyone learn nutch
best practices quickly.
-aj


On Mon, Aug 2, 2010 at 1:59 PM, Scott Gonyea <scott@aitrus.org> wrote:

> By the way, can anyone tell me if there is a way to explicitly limit how
> many pages should be fetched per fetcher task?
>
> I know that one caveat here is that any single site/domain/whatever could
> exceed that limit (assuming the limit were lower than the number of pages
> from those sites).  For politeness, that limit would have to be soft.  But
> that's more than suitable, in my opinion.
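>
> (Aside: the closest existing knobs I know of are on the generate side, not
> per fetcher task - a rough, untested sketch, using Nutch 1.1-era property
> names and illustrative values:
>
>   <!-- nutch-site.xml: cap how many URLs any single host contributes
>        to a generated fetchlist -->
>   <property>
>     <name>generate.max.per.host</name>
>     <value>100</value>
>   </property>
>
> plus capping the segment itself when generating, e.g.
>
>   bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -numFetchers 20
>
> That bounds hosts and the whole segment, not each individual fetch map
> task, so the imbalance below is only limited indirectly.)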
>
> I think part of the problem is that Nutch seems to be generating some
> really unbalanced fetcher tasks.
>
> The first task (task_201008021617_0026_m_000000) had 6859 pages to fetch.
> Each higher-numbered task had fewer pages to fetch; task 000180 had only 44
> pages to fetch.
>
> This *huge* imbalance, I think, makes the tasks' run times unpredictable.
> All of my other nodes just sit around, wasting resources, while one task
> grinds through some crazy number of sites.
>
> Thanks again,
> sg
>
> On Mon, Aug 2, 2010 at 11:57 AM, Scott Gonyea <me@sgonyea.com> wrote:
>
> > Thank you very much, Andrzej.  I'm really hoping some people can share
> > some non-sensitive details of their setup.  I'm really curious about the
> > following:
> >
> > The ratio of Maps to Reduces for their nutch jobs?
> > The amount of memory that they allocate to each job task?
> > The number of simultaneous Maps/Reduces on any given host?
> > The number of fetcher threads they execute?
> >
> > Any config setup people can share would be great, so I can get a different
> > perspective on how people set up their nutch-site and mapred-site files.
> >
> > At the moment, I'm experimenting with the following configs:
> >
> > http://gist.github.com/505065
> >
> > I'm giving each task 2048m of memory.  Up to 5 Maps and 2 Reduces run at
> > any given time.  I have Nutch firing off 181 Maps and 41 Reduces.  Those
> > are both prime numbers, but I don't know if that really matters.  I've seen
> > Hadoop say that the number of reducers should be around the number of nodes
> > you have (the nearest prime).  I've seen, somewhere, some suggestions that
> > Nutch maps/reduces be anywhere from 1:0.93 to 1:1.25.  Does anyone have
> > insight to share on that?
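> >
> > (Spelled out, the settings above amount to roughly this - a from-memory
> > sketch of the Hadoop 0.20-style property names, with the values I listed;
> > the real files are in the gist:
> >
> >   <!-- mapred-site.xml -->
> >   <property><name>mapred.child.java.opts</name><value>-Xmx2048m</value></property>
> >   <property><name>mapred.tasktracker.map.tasks.maximum</name><value>5</value></property>
> >   <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>2</value></property>
> >   <property><name>mapred.map.tasks</name><value>181</value></property>
> >   <property><name>mapred.reduce.tasks</name><value>41</value></property>
> >
> > so corrections to either the names or the values are very welcome.)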
> >
> > Thank you, Andrzej, for the SIGQUIT suggestion.  I forgot about that.  I'm
> > waiting for it to return to the 4th fetch step, so I can see why Nutch
> > hates me so much.
> >
> > sg
> >
> > On Mon, Aug 2, 2010 at 3:47 AM, Andrzej Bialecki <ab@getopt.org> wrote:
> >
> >> On 2010-08-02 10:17, Scott Gonyea wrote:
> >>
> >>> The big problem that I am facing, thus far, occurs on the 4th fetch.
> >>> All but 1 or 2 maps complete. All of the running reduces stall (0.00
> >>> MB/s), presumably because they are waiting on that map to finish? I
> >>> really don't know and it's frustrating.
> >>>
> >>
> >> Yes, all map tasks need to finish before reduce tasks are able to proceed.
> >> The reason is that each reduce task receives a portion of the keyspace (and
> >> values) according to the Partitioner, and in order to prepare a nice
> >> <key, list(value)> in your reducer it needs to, well, get all the values
> >> under this key first, whichever map task produced the tuples, and then
> >> sort them.
> >>
> >> The failing tasks probably fail due to some other factor, and very likely
> >> (based on my experience) the failure is related to some particular URLs.
> >> E.g. regex URL filtering can choke on some pathological URLs, like URLs
> >> 20kB long, or containing '\0', etc.  In my experience, it's best to keep
> >> regex filtering to a minimum if you can, and use other urlfilters (prefix,
> >> domain, suffix, custom) to limit your crawling frontier.  There are simply
> >> too many ways a regex engine can lock up.
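> >>
> >> For example (a sketch only - adjust to whatever plugins you actually
> >> need), you could drop urlfilter-regex from plugin.includes in
> >> nutch-site.xml in favour of the cheaper filters:
> >>
> >>   <property>
> >>     <name>plugin.includes</name>
> >>     <value>protocol-http|urlfilter-(domain|prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>   </property>
> >>
> >> and list the allowed domains/prefixes/suffixes in domain-urlfilter.txt,
> >> prefix-urlfilter.txt and suffix-urlfilter.txt (the default file names -
> >> check nutch-default.xml for the exact properties).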
> >>
> >> Please check the logs of the failing tasks.  If you see that a task is
> >> stalled you could also log in to the node, and generate a thread dump a few
> >> times in a row (kill -SIGQUIT <pid>) - if each thread dump shows the regex
> >> processing then it's likely this is your problem.
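> >>
> >> Something along these lines works (the exact log location depends on your
> >> Hadoop setup; with the stock 0.20 layout the dumps end up in the task
> >> attempt's stdout under logs/userlogs/):
> >>
> >>   # on the node running the stuck task attempt
> >>   for i in 1 2 3; do kill -QUIT <pid>; sleep 10; done
> >>   less logs/userlogs/attempt_*/stdout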
> >>
> >>
> >>>  My scenario:
> >>>  # Sites: 10,000-30,000 per crawl
> >>>  Depth: ~5
> >>>  Content: Text is all that I care for (HTML/RSS/XML)
> >>>  Nodes: Amazon EC2 (ugh)
> >>>  Storage: I've performed crawls with HDFS and with Amazon S3.  I thought
> >>>  S3 would be more performant, yet it doesn't appear to affect matters.
> >>>  Cost vs Speed: I don't mind throwing EC2 instances at this to get it
> >>>  done quickly... but I can't imagine I need much more than 10-20 mid-size
> >>>  instances for this.
> >>>
> >>
> >> That's correct - with this number of unique sites, the maximum throughput
> >> of your crawl will ultimately be limited by the politeness limits (# of
> >> requests/site/sec).
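> >>
> >> The knobs behind those limits live in nutch-site.xml, e.g. (the defaults
> >> are roughly these - check conf/nutch-default.xml for the authoritative
> >> values):
> >>
> >>   <property><name>fetcher.server.delay</name><value>5.0</value></property>
> >>   <property><name>fetcher.threads.per.host</name><value>1</value></property>
> >>   <property><name>fetcher.threads.fetch</name><value>10</value></property>
> >>
> >> With thousands of unique hosts the best case is roughly
> >> (#hosts * threads.per.host / server.delay) requests/sec in total, no
> >> matter how many EC2 instances you add.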
> >>
> >>
> >>
> >>> Can anyone share their own experiences with the performance they've
> >>> seen?
> >>>
> >>
> >> There is a very simple benchmark in trunk/ that you could use to measure
> >> the raw performance (data processing throughput) of your EC2 cluster.  The
> >> real-life performance, though, will depend on many other factors, such as
> >> the number of unique sites, their individual speed, and (rarely) the total
> >> bandwidth at your end.
> >>
> >>
> >> --
> >> Best regards,
> >> Andrzej Bialecki     <><
> >>  ___. ___ ___ ___ _ _   __________________________________
> >> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> http://www.sigram.com  Contact: info at sigram dot com
> >>
> >>
> >
>
