nutch-user mailing list archives

From prateek sachdeva <prats86....@gmail.com>
Subject Re: Apache Nutch 1.16 Fetcher reducers?
Date Tue, 21 Jul 2020 13:13:52 GMT
Makes complete sense. Agreed that 0 reducers in the Apache Nutch fetcher won't
make sense because of the tooling that's built around it.
Answering your question - no, we have not made any changes to
FetcherOutputFormat. In fact, the whole fetch and parse job is the same as
in Apache Nutch 1.16 (Fetcher.java and ParseSegment.java). We have
built wrappers around these classes to run them using Azkaban (
https://azkaban.github.io/). And it still works if I assign 0 reducers in
the fetch phase.

Final clarification - if I set fetcher.store.content=true and
fetcher.parse=true, I don't need the parse job in my workflow, and parsing
will be done as part of the fetcher flow only?
Also, I agree with your point that if I modify FetcherOutputFormat to
include the Avro conversion step, I might get rid of that step as well. This
will save some time for sure, since the fetcher will directly create the final
Avro format that I need. So the only remaining question is whether, with
fetcher.parse=true, I can get rid of the parse job as a separate step
completely.
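
For reference, the combination I am asking about would look roughly like the
following override in conf/nutch-site.xml. This is only a sketch: the two
property names are the ones discussed in this thread, and setting them in
nutch-site.xml (rather than passing them some other way) is an assumption
about the setup.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Parse fetched documents as part of the fetch job -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
  <!-- Keep the raw content in the segment's "content" directory -->
  <property>
    <name>fetcher.store.content</name>
    <value>true</value>
  </property>
</configuration>
```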

Regards
Prateek

On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
<wastl.nagel@googlemail.com.invalid> wrote:

> Hi Prateek,
>
> (regarding 1.)
>
> It's also possible to combine fetcher.store.content=true and
> fetcher.parse=true.
> You might save some time unless the fetch job is CPU-bound - it usually is
> limited by network and RAM for buffering content.
>
> > which code are you referring to?
>
> Maybe it isn't "a lot". The SegmentReader is assuming map files, and there
> are probably some more tools which also do. If none of them is used in your
> workflow, that's fine. But if a fetcher without the reduce step should
> become the default for Nutch, we'd need to take care of all tools and also
> ensure backward compatibility.
>
>
> > FYI- I tried running with 0 reducers
>
> I assume you've also adapted FetcherOutputFormat?
>
> Btw., you could think about inlining the "avroConversion" (or parts of it)
> into FetcherOutputFormat which also could remove the need to
> store the content.
>
> Best,
> Sebastian
>
>
> On 7/21/20 11:28 AM, prateek sachdeva wrote:
> > Hi Sebastian,
> >
> > Thanks for your reply. Couple of questions -
> >
> > 1. We have customized the Apache Nutch jobs a bit: we have a separate parse
> > job (ParseSegment.java) after the fetch job (Fetcher.java). So, as suggested
> > above, if I use fetcher.store.content=false, I assume the "content" folder
> > will not be created and hence our parse job won't work, because it takes
> > the content folder as input. Also, we have added an additional step,
> > "avroConversion", which takes "parse_data", "parse_text", "content" and
> > "crawl_fetch" as input and converts them into a specific Avro schema
> > defined by us. So I think I will end up breaking a lot of things if I set
> > fetcher.store.content=false and do the parsing in the fetch phase only
> > (fetcher.parse=true).
> >
> > 2. In your earlier email, you said "a lot of code accessing the segments
> > still assumes map files". Which code are you referring to? In my use case
> > above, we are not sending the crawled output to any indexers. In the Avro
> > conversion step, we just convert the data into the Avro schema and dump it
> > to HDFS. Do you think we still need reducers in the fetch phase? FYI - I
> > tried running with 0 reducers and don't see any impact as such.
> >
> > Appreciate your help.
> >
> > Regards
> > Prateek
> >
> > On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel <wastl.nagel@googlemail.com.invalid> wrote:
> >
> >     Hi Prateek,
> >
> >     you're right there is no specific reducer used but without a reduce step
> >     the segment data isn't (re)partitioned and the data isn't sorted.
> >     This was a strong requirement once Nutch was a complete search engine
> >     and the "content" subdir of a segment was used as page cache.
> >     Getting the content from a segment is fast if the segment is partitioned
> >     in a predictable way (hash partitioning) and map files are used.
> >
> >     Well, this isn't a strong requirement anymore, since Nutch uses Solr,
> >     Elasticsearch or other index services. But a lot of code accessing
> >     the segments still assumes map files. Removing the reduce step from
> >     the fetcher would also mean a lot of work in code and tools accessing
> >     the segments, esp. to ensure backward compatibility.
> >
> >     Have you tried to run the fetcher with
> >      fetcher.parse=true
> >      fetcher.store.content=false ?
> >     This will save a lot of time and without the need to write the large
> >     raw content the reduce phase should be fast, only a small fraction
> >     (5-10%) of the fetcher map phase.
> >
> >     Best,
> >     Sebastian
> >
> >
> >     On 7/20/20 11:38 PM, prateek sachdeva wrote:
> >     > Hi Guys,
> >     >
> >     > As per Apache Nutch 1.16 Fetcher class implementation here -
> >     > https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java ,
> >     > this is a map only job. I don't see any reducer set in the Job. So my
> >     > question is why not set job.setNumReduceTasks(0) and save the time by
> >     > outputting directly to HDFS.
> >     >
> >     > Regards
> >     > Prateek
> >     >
> >
>
>
