nutch-user mailing list archives

From prateek sachdeva <prats86....@gmail.com>
Subject Re: Apache Nutch 1.16 Fetcher reducers?
Date Wed, 22 Jul 2020 14:09:35 GMT
Thanks a lot Sebastian. Yes, after checking the logs I saw a "key out of
order" exception and realized that MapFile expects entries to be in sorted
order, and MapFile is used in FetcherOutputFormat while writing data to HDFS.
I might have to create my own custom FetcherOutputFormat to allow
out-of-order writes. I will check how I can do that.

I will also try to merge parsing and avro conversion into the fetch job
directly to see if there are some improvements.
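For the record, the two properties Sebastian mentioned would be set in nutch-site.xml roughly like this (a sketch; check nutch-default.xml for the defaults in your Nutch version):

```xml
<!-- Parse during the fetch phase, and keep raw content because the
     downstream avro conversion step reads the "content" directory. -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
</property>
<property>
  <name>fetcher.store.content</name>
  <value>true</value>
</property>
```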

I have also concluded this discussion here -
https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.
So if you want to add something here, please feel free to do so.

Regards
Prateek

On Tue, Jul 21, 2020 at 7:50 PM Sebastian Nagel
<wastl.nagel@googlemail.com.invalid> wrote:

> Hi Prateek,
>
> > if I do 0 reducers in
> > the Fetch phase, I am not getting all the urls in output that I seeded in
> > input. Looks like only a few of them made it to the final output.
>
> There should be error messages in the task logs caused by output not sorted
> by URL (used as key in map files).
>
>
> >> Final clarification - If I do fetcher.store.content=true and
> >> fetcher.parse=true, I don't need that Parse Job in my workflow and
> parsing
> >> will be done as part of fetcher flow only?
>
> Yes, parsing is then done in the fetcher and the parse output is written to
> crawl_parse, parse_text and parse_data.
>
> Best,
> Sebastian
>
> On 7/21/20 3:42 PM, prateek sachdeva wrote:
> > Correcting my statement below. I just realized that if I do 0 reducers in
> > the Fetch phase, I am not getting all the urls in output that I seeded in
> > input. Looks like only a few of them made it to the final output.
> > So something is not working as expected if we use 0 reducers in the Fetch
> > phase.
> >
> > Regards
> > Prateek
> >
> > On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva <prats86.irl@gmail.com>
> > wrote:
> >
> >> Makes complete sense. Agreed that 0 reducers in apache nutch fetcher
> won't
> >> make sense because of tooling that's built around it.
> >> Answering your questions - No, we have not made any changes to
> >> FetcherOutputFormat. In fact, the whole fetcher and parse job is the
> same as
> >> that of apache nutch 1.16(Fetcher.java and ParseSegment.java). We have
> >> built wrappers around these classes to run using Azkaban (
> >> https://azkaban.github.io/). And still it works if I assign 0 reducers
> in
> >> the Fetch phase.
> >>
> >> Final clarification - If I do fetcher.store.content=true and
> >> fetcher.parse=true, I don't need that Parse Job in my workflow and
> parsing
> >> will be done as part of fetcher flow only?
> >> Also, I agree with your point that if I modify FetcherOutputFormat to
> >> include avro conversion step, I might get rid of that as well. This will
> >> save some time for sure since Fetcher will be directly creating the
> final
> >> avro format that I need. So the only question remains is that if I do
> >> fetcher.parse=true, can I get rid of parse Job as a separate step
> >> completely.
> >>
> >> Regards
> >> Prateek
> >>
> >> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
> >> <wastl.nagel@googlemail.com.invalid> wrote:
> >>
> >>> Hi Prateek,
> >>>
> >>> (regarding 1.)
> >>>
> >>> It's also possible to combine fetcher.store.content=true and
> >>> fetcher.parse=true.
> >>> You might save some time unless the fetch job is CPU-bound - it usually
> >>> is limited by network and RAM for buffering content.
> >>>
> >>>> which code are you referring to?
> >>>
> >>> Maybe it isn't "a lot". The SegmentReader is assuming map files, and
> >>> there are probably
> >>> some more tools which also do.  If nothing is used in your workflow,
> >>> that's fine.
> >>> But if a fetcher without the reduce step should become the default for
> >>> Nutch, we'd
> >>> need to take care of all tools and also ensure backward compatibility.
> >>>
> >>>
> >>>> FYI- I tried running with 0 reducers
> >>>
> >>> I assume you've also adapted FetcherOutputFormat ?
> >>>
> >>> Btw., you could think about inlining the "avroConversion" (or parts of
> >>> it) into FetcherOutputFormat which also could remove the need to
> >>> store the content.
> >>>
> >>> Best,
> >>> Sebastian
> >>>
> >>>
> >>> On 7/21/20 11:28 AM, prateek sachdeva wrote:
> >>>> Hi Sebastian,
> >>>>
> >>>> Thanks for your reply. Couple of questions -
> >>>>
> >>>> 1. We have customized apache nutch jobs a bit like this. We have a
> >>> separate parse job (ParseSegment.java) after fetch job (Fetcher.java).
> So
> >>>> as suggested above, if I use fetcher.store.content=false, I am
> assuming
> >>> the "content" folder will not be created and hence our parse job
> >>>> won't work because it takes the content folder as an input file. Also,
> >>> we have added an additional step "avroConversion" which takes input
> >>>> as "parse_data", "parse_text", "content" and "crawl_fetch" and
> converts
> >>> into a specific avro schema defined by us. So I think, I will end up
> >>>> breaking a lot of things if I add fetcher.store.content=false and do
> >>> parsing in the fetch phase only (fetcher.parse=true)
> >>>>
> >>>>
> >>>> 2. In your earlier email, you said "a lot of code accessing the
> >>> segments still assumes map files", which code are you referring to? In
> my
> >>>> use case above, we are not sending the crawled output to any indexers.
> >>> In the avro conversion step, we just convert data into avro schema
> >>>> and dump to HDFS. Do you think we still need reducers in the fetch
> >>> phase? FYI- I tried running with 0 reducers and don't see any impact as
> >>>> such.
> >>>>
> >>>> Appreciate your help.
> >>>>
> >>>> Regards
> >>>> Prateek
> >>>>
> >>>> On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel <
> >>> wastl.nagel@googlemail.com.invalid> wrote:
> >>>>
> >>>>     Hi Prateek,
> >>>>
> >>>>     you're right there is no specific reducer used but without a
> reduce
> >>> step
> >>>>     the segment data isn't (re)partitioned and the data isn't sorted.
> >>>>     This was a strong requirement once Nutch was a complete search
> >>> engine
> >>>>     and the "content" subdir of a segment was used as page cache.
> >>>>     Getting the content from a segment is fast if the segment is
> >>> partitioned
> >>>>     in a predictable way (hash partitioning) and map files are used.
> >>>>
> >>>>     Well, this isn't a strong requirement anymore, since Nutch uses
> >>> Solr,
> >>>>     Elasticsearch or other index services. But a lot of code accessing
> >>>>     the segments still assumes map files. Removing the reduce step
> from
> >>>>     the fetcher would also mean a lot of work in code and tools
> >>> accessing
> >>>>     the segments, esp. to ensure backward compatibility.
> >>>>
> >>>>     Have you tried to run the fetcher with
> >>>>      fetcher.parse=true
> >>>>      fetcher.store.content=false ?
> >>>>     This will save a lot of time and without the need to write the
> large
> >>>>     raw content the reduce phase should be fast, only a small fraction
> >>>>     (5-10%) of the fetcher map phase.
> >>>>
> >>>>     Best,
> >>>>     Sebastian
> >>>>
> >>>>
> >>>>     On 7/20/20 11:38 PM, prateek sachdeva wrote:
> >>>>     > Hi Guys,
> >>>>     >
> >>>>     > As per Apache Nutch 1.16 Fetcher class implementation here
-
> >>>>     >
> >>>
> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java
> >>> ,
> >>>>     > this is a map only job. I don't see any reducer set in the
Job.
> >>> So my
> >>>>     > question is why not set job.setNumReduceTasks(0) and save the
> >>> time by
> >>>>     > outputting directly to HDFS.
> >>>>     >
> >>>>     > Regards
> >>>>     > Prateek
> >>>>     >
> >>>>
> >>>
> >>>
> >
>
>
