nutch-user mailing list archives

From Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
Subject Re: Apache Nutch 1.16 Fetcher reducers?
Date Mon, 27 Jul 2020 09:36:30 GMT
> might have to create my own custom FetcherOutputFormat to allow out of
> order writes. I will check how I can do that.

Just replace the MapFile.Writer with a SequenceFile.Writer.
Possibly, this will require further changes.
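[Editor's note: for illustration, here is a minimal, self-contained sketch of why this swap helps. These are simplified stand-ins, not Hadoop's actual MapFile.Writer/SequenceFile.Writer classes: the point is only the contract - MapFile.Writer rejects keys that arrive out of sorted order (the "key out of order" exception seen in the fetcher task logs), while SequenceFile.Writer appends records in any order.]

```java
import java.util.ArrayList;
import java.util.List;

// Mimics MapFile.Writer: append() throws if keys are not in sorted order.
class OrderedWriter {
    private String lastKey = null;
    final List<String> records = new ArrayList<>();

    void append(String key, String value) {
        if (lastKey != null && key.compareTo(lastKey) < 0) {
            throw new IllegalStateException(
                "key out of order: " + key + " after " + lastKey);
        }
        lastKey = key;
        records.add(key + "\t" + value);
    }
}

// Mimics SequenceFile.Writer: append() accepts records in any order.
class UnorderedWriter {
    final List<String> records = new ArrayList<>();

    void append(String key, String value) {
        records.add(key + "\t" + value); // no ordering check at all
    }
}

public class WriterContract {
    public static void main(String[] args) {
        UnorderedWriter seq = new UnorderedWriter();
        seq.append("http://b.example/", "contentB");
        seq.append("http://a.example/", "contentA"); // out of order: fine
        System.out.println("sequence records: " + seq.records.size());

        OrderedWriter map = new OrderedWriter();
        map.append("http://b.example/", "contentB");
        try {
            map.append("http://a.example/", "contentA"); // out of order: throws
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // the fetcher's failure mode
        }
    }
}
```

This is why a map-only fetcher (0 reducers) works with a SequenceFile-based output format: without the reduce step nothing sorts the URLs, so only an order-insensitive writer can accept them.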

> I have also concluded this discussion here -
> https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.

Thanks for updating the discussion there!

On 7/22/20 4:09 PM, prateek sachdeva wrote:
> Thanks a lot Sebastian. Yes, after checking the logs I saw "key out of
> order exception" and realized that MapFile expects entries to be in order
> and MapFile is used in FetcherOutputFormat while writing data to HDFS. I
> might have to create my own custom FetcherOutputFormat to allow out of
> order writes. I will check how I can do that.
> 
> I will also try to merge the parsing and avro conversion into the fetch
> job directly to see if there are any improvements.
> 
> I have also concluded this discussion here -
> https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.
> So if you want to add something here, please feel free to do so.
> 
> Regards
> Prateek
> 
> On Tue, Jul 21, 2020 at 7:50 PM Sebastian Nagel
> <wastl.nagel@googlemail.com.invalid> wrote:
> 
>> Hi Prateek,
>>
>>> if I do 0 reducers in
>>> the Fetch phase, I am not getting all the urls in output that I seeded in
>>> input. Looks like only a few of them made it to the final output.
>>
>> There should be error messages in the task logs caused by output not sorted
>> by URL (used as key in map files).
>>
>>
>>>> Final clarification - If I do fetcher.store.content=true and
>>>> fetcher.parse=true, I don't need that Parse Job in my workflow and
>>>> parsing will be done as part of the fetcher flow only?
>>
>> Yes, parsing is then done in the fetcher and the parse output is written to
>> crawl_parse, parse_text and parse_data.
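[Editor's note: the two properties discussed in this thread are set in conf/nutch-site.xml. A minimal sketch of the combination confirmed above - property names are the standard Nutch ones, the values follow this thread:]

```xml
<!-- nutch-site.xml: parse inside the fetch job, no separate parse job needed -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>Parse during fetching; writes crawl_parse, parse_text
  and parse_data from the fetcher itself.</description>
</property>
<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>Store raw content in the segment's content/ directory
  (set to false if no downstream step reads it).</description>
</property>
```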
>>
>> Best,
>> Sebastian
>>
>> On 7/21/20 3:42 PM, prateek sachdeva wrote:
>>> Correcting my statement below. I just realized that if I do 0 reducers in
>>> the Fetch phase, I am not getting all the urls in output that I seeded in
>>> input. Looks like only a few of them made it to the final output.
>>> So something is not working as expected if we use 0 reducers in the Fetch
>>> phase.
>>>
>>> Regards
>>> Prateek
>>>
>>> On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva <prats86.irl@gmail.com>
>>> wrote:
>>>
>>>> Makes complete sense. Agreed that 0 reducers in the apache nutch fetcher
>>>> won't make sense because of the tooling that's built around it.
>>>> Answering your questions - No, we have not made any changes to
>>>> FetcherOutputFormat. In fact, the whole fetcher and parse job is the
>>>> same as in apache nutch 1.16 (Fetcher.java and ParseSegment.java). We
>>>> have built wrappers around these classes to run them using Azkaban
>>>> (https://azkaban.github.io/). And it still works if I assign 0 reducers
>>>> in the Fetch phase.
>>>>
>>>> Final clarification - If I do fetcher.store.content=true and
>>>> fetcher.parse=true, I don't need that Parse Job in my workflow and
>>>> parsing will be done as part of the fetcher flow only?
>>>> Also, I agree with your point that if I modify FetcherOutputFormat to
>>>> include the avro conversion step, I might get rid of that as well. This
>>>> will save some time for sure, since the Fetcher will be directly
>>>> creating the final avro format that I need. So the only question that
>>>> remains is whether, if I do fetcher.parse=true, I can get rid of the
>>>> parse Job as a separate step completely.
>>>>
>>>> Regards
>>>> Prateek
>>>>
>>>> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
>>>> <wastl.nagel@googlemail.com.invalid> wrote:
>>>>
>>>>> Hi Prateek,
>>>>>
>>>>> (regarding 1.)
>>>>>
>>>>> It's also possible to combine fetcher.store.content=true and
>>>>> fetcher.parse=true.
>>>>> You might save some time unless the fetch job is CPU-bound - it usually
>>>>> is limited by network and RAM for buffering content.
>>>>>
>>>>>> which code are you referring to?
>>>>>
>>>>> Maybe it isn't "a lot". The SegmentReader assumes map files, and
>>>>> there are probably some more tools which do as well. If none of
>>>>> that is used in your workflow, that's fine. But if a fetcher
>>>>> without the reduce step should become the default for Nutch, we'd
>>>>> need to take care of all tools and also ensure backward
>>>>> compatibility.
>>>>>
>>>>>
>>>>>> FYI- I tried running with 0 reducers
>>>>>
>>>>> I assume you've also adapted FetcherOutputFormat ?
>>>>>
>>>>> Btw., you could think about inlining the "avroConversion" (or parts of
>>>>> it) into FetcherOutputFormat which also could remove the need to
>>>>> store the content.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>
>>>>> On 7/21/20 11:28 AM, prateek sachdeva wrote:
>>>>>> Hi Sebastian,
>>>>>>
>>>>>> Thanks for your reply. Couple of questions -
>>>>>>
>>>>>> 1. We have customized the apache nutch jobs a bit, like this. We have a
>>>>>> separate parse job (ParseSegment.java) after the fetch job
>>>>>> (Fetcher.java). So as suggested above, if I use
>>>>>> fetcher.store.content=false, I am assuming the "content" folder will not
>>>>>> be created and hence our parse job won't work, because it takes the
>>>>>> content folder as an input. Also, we have added an additional step
>>>>>> "avroConversion" which takes "parse_data", "parse_text", "content" and
>>>>>> "crawl_fetch" as input and converts them into a specific avro schema
>>>>>> defined by us. So I think I will end up breaking a lot of things if I
>>>>>> add fetcher.store.content=false and do parsing in the fetch phase only
>>>>>> (fetcher.parse=true).
>>>>>>
>>>>>> 2. In your earlier email you said "a lot of code accessing the segments
>>>>>> still assumes map files" - which code are you referring to? In my use
>>>>>> case above, we are not sending the crawled output to any indexers. In
>>>>>> the avro conversion step, we just convert the data into the avro schema
>>>>>> and dump it to HDFS. Do you think we still need reducers in the fetch
>>>>>> phase? FYI - I tried running with 0 reducers and don't see any impact
>>>>>> as such.
>>>>>>
>>>>>> Appreciate your help.
>>>>>>
>>>>>> Regards
>>>>>> Prateek
>>>>>>
>>>>>> On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel
>>>>>> <wastl.nagel@googlemail.com.invalid> wrote:
>>>>>>
>>>>>>     Hi Prateek,
>>>>>>
>>>>>>     you're right, there is no specific reducer used, but without a
>>>>>>     reduce step the segment data isn't (re)partitioned and isn't
>>>>>>     sorted.
>>>>>>     This was a strong requirement once Nutch was a complete search
>>>>> engine
>>>>>>     and the "content" subdir of a segment was used as page cache.
>>>>>>     Getting the content from a segment is fast if the segment is
>>>>> partitioned
>>>>>>     in a predictable way (hash partitioning) and map files are used.
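[Editor's note: the "predictable way (hash partitioning)" mentioned above can be sketched with the formula Hadoop's default HashPartitioner uses. This standalone snippet is not Nutch code; it only shows how a URL key maps deterministically to one partition, so a segment reader knows which part file to open:]

```java
public class UrlPartition {
    // Same formula as Hadoop's default HashPartitioner: mask off the sign
    // bit of hashCode(), then take it modulo the number of reduce tasks.
    static int partitionFor(String url, int numPartitions) {
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 8;
        String url = "https://nutch.apache.org/";
        // The same URL always lands in the same part file, which is what
        // made segment lookups fast when "content" served as a page cache.
        System.out.println("partition = " + partitionFor(url, partitions));
    }
}
```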
>>>>>>
>>>>>>     Well, this isn't a strong requirement anymore, since Nutch uses
>>>>> Solr,
>>>>>>     Elasticsearch or other index services. But a lot of code accessing
>>>>>>     the segments still assumes map files. Removing the reduce step
>> from
>>>>>>     the fetcher would also mean a lot of work in code and tools
>>>>> accessing
>>>>>>     the segments, esp. to ensure backward compatibility.
>>>>>>
>>>>>>     Have you tried to run the fetcher with
>>>>>>      fetcher.parse=true
>>>>>>      fetcher.store.content=false ?
>>>>>>     This will save a lot of time and without the need to write the
>> large
>>>>>>     raw content the reduce phase should be fast, only a small fraction
>>>>>>     (5-10%) of the fetcher map phase.
>>>>>>
>>>>>>     Best,
>>>>>>     Sebastian
>>>>>>
>>>>>>
>>>>>>     On 7/20/20 11:38 PM, prateek sachdeva wrote:
>>>>>>     > Hi Guys,
>>>>>>     >
>>>>>>     > As per the Apache Nutch 1.16 Fetcher class implementation here -
>>>>>>     > https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java
>>>>>>     > this is a map-only job. I don't see any reducer set in the job. So my
>>>>>>     > question is why not set job.setNumReduceTasks(0) and save the time by
>>>>>>     > outputting directly to HDFS.
>>>>>>     >
>>>>>>     > Regards
>>>>>>     > Prateek
>>>>>>     >
>>>>>>
>>>>>
>>>>>
>>>
>>
>>
> 

