nutch-user mailing list archives

From Shashanka Balakuntala <shbalakunt...@gmail.com>
Subject Re: Reconfiguring scoring plugin
Date Thu, 23 Jul 2020 17:23:11 GMT
Hi Patrick,
Yes, I did want to mention that it will not affect previous fetch lists.
Sorry for the confusion.

Thanks,
Shashanka Balakuntala


On Thu, 23 Jul 2020, 22:40 Patrick Mézard, <patrick@mezard.eu> wrote:

> Hello,
>
> On 23/07/2020 14:37, Shashanka Balakuntala wrote:
> > Hi Patrick,
> >
> > Yes, the idea that you have suggested would work, but I do have to mention
> > that it might just affect the next iteration. So you can just clean the
> > last parse segment, parse again and run updatedb with the plugins activated,
> > and that would do.
>
> I do not follow you. How could the similarity scores of all documents be
> collected and used by updatedb without reparsing all content? From what I
> see, the similarity scorer operates during the parse phase and the score
> should be recorded in crawl_parse.
>
> > Deleting all the parsed segments might not work, because a URL with a
> > score less than the threshold will not be generated or fetched, so none of
> > its outlinks will be fetched either. So if you just delete the parse segment
> > and do the process, it would mean that all the already fetched segments will
> > not be impacted. So it will update the scoring; if you just need the score
> > for something else, please do go ahead with this.
>
> Again, I am confused. My mental model is:
>
> - Delete and reparse everything. It means similarity scores are taken into
> account and included in the crawl_parse of all segments.
> - Run updatedb on all segments. CrawlDatum entries will be gathered by
> "url" and some final score will be generated in the reduce phase, probably
> favoring the more recent score.
>
> Now, maybe the existing crawldb might interfere during the final merge and
> I should clear it somehow, but otherwise, once the similarity scores are
> reflected in the updated crawldb, the next generate phase will take them into
> account.
>
> Obviously, they will not retroactively affect the previous fetch lists. Is
> that what you were trying to tell me?
>
> Thanks for your comments,
> --
> Patrick Mézard
>
> > Let's see if anyone has any other items to add or clear here.
> >
> > *Regards*
> >    Shashanka Balakuntala Srinivasa
> >
> >
> >
> > On Thu, Jul 23, 2020 at 2:40 PM Patrick Mézard <patrick@mezard.eu> wrote:
> >
> >> Hello,
> >>
> >> I have crawled a first document set using a combination of depth and opic
> >> scoring plugins. I would like to add the similarity scoring plugin but
> >> obviously the crawldb scores should be updated for it and following
> >> "generate" phases to be effective. Is there a recommended approach to
> >> achieve this?
> >>
> >> My current understanding is that since the similarity plugin operates in
> >> the parse phase, I would have to remove all parsed data from segments,
> >> re-parse them and updatedb? Would that work? Is there anything smarter?
> >>
> >> Thanks,
> >> --
> >> Patrick Mézard
> >>
> >
>
>
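
For reference, the "delete parsed data, re-parse, updatedb" approach discussed in the thread could be sketched roughly as below. This is only a sketch under stated assumptions, not a verified procedure: it assumes Nutch 1.x, segments on the local filesystem, and that the similarity scoring plugin has already been enabled in plugin.includes. The paths (crawl/crawldb, crawl/segments), the DRY_RUN switch, and the run and reparse_all helper names are illustrative and do not come from the thread. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only. Paths, DRY_RUN and the helper names are assumptions for
# illustration; adjust them to the actual crawl layout.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Re-parse every segment, then fold the new scores into the crawldb.
reparse_all() {
    for seg in "$SEGMENTS"/*/; do
        seg="${seg%/}"
        [ -d "$seg" ] || continue
        # Drop the previous parse output so the segment can be parsed again
        # with the new scoring plugin active.
        run rm -rf "$seg/crawl_parse" "$seg/parse_data" "$seg/parse_text"
        run "$NUTCH" parse "$seg"
    done
    # Update the crawldb from all segments in one pass.
    run "$NUTCH" updatedb "$CRAWLDB" -dir "$SEGMENTS"
}

reparse_all
```

Running it with DRY_RUN=0 would actually delete the parse directories, so checking the printed commands first seems prudent. As noted in the thread, this would only influence later generate phases, not fetch lists that were already produced.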
