nutch-user mailing list archives

From Patrick Mézard <patr...@mezard.eu>
Subject Re: Reconfiguring scoring plugin
Date Thu, 23 Jul 2020 17:31:56 GMT
Thanks for the clarification.
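For the record, if I end up re-running this, my plan based on the thread below would look roughly like the following. The paths (crawl/crawldb, crawl/segments) are my assumptions; adjust them to your layout, and the similarity plugin must of course be enabled in plugin.includes first.

```shell
# Sketch of the re-scoring workflow, assuming a crawl layout under ./crawl
# and scoring-similarity enabled in plugin.includes (paths are assumptions).

# 1. Remove the parsed data from each segment so it can be re-parsed.
for seg in crawl/segments/*; do
  rm -rf "$seg/crawl_parse" "$seg/parse_data" "$seg/parse_text"
done

# 2. Re-parse every segment; the similarity scorer runs during this phase
#    and records its score in crawl_parse.
for seg in crawl/segments/*; do
  bin/nutch parse "$seg"
done

# 3. Fold the new scores from all segments back into the crawldb.
bin/nutch updatedb crawl/crawldb crawl/segments/*

# 4. Subsequent generate phases will then see the updated scores.
bin/nutch generate crawl/crawldb crawl/segments
```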
--
Patrick Mézard

On 23/07/2020 19:23, Shashanka Balakuntala wrote:
> Hi Patrick,
> Yes, I did want to mention that it will not affect previous fetch lists. Sorry for the confusion.
> 
> Thanks,
> Shashanka Balakuntala
> 
> 
> On Thu, 23 Jul 2020, 22:40 Patrick Mézard <patrick@mezard.eu> wrote:
> 
>     Hello,
> 
>     On 23/07/2020 14:37, Shashanka Balakuntala wrote:
>      > Hi Patrick,
>      >
>      > Yes, the idea that you have suggested would work, but I do have to mention
>      > that it might just affect the next iteration. So you can just clean the
>      > last parse segment, parse again, and run updatedb with the plugins activated,
>      > and that would do.
> 
>     I do not follow you. How could the similarity scores of all documents be collected and used by updatedb without reparsing all content? From what I see, the similarity scorer operates during the parse phase, and the score should be recorded in crawl_parse.
> 
>      > Deleting all the parsed segments might not work, because a url with a
>      > score less than the threshold will not be generated or fetched, so none of
>      > its outlinks will be fetched either. So if you just delete the parse segment
>      > and redo the process, the already fetched segments will not be impacted.
>      > It will still update the scoring, so if you just need the score
>      > for something else, please do go ahead with this.
> 
>     Again, I am confused. My mental model is:
> 
>     - Delete and reparse everything. This means similarity scores are taken into account and included in the crawl_parse of all segments.
>     - Run updatedb on all segments. CrawlDatum entries will be gathered by "url" and some final score will be generated in the reduce phase, probably favoring the more recent score.
> 
>     Now, maybe the existing crawldb might interfere during the final merge and I should clear it somehow, but otherwise, once the similarity scores are reflected in the updated crawldb, the next generate phase will take them into account.
> 
>     Obviously, they will not retroactively affect the previous fetch lists. Is that what you were trying to tell me?
> 
>     Thanks for your comments,
>     --
>     Patrick Mézard
> 
>      > Lets see if anyone has any other items to add or clear here.
>      >
>      > *Regards*
>      >    Shashanka Balakuntala Srinivasa
>      >
>      >
>      >
>      > On Thu, Jul 23, 2020 at 2:40 PM Patrick Mézard <patrick@mezard.eu> wrote:
>      >
>      >> Hello,
>      >>
>      >> I have crawled a first document set using a combination of depth and opic
>      >> scoring plugins. I would like to add the similarity scoring plugin, but
>      >> obviously the crawldb scores need to be updated for it and the following
>      >> "generate" phases to be effective. Is there a recommended approach to
>      >> achieve this?
>      >>
>      >> My current understanding is that, since the similarity plugin operates in the
>      >> parse phase, I would have to remove all parsed data from the segments, re-parse
>      >> them, and run updatedb? Would that work? Is there anything smarter?
>      >>
>      >> Thanks,
>      >> --
>      >> Patrick Mézard
>      >>
>      >
> 

