Thanks for the clarification.
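For the record, the reparse-and-updatedb sequence we discussed maps to something like the following. This is only a sketch assuming a standard Nutch 1.x layout; the `crawl/crawldb` and `crawl/segments` paths are placeholders for the actual crawl directories:

```shell
# Remove the parse output of every segment so it can be re-parsed
# with the similarity scoring plugin activated in the configuration.
for seg in crawl/segments/*; do
  rm -rf "$seg/crawl_parse" "$seg/parse_data" "$seg/parse_text"
  bin/nutch parse "$seg"
done

# Merge the freshly written scores from crawl_parse back into the crawldb.
bin/nutch updatedb crawl/crawldb crawl/segments/*

# The next generate run then sees the updated scores.
bin/nutch generate crawl/crawldb crawl/segments
```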
--
Patrick Mézard
On 23/07/2020 19:23, Shashanka Balakuntala wrote:
> Hi Patrick,
> Yes, I did want to mention that it will not affect previous fetch lists. Sorry for the confusion.
>
> Thanks,
> Shashanka Balakuntala
>
>
> On Thu, 23 Jul 2020, 22:40 Patrick Mézard <patrick@mezard.eu> wrote:
>
> Hello,
>
> On 23/07/2020 14:37, Shashanka Balakuntala wrote:
> > Hi Patrick,
> >
> > Yes, the idea you have suggested would work, but I do have to mention
> > that it might only affect the next iteration. So you can just clean the
> > last parse segment, parse it again with the plugins activated, run
> > updatedb, and that would do.
>
> I do not follow you. How could the similarity scores of all documents be collected
> and used by updatedb without reparsing all content? From what I see, the similarity
> scorer operates during the parse phase, and the score should be recorded in crawl_parse.
>
> > Deleting all the parsed segments might not work, because a URL with a
> > score below the threshold will not be generated or fetched, so none of
> > its outlinks will be fetched either. So if you just delete the parse
> > segments and redo the process, the segments that were already fetched
> > will not be impacted. It will update the scoring, so if you just need
> > the score for something else, please do go ahead with this.
>
> Again, I am confused. My mental model is:
>
> - Delete and reparse everything. It means similarity scores are taken into account
> and included in the crawl_parse of all segments.
> - Run updatedb on all segments. CrawlDatum entries will be gathered by URL and
> some final score will be generated in the reduce phase, probably favoring the more
> recent score.
>
> Now, maybe the existing crawldb will interfere during the final merge and I should
> clear it somehow, but otherwise, once the similarity scores are reflected in the
> updated crawldb, the next generate phase will take them into account.
>
> Obviously, they will not retroactively affect the previous fetch lists. Is that
> what you were trying to tell me?
>
> Thanks for your comments,
> --
> Patrick Mézard
>
> > Let's see if anyone has any other items to add or clarify here.
> >
> > *Regards*
> > Shashanka Balakuntala Srinivasa
> >
> >
> >
> > On Thu, Jul 23, 2020 at 2:40 PM Patrick Mézard <patrick@mezard.eu> wrote:
> >
> >> Hello,
> >>
> >> I have crawled a first document set using a combination of the depth and opic
> >> scoring plugins. I would like to add the similarity scoring plugin, but
> >> obviously the crawldb scores must be updated for it and the following
> >> "generate" phases to be effective. Is there a recommended approach to
> >> achieve this?
> >>
> >> My current understanding is that, since the similarity plugin operates in the
> >> parse phase, I would have to remove all parsed data from the segments, re-parse
> >> them, and run updatedb? Would that work? Is there anything smarter?
> >>
> >> Thanks,
> >> --
> >> Patrick Mézard
> >>
> >
>