manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1562) Document removal Elastic
Date Tue, 11 Dec 2018 07:34:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716411#comment-16716411 ]

Karl Wright commented on CONNECTORS-1562:
-----------------------------------------

I tried this out using a small number of the specific seeds provided.  I started with the
following:

{code}
https://www.uantwerpen.be/en/
https://www.uantwerpen.be/en/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/convention-halls/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/convention-halls/hof-van-liere/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/university-club
https://www.uantwerpen.be/en/about-uantwerp/facts-figures
{code}

This generated seven ingestions.  I then more-or-less randomly removed a few seeds, leaving
this:

{code}
https://www.uantwerpen.be/en/
https://www.uantwerpen.be/en/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/university-club
https://www.uantwerpen.be/en/about-uantwerp/facts-figures
{code}

Rerunning produced zero deletions, and a refetch of all seven previously-ingested documents,
with no new ingestions.

Finally, I removed all the seeds and ran it again.  A deletion was logged for every indexed
document.

My quick analysis of what is happening here is this:

- ManifoldCF keeps grave markers around for hopcount tracking.  Hopcount tracking in MCF is
extremely complex, and much care is taken to avoid miscalculating the number of hops to a document,
no matter what order documents are processed in.  To make that work, documents cannot be deleted
from the queue just because their hopcount is too large; instead, quite a number of documents
are put in the queue and may or may not be fetched, depending on whether they wind up with a
low enough hopcount.
- The document deletion phase removes unreachable documents, but documents that are still in
the queue and merely have too great a hopcount are not, strictly speaking, unreachable.

In other words, the cleanup phase of a job seems to interact badly with documents that are
reachable but just have too great a hopcount; these documents seem to be overlooked for cleanup,
and will ONLY be cleaned up when they become truly unreachable.
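To make the failure mode concrete, here is a deliberately simplified sketch of the cleanup
decision as I understand it.  This is not ManifoldCF's actual code; the class, record, and
constant names are all hypothetical, and the real queue logic is far more involved:

{code}
import java.util.*;

class CleanupSketch {
    // A queued document, with the hopcount the crawler computed for it and
    // whether it is still reachable from the current seed set.
    record QueueEntry(String url, int hopcount, boolean reachableFromSeeds) {}

    static final int MAX_HOPCOUNT = 2;  // hypothetical hop limit for the job

    // Cleanup as described above: only documents that are truly unreachable
    // from the current seeds are handed to the output connector for deletion.
    static List<String> documentsToDelete(List<QueueEntry> queue) {
        List<String> deletions = new ArrayList<>();
        for (QueueEntry e : queue) {
            if (!e.reachableFromSeeds()) {
                deletions.add(e.url());
            } else if (e.hopcount() > MAX_HOPCOUNT) {
                // The overlooked case: still reachable, merely over the hop
                // limit.  The entry stays queued as a grave marker so that
                // hopcounts can be recomputed correctly later, and is never
                // added to the deletion list; the indexed copy lingers.
            }
        }
        return deletions;
    }

    public static void main(String[] args) {
        List<QueueEntry> queue = List.of(
            new QueueEntry("https://www.uantwerpen.be/en/", 0, true),
            new QueueEntry("https://www.uantwerpen.be/en/about-uantwerp/", 1, true),
            // A page whose seed was removed but which is still linked,
            // now at a hopcount above the limit:
            new QueueEntry("https://www.uantwerpen.be/en/about-uantwerp/facts-figures", 3, true),
            new QueueEntry("https://example.invalid/orphan", 9, false)
        );
        // Prints only the orphan; the over-limit page is never cleaned up.
        System.out.println(documentsToDelete(queue));
    }
}
{code}

Under this model, deleting all the seeds makes every document unreachable, which matches what
I observed: deletion of every indexed document happens only at that point.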

This is not intended behavior.  However, correcting it is a behavior change in a very complex
part of the software, and will therefore require great care to get right without breaking
something.  Because it is not something simple, you should expect me to require a couple of
weeks of elapsed time to come up with the right fix.

Furthermore, it is still true that this model is not one that I'd recommend for crawling a
web site.  The web connector is not designed to operate with hundreds of thousands of seeds;
hundreds, maybe, or thousands on a bad day, but trying to control exactly what MCF indexes
by fooling with the seed list is not what it was designed for.


> Document removal Elastic
> ------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls a website as input and outputs content to Elasticsearch
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>         Attachments: 30URLSeeds.png, 3URLSeed.png, Screenshot from 2018-12-10 14-07-46.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to keep it running even after updating it.
> After the rerun the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



