manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table
Date Tue, 05 Feb 2019 14:05:00 GMT


Karl Wright commented on CONNECTORS-1579:


The proximate cause of the problem is that there are multiple "resolutions" occurring for
one document in the JDBC crawl set.  When a connector is asked to process a document, it must
tell the framework what is to be done with it -- either it gets indexed, or it gets skipped,
or it gets deleted.  The problem is that the connector is telling the framework TWO things
for the same document.  The code in question:

    // Now, go through the original id's, and see which ones are still in the map.  These
    // did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
    for (final String documentIdentifier : fetchDocuments)
    {
      if (!seenDocuments.contains(documentIdentifier))
      {
        // Never saw it in the fetch attempt
        activities.deleteDocument(documentIdentifier);
      }
      else
      {
        // Saw it in the fetch attempt, and we might have fetched it
        final String documentVersion = map.get(documentIdentifier);
        if (documentVersion != null)
        {
          // This means we did not see it (or data for it) in the result set.  Delete it!
          activities.noDocument(documentIdentifier,documentVersion);
        }
      }
    }

It's failing on the last line.  The connector thinks there is in fact no document that exists
(based on the version query you gave it), BUT based on the results of the other queries, it
thinks the document does exist (and was in fact processed).
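
To make the constraint concrete, here is a minimal sketch of a "one disposition per document" check. It is not the actual WorkerThread code: the class name DispositionTracker, the record method, and the disposition labels are hypothetical and chosen only for illustration. It shows how reporting a second outcome for the same document identifier leads to an IllegalStateException like the one in the report.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the framework's per-document disposition check. */
public class DispositionTracker {
  // Holds the single disposition reported so far for each document identifier.
  private final Map<String, String> dispositions = new HashMap<>();

  /** Records a disposition; a second one for the same document is rejected. */
  public void record(String documentIdentifier, String disposition) {
    String previous = dispositions.putIfAbsent(documentIdentifier, disposition);
    if (previous != null) {
      throw new IllegalStateException(
          "Multiple document primary component dispositions not allowed: document '"
              + documentIdentifier + "'");
    }
  }

  public static void main(String[] args) {
    DispositionTracker tracker = new DispositionTracker();
    // Assumed scenario: the data query produced a row, so the document was processed.
    tracker.record("636", "ingested");
    try {
      // The cleanup loop then also reports the same document as missing.
      tracker.record("636", "noDocument");
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In this sketch the second call fails because the document already has a recorded outcome, which matches the situation described above: one code path treats document '636' as processed while the cleanup loop treats it as absent.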

I will need to look carefully at the queries and at the connector code to figure out exactly
how that can happen, and then I can let you know whether it's a bug in the code or a bug in
your queries.  Stay tuned.

> Error when crawling a MSSQL table
> ---------------------------------
>                 Key: CONNECTORS-1579
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: JDBC connector
>    Affects Versions: ManifoldCF 2.12
>            Reporter: Donald Van den Driessche
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: 636_bb2.csv
> When I crawl an MSSQL table through the JDBC connector, I get the following error on multiple lines:
> {noformat}
> FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple document primary component dispositions not allowed: document '636'
> java.lang.IllegalStateException: Multiple document primary component dispositions not allowed: document '636'
> at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(
> at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(
> at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(
> at org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(
> at [mcf-pull-agent.jar:?]{noformat}
> I looked this error up on the internet, and it said that it might have something to do with using the same key for different lines.
>  I checked, but I couldn't find any duplicates that match any of the selected fields in the JDBC queries.
> Here are my queries:
>  Seeding query
> {code:java}
> FROM dbo.bb2
> WHERE search_url IS NOT NULL
> AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', 'application/xml', 'application/zip');
> {code}
> Version check query: none
>  Access token query: none
>  Data query: 
> {code:java}
> pk1 AS $(IDCOLUMN), 
> search_url AS $(URLCOLUMN), 
> ISNULL(content, '') AS $(DATACOLUMN),
> doc_id, 
> search_url AS url, 
> ISNULL(title, '') as title, 
> ISNULL(groups,'') as groups, 
> ISNULL(type,'') as document_type, 
> ISNULL(users, '') as users
> FROM dbo.bb2
> {code}
> The attached CSV contains the corresponding line from the table.
> [^636_bb2.csv]
> Due to this problem, the whole crawling pipeline is being held up. It keeps on retrying this line.
> Could you help me understand this error?

This message was sent by Atlassian JIRA
