manifoldcf-dev mailing list archives

From "Donald Van den Driessche (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CONNECTORS-1579) Error when crawling a MSSQL table
Date Tue, 05 Feb 2019 13:38:00 GMT

     [ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Donald Van den Driessche updated CONNECTORS-1579:
-------------------------------------------------
    Description: 
When I crawl an MSSQL table through the JDBC connector, I get the following error for multiple
rows:

 
{noformat}
FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple document primary component dispositions not allowed: document '636'
java.lang.IllegalStateException: Multiple document primary component dispositions not allowed: document '636'
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125) ~[mcf-pull-agent.jar:?]
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624) ~[mcf-pull-agent.jar:?]
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605) ~[mcf-pull-agent.jar:?]
at org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944) ~[?:?]
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
{noformat}
I searched for this error online, and what I found suggests it may be caused by the same key
being used for different rows.
I checked, but I couldn't find any duplicates in any of the fields selected in the JDBC
queries.
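
A duplicate check of this kind can be sketched as follows. It is only an illustration, not part of the job configuration, and it assumes pk1 is the only column feeding $(IDCOLUMN):

{code:sql}
-- Hypothetical diagnostic query: list every pk1 value that
-- occurs on more than one row of dbo.bb2.
SELECT pk1, COUNT(*) AS row_count
FROM dbo.bb2
GROUP BY pk1
HAVING COUNT(*) > 1;
{code}
An empty result means each pk1 value belongs to a single row, so the data query should return at most one row per identifier.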

Here are my queries:
Seeding query:
{code:java}
SELECT pk1 as $(IDCOLUMN)
FROM dbo.bb2
WHERE search_url IS NOT NULL
AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', 'application/xml', 'application/zip');
{code}
Version check query: none
Access token query: none
Data query:
{code:java}
SELECT 
pk1 AS $(IDCOLUMN), 
search_url AS $(URLCOLUMN), 
ISNULL(content, '') AS $(DATACOLUMN),
doc_id, 
search_url AS url, 
ISNULL(title, '') as title, 
ISNULL(groups,'') as groups, 
ISNULL(type,'') as document_type, 
ISNULL(users, '') as users
FROM dbo.bb2
WHERE pk1 IN $(IDLIST);
{code}
The attached CSV contains the corresponding row from the table.

[^636_bb2.csv]

 

This problem is holding up the whole crawling pipeline, which keeps retrying this row.

Could you help me understand this error?

 

 

  was:
When I crawl an MSSQL table through the JDBC connector, I get the following error for multiple
rows:

 
{noformat}
FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple document primary component dispositions not allowed: document '636'
java.lang.IllegalStateException: Multiple document primary component dispositions not allowed: document '636'
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125) ~[mcf-pull-agent.jar:?]
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624) ~[mcf-pull-agent.jar:?]
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605) ~[mcf-pull-agent.jar:?]
at org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944) ~[?:?]
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
{noformat}
I searched for this error online, and what I found suggests it may be caused by the same key
being used for different rows.
I checked, but I couldn't find any duplicates in any of the fields selected in the JDBC
queries.

Here are my queries:
Seeding query:
{code:java}
SELECT pk1 as $(IDCOLUMN)
FROM dbo.bb2
WHERE search_url IS NOT NULL
AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', 'application/xml', 'application/zip');
{code}
Version check query: none
Access token query: none
Data query:
{code:java}
SELECT 
pk1 AS $(IDCOLUMN), 
search_url AS $(URLCOLUMN), 
ISNULL(content, '') AS $(DATACOLUMN),
doc_id, 
search_url AS url, 
ISNULL(title, '') as title, 
ISNULL(groups,'') as groups, 
ISNULL(type,'') as document_type, 
ISNULL(users, '') as users
FROM dbo.bb2
WHERE pk1 IN $(IDLIST);
{code}

The attached CSV contains the corresponding row from the table.

[^636_bb2.csv]

Could you help me understand this error?

 

 


> Error when crawling a MSSQL table
> ---------------------------------
>
>                 Key: CONNECTORS-1579
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1579
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: JDBC connector
>    Affects Versions: ManifoldCF 2.12
>            Reporter: Donald Van den Driessche
>            Priority: Major
>         Attachments: 636_bb2.csv
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
