manifoldcf-user mailing list archives

From Andrew Clegg <andrew.cl...@gmail.com>
Subject Re: Diagnosing "REJECTED" documents in job history
Date Sat, 02 Feb 2013 15:55:13 GMT
Thanks Karl -- I'll do a new build on Monday and go through all the
setup again from scratch to make sure I haven't left anything out.

Pretty sure I'm running against DFC as it wouldn't be able to get a
list of documents otherwise, presumably?

On 1 February 2013 18:03, Karl Wright <daddywri@gmail.com> wrote:
> I changed the ElasticSearch connector yet again, so that if it sees a
> null content type, it interprets it as "application/unknown".  At
> least then you can make some progress until you can figure out why
> there is no content type coming out of documentum.
>
> Karl
>
>
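A minimal sketch of the null-handling idea in the message above, not the actual connector code. The NullPointerException in the stack trace further down the thread is what TreeSet.contains(null) throws on a naturally ordered set, so mapping a null content type to "application/unknown" before the lookup avoids it. The class and method names here are hypothetical.

```java
import java.util.Set;

// Hypothetical helper; the real logic lives in ElasticSearchSpecs.checkMimeType.
final class MimeTypeCheck {
    // Fallback used when the repository connector reports no content type.
    static final String UNKNOWN = "application/unknown";

    // Replace a missing content type with the fallback value.
    public static String normalize(String mimeType) {
        return (mimeType == null) ? UNKNOWN : mimeType;
    }

    // Null-safe membership test against the allowed mime types.
    public static boolean isIndexable(Set<String> allowed, String mimeType) {
        return allowed.contains(normalize(mimeType));
    }
}
```

With this shape, documents without a content type are indexed only if "application/unknown" is in the allowed list; otherwise they are rejected without an exception.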
> On Fri, Feb 1, 2013 at 12:44 PM, Karl Wright <daddywri@gmail.com> wrote:
>> Are you sure that, after you updated, you are running the Documentum
>> connector server process against DFC, and not with the ManifoldCF
>> build stubs?
>>
>> The code in the connector is pretty simple; it just uses the
>> getContentType() method from the IDfSysObject that represents the
>> document.  That should be darned near foolproof.
>>
>> Karl
>>
>>
>> On Fri, Feb 1, 2013 at 12:30 PM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>> We have something called DAM instead of Webtop -- Digital Asset
>>> Manager I think? (Not a Documentum expert...)
>>>
>>> In DAM they show as "format: pdf" but it doesn't explicitly say what
>>> mimetype they are. I will escalate this to our Documentum support
>>> people, in case it isn't sending a mimetype.
>>>
>>> On 1 February 2013 16:02, Karl Wright <daddywri@gmail.com> wrote:
>>>> You can't significantly change the behavior of the documentum
>>>> connector by simply changing the configuration of the elastic search
>>>> output connector.  Did anything else change that would account for the
>>>> missing mime types?  Do you see the mime types when you look at the
>>>> documents in Webtop?
>>>>
>>>> Karl
>>>>
>>>> On Fri, Feb 1, 2013 at 10:57 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>> Now I'm back to seeing all the documents showing as REJECTED at the
>>>>> fetch stage in the job history. There's nothing in the logs to say why
>>>>> though.
>>>>>
>>>>> I guess this means it's Documentum's fault for sending docs without
>>>>> mime types then?
>>>>>
>>>>> Thanks again for all your help!
>>>>>
>>>>> On 1 February 2013 15:14, Karl Wright <daddywri@gmail.com> wrote:
>>>>>> OK, I've checked in a fix to trunk.
>>>>>>
>>>>>> Please synch up and try again.
>>>>>> Karl
>>>>>>
>>>>>> On Fri, Feb 1, 2013 at 10:10 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>> The problem is that there are some documents you are indexing that
>>>>>>> have no mime type set at all.  The ElasticSearch connector is not
>>>>>>> handling that case properly.  I've opened ticket CONNECTORS-637, and
>>>>>>> will fix it shortly.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> The extended logging has helped me find the next problem :-)
>>>>>>>>
>>>>>>>> Now I'm seeing hundreds of exceptions like this in the manifold log:
>>>>>>>>
>>>>>>>>
>>>>>>>> FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
>>>>>>>> java.lang.NullPointerException
>>>>>>>>         at java.util.TreeMap.getEntry(TreeMap.java:324)
>>>>>>>>         at java.util.TreeMap.containsKey(TreeMap.java:209)
>>>>>>>>         at java.util.TreeSet.contains(TreeSet.java:217)
>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)
>>>>>>>>
>>>>>>>>
>>>>>>>> There'll be a whole batch, then a pause, then another batch. I suspect
>>>>>>>> this is because MCF is retrying?
>>>>>>>>
>>>>>>>> My theory about this is that Documentum is returning the mime type as
>>>>>>>> just "pdf" instead of "application/pdf" -- although I did add "pdf" as
>>>>>>>> an allowed mime type in the ElasticSearch page of the job config, just
>>>>>>>> to see if it would parse this ok.
>>>>>>>>
>>>>>>>> Do you know if there's any way to map from a source's content type to
>>>>>>>> a destination's content type?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 31 January 2013 23:09, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>> I just chased down and fixed a problem in trunk.  ElasticSearch is now
>>>>>>>>> returning a 201 code for successful indexing in some cases, and the
>>>>>>>>> connector was not handling that as 'success'.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
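A rough sketch of the status check described above, assuming the fix amounts to accepting 201 (Created) alongside 200 (OK). The class and method names are hypothetical; the connector's own checkResultCode implementation is not shown in this thread.

```java
// Hypothetical stand-in for the connector's result-code check.
final class IndexResponseCheck {
    // Treat both 200 (OK) and 201 (Created) as a successful indexing call.
    public static boolean isSuccess(int statusCode) {
        return statusCode == 200 || statusCode == 201;
    }
}
```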
>>>>>>>>> On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>> Please let me know if you see any problems.  I'll fix anything you
>>>>>>>>>> find as quickly as I can.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>> Great, thanks, I'll give it a try.
>>>>>>>>>>>
>>>>>>>>>>> On 30 January 2013 18:52, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>> I just checked in a refactoring to trunk that should improve Elastic
>>>>>>>>>>>> Search error reporting significantly.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>> I agree that the Elastic Search connector needs far better logging and
>>>>>>>>>>>>> error handling.  CONNECTORS-629.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>>>>>>>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>>>>>>>>>>> it turns out ES index names must be all lowercase.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Never knew that before.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Slightly annoyed that ES didn't log that...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks again for your help Karl :-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My only request on the MCF front would be that it would be nice for
>>>>>>>>>>>>>> the output connector to log the actual status code and content of a
>>>>>>>>>>>>>> non-successful HTTP response.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
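A small sketch of a pre-flight check for the lowercase rule mentioned above, so that a name like DocumentumRoW is caught before any document is posted. The class and method names are hypothetical, and this covers only the lowercase requirement; ElasticSearch may enforce other naming rules that are not checked here.

```java
import java.util.Locale;

// Hypothetical validator for the index name configured on the output connection.
final class IndexNameCheck {
    // True only if the name is non-empty and contains no uppercase characters.
    public static boolean isLowercase(String indexName) {
        return indexName != null
            && !indexName.isEmpty()
            && indexName.equals(indexName.toLowerCase(Locale.ROOT));
    }
}
```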
>>>>>>>>>>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>>>>>>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>>>>>>>>>>> elasticsearch.log either...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 30 January 2013 14:16, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>>>>>>>>>>> being posted.  The connector is seeing a non-200 HTTP response, and
>>>>>>>>>>>>>>>> throwing an exception as a result:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>       if (!checkResultCode(method.getStatusCode()))
>>>>>>>>>>>>>>>>         throw new ManifoldCFException(getResultDescription());
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>>>>>>>>>>> code is, but you did not include that key info.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Thanks for all your help Karl!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> First, which version of ManifoldCF is this?  I need to know that
>>>>>>>>>>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>>>>>>>>>>> UI?  Does it say "Connection working", or something else, and if so,
>>>>>>>>>>>>>>>>>> what?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>>>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>>>>>>>>>>> at this point, but I can fix that quickly with your help. ;-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>>>>>>>>>>> and which are of type "Indexation" that say "success"?  If that is the
>>>>>>>>>>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>>>>>>>>>>> line on that server and curl that URL successfully.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>
>>>
>>>
>>> --
>>>
>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
