manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Diagnosing "REJECTED" documents in job history
Date Fri, 01 Feb 2013 18:03:23 GMT
I changed the ElasticSearch connector yet again, so that if it sees a
null content type, it interprets it as "application/unknown".  At
least then you can make some progress until you can figure out why
there is no content type coming out of documentum.
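
For reference, a minimal sketch of that fallback in Java. This is not the
actual connector code: the class name MimeTypeCheck, its method names, and
the allowed-types set are all hypothetical, chosen only to illustrate the
idea. A TreeSet with natural ordering throws a NullPointerException from
contains(null), which matches the stack trace quoted below, so the sketch
substitutes "application/unknown" before doing the lookup.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of ManifoldCF.
class MimeTypeCheck {
    // Fallback used when the repository reports no content type at all.
    static final String UNKNOWN_MIME_TYPE = "application/unknown";

    // Map a null content type to the fallback; pass anything else through.
    static String normalize(String mimeType) {
        return (mimeType == null) ? UNKNOWN_MIME_TYPE : mimeType;
    }

    // Null-safe membership test. Without normalize(), a TreeSet using
    // natural ordering would throw NullPointerException on contains(null).
    static boolean isIndexable(Set<String> allowedMimeTypes, String mimeType) {
        return allowedMimeTypes.contains(normalize(mimeType));
    }
}
```

With a check like this, a document with a null content type is rejected
unless "application/unknown" is in the allowed set, rather than causing an
exception during the mime type check.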

Karl


On Fri, Feb 1, 2013 at 12:44 PM, Karl Wright <daddywri@gmail.com> wrote:
> Are you sure that, after you updated, you are running the Documentum
> connector server process against DFC, and not with the ManifoldCF
> build stubs?
>
> The code in the connector is pretty simple; it just uses the
> getContentType() method from the IDfSysObject that represents the
> document.  That should be darned near foolproof.
>
> Karl
>
>
> On Fri, Feb 1, 2013 at 12:30 PM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>> We have something called DAM instead of Webtop -- Digital Asset
>> Manager I think? (Not a Documentum expert...)
>>
>> In DAM they show as "format: pdf" but it doesn't explicitly say what
>> mimetype they are. I will escalate this to our Documentum support
>> people, in case it isn't sending a mimetype.
>>
>> On 1 February 2013 16:02, Karl Wright <daddywri@gmail.com> wrote:
>>> You can't significantly change the behavior of the documentum
>>> connector by simply changing the configuration of the elastic search
>>> output connector.  Did anything else change that would account for the
>>> missing mime types?  Do you see the mime types when you look at the
>>> documents in Webtop?
>>>
>>> Karl
>>>
>>> On Fri, Feb 1, 2013 at 10:57 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>> Now I'm back to seeing all the documents showing as REJECTED at the
>>>> fetch stage in the job history. There's nothing in the logs to say why
>>>> though.
>>>>
>>>> I guess this means it's Documentum's fault for sending docs without
>>>> mime types then?
>>>>
>>>> Thanks again for all your help!
>>>>
>>>> On 1 February 2013 15:14, Karl Wright <daddywri@gmail.com> wrote:
>>>>> OK, I've checked in a fix to trunk.
>>>>>
>>>>> Please synch up and try again.
>>>>> Karl
>>>>>
>>>>> On Fri, Feb 1, 2013 at 10:10 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>> The problem is that there are some documents you are indexing that
>>>>>> have no mime type set at all.  The ElasticSearch connector is not
>>>>>> handling that case properly.  I've opened ticket CONNECTORS-637, and
>>>>>> will fix it shortly.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> The extended logging has helped me find the next problem :-)
>>>>>>>
>>>>>>> Now I'm seeing hundreds of exceptions like this in the manifold log:
>>>>>>>
>>>>>>>
>>>>>>> FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
>>>>>>> java.lang.NullPointerException
>>>>>>>         at java.util.TreeMap.getEntry(TreeMap.java:324)
>>>>>>>         at java.util.TreeMap.containsKey(TreeMap.java:209)
>>>>>>>         at java.util.TreeSet.contains(TreeSet.java:217)
>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)
>>>>>>>
>>>>>>>
>>>>>>> There'll be a whole batch, then a pause, then another batch. I suspect
>>>>>>> this is because MCF is retrying?
>>>>>>>
>>>>>>> My theory about this is that Documentum is returning the mime type as
>>>>>>> just "pdf" instead of "application/pdf" -- although I did add "pdf" as
>>>>>>> an allowed mime type in the ElasticSearch page of the job config, just
>>>>>>> to see if it would parse this ok.
>>>>>>>
>>>>>>> Do you know if there's any way to map from a source's content type to
>>>>>>> a destination's content type?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 31 January 2013 23:09, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>> I just chased down and fixed a problem in trunk.  ElasticSearch is now
>>>>>>>> returning a 201 code for successful indexing in some cases, and the
>>>>>>>> connector was not handling that as 'success'.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>> Please let me know if you see any problems.  I'll fix anything you
>>>>>>>>> find as quickly as I can.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>> Great, thanks, I'll give it a try.
>>>>>>>>>>
>>>>>>>>>> On 30 January 2013 18:52, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>> I just checked in a refactoring to trunk that should improve Elastic
>>>>>>>>>>> Search error reporting significantly.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>> I agree that the Elastic Search connector needs far better logging and
>>>>>>>>>>>> error handling.  CONNECTORS-629.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>>>>>>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>>>>>>>>>> it turns out ES index names must be all lowercase.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Never knew that before.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Slightly annoyed that ES didn't log that...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks again for your help Karl :-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> My only request on the MCF front would be that it would be nice for
>>>>>>>>>>>>> the output connector to log the actual status code and content of a
>>>>>>>>>>>>> non-successful HTTP response.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>>>>>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>>>>>>>>>> elasticsearch.log either...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30 January 2013 14:16, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>>>>>>>>>> being posted.  The connector is seeing a non-200 HTTP response, and
>>>>>>>>>>>>>>> throwing an exception as a result:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>       if (!checkResultCode(method.getStatusCode()))
>>>>>>>>>>>>>>>         throw new ManifoldCFException(getResultDescription());
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>>>>>>>>>> code is, but you did not include that key info.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>> Thanks for all your help Karl!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> First, which version of ManifoldCF is this?  I need to know that
>>>>>>>>>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>>>>>>>>>> UI?  Does it say "Connection working", or something else, and if so,
>>>>>>>>>>>>>>>>> what?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>>>>>>>>>> at this point, but I can fix that quickly with your help. ;-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>>>>>>>>>> and which are of type "Indexation" that say "success"?  If that is the
>>>>>>>>>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>>>>>>>>>> line on that server and curl that URL successfully.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
