manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Diagnosing "REJECTED" documents in job history
Date Fri, 01 Feb 2013 17:44:19 GMT
Are you sure that, after you updated, you are running the Documentum
connector server process against DFC, and not with the ManifoldCF
build stubs?

The code in the connector is pretty simple; it just uses the
getContentType() method from the IDfSysObject that represents the
document.  That should be darned near foolproof.

Karl


On Fri, Feb 1, 2013 at 12:30 PM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
> We have something called DAM instead of Webtop -- Digital Asset
> Manager I think? (Not a Documentum expert...)
>
> In DAM they show as "format: pdf" but it doesn't explicitly say what
> mimetype they are. I will escalate this to our Documentum support
> people, in case it isn't sending a mimetype.
>
> On 1 February 2013 16:02, Karl Wright <daddywri@gmail.com> wrote:
>> You can't significantly change the behavior of the documentum
>> connector by simply changing the configuration of the elastic search
>> output connector.  Did anything else change that would account for the
>> missing mime types?  Do you see the mime types when you look at the
>> documents in Webtop?
>>
>> Karl
>>
>> On Fri, Feb 1, 2013 at 10:57 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>> Now I'm back to seeing all the documents showing as REJECTED at the
>>> fetch stage in the job history. There's nothing in the logs to say why
>>> though.
>>>
>>> I guess this means it's Documentum's fault for sending docs without
>>> mime types then?
>>>
>>> Thanks again for all your help!
>>>
>>> On 1 February 2013 15:14, Karl Wright <daddywri@gmail.com> wrote:
>>>> OK, I've checked in a fix to trunk.
>>>>
>>>> Please synch up and try again.
>>>> Karl
>>>>
>>>> On Fri, Feb 1, 2013 at 10:10 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>> The problem is that there are some documents you are indexing that
>>>>> have no mime type set at all.  The ElasticSearch connector is not
>>>>> handling that case properly.  I've opened ticket CONNECTORS-637, and
>>>>> will fix it shortly.
>>>>>
>>>>> Karl
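
For reference, the stack trace quoted further down ends in TreeSet.contains(), which throws a NullPointerException when it is handed null under natural ordering. Below is a minimal sketch of a null-safe check. The class MimeTypeFilter and its method names are hypothetical; this is not the ElasticSearchSpecs code or the CONNECTORS-637 fix, and rejecting documents that have no mime type is an assumed policy.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not part of ManifoldCF. */
final class MimeTypeFilter {
    // TreeSet with natural ordering: contains(null) would throw NullPointerException.
    private final Set<String> allowed = new TreeSet<>();

    MimeTypeFilter(String... allowedMimeTypes) {
        for (String mimeType : allowedMimeTypes) {
            allowed.add(mimeType.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns false for a missing mime type instead of passing null to the set. */
    boolean isIndexable(String mimeType) {
        if (mimeType == null) {
            return false; // assumed policy: no mime type means "not indexable"
        }
        return allowed.contains(mimeType.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        MimeTypeFilter filter = new MimeTypeFilter("application/pdf", "text/plain");
        System.out.println(filter.isIndexable("application/pdf"));
        System.out.println(filter.isIndexable(null));
    }
}
```

With a guard like this, a document without a mime type is simply skipped rather than causing the worker thread to fail.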
>>>>>
>>>>> On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>> Hi Karl,
>>>>>>
>>>>>> The extended logging has helped me find the next problem :-)
>>>>>>
>>>>>> Now I'm seeing hundreds of exceptions like this in the manifold log:
>>>>>>
>>>>>>
>>>>>> FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
>>>>>> java.lang.NullPointerException
>>>>>>         at java.util.TreeMap.getEntry(TreeMap.java:324)
>>>>>>         at java.util.TreeMap.containsKey(TreeMap.java:209)
>>>>>>         at java.util.TreeSet.contains(TreeSet.java:217)
>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)
>>>>>>
>>>>>>
>>>>>> There'll be a whole batch, then a pause, then another batch. I suspect
>>>>>> this is because MCF is retrying?
>>>>>>
>>>>>> My theory about this is that Documentum is returning the mime type as
>>>>>> just "pdf" instead of "application/pdf" -- although I did add "pdf" as
>>>>>> an allowed mime type in the ElasticSearch page of the job config, just
>>>>>> to see if it would parse this ok.
>>>>>>
>>>>>> Do you know if there's any way to map from a source's content type to
>>>>>> a destination's content type?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 31 January 2013 23:09, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>> I just chased down and fixed a problem in trunk.  ElasticSearch is now
>>>>>>> returning a 201 code for successful indexing in some cases, and the
>>>>>>> connector was not handling that as 'success'.
>>>>>>>
>>>>>>> Karl
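
A sketch of the kind of status check involved, assuming any 2xx code (200 OK, 201 Created, and so on) should count as success. HttpResult, isSuccess and describe are hypothetical names; the thread does not say how the actual change in trunk is written. The describe helper also illustrates the request made earlier in the thread to log the status code and body of a failed response.

```java
/** Hypothetical helper for illustration only; not part of ManifoldCF. */
final class HttpResult {
    private static final int MAX_BODY_CHARS = 200;

    private HttpResult() {
    }

    /** Treats every 2xx status as success, not only 200. */
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    /** Builds a log-friendly description with the status code and a truncated body. */
    static String describe(int statusCode, String body) {
        String text = (body == null) ? "" : body;
        if (text.length() > MAX_BODY_CHARS) {
            text = text.substring(0, MAX_BODY_CHARS) + "...";
        }
        return "HTTP " + statusCode + ": " + text;
    }

    public static void main(String[] args) {
        System.out.println(isSuccess(201));
        System.out.println(describe(400, "index name must be lowercase"));
    }
}
```

Putting the description into the exception message would give the log something more useful than a bare stack trace.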
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>> Please let me know if you see any problems.  I'll fix anything you
>>>>>>>> find as quickly as I can.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>> Great, thanks, I'll give it a try.
>>>>>>>>>
>>>>>>>>> On 30 January 2013 18:52, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>> I just checked in a refactoring to trunk that should improve Elastic
>>>>>>>>>> Search error reporting significantly.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>> I agree that the Elastic Search connector needs far better logging and
>>>>>>>>>>> error handling.  CONNECTORS-629.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>>>>>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>>>>>>>>> it turns out ES index names must be all lowercase.
>>>>>>>>>>>>
>>>>>>>>>>>> Never knew that before.
>>>>>>>>>>>>
>>>>>>>>>>>> Slightly annoyed that ES didn't log that...
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks again for your help Karl :-)
>>>>>>>>>>>>
>>>>>>>>>>>> My only request on the MCF front would be that it would be nice for
>>>>>>>>>>>> the output connector to log the actual status code and content of a
>>>>>>>>>>>> non-successful HTTP response.
>>>>>>>>>>>>
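
A quick pre-flight check along these lines would have caught the DocumentumRoW problem before any document was posted. This is only a sketch: IndexNames and its methods are hypothetical names, and it tests nothing but the lowercase rule mentioned above, not any other restriction Elasticsearch may place on index names.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; checks the lowercase rule and nothing else. */
final class IndexNames {
    private IndexNames() {
    }

    /** True when the name contains no uppercase characters. */
    static boolean isLowercase(String indexName) {
        return indexName.equals(indexName.toLowerCase(Locale.ROOT));
    }

    /** Returns the name unchanged, or fails early with a readable message. */
    static String requireLowercase(String indexName) {
        if (!isLowercase(indexName)) {
            throw new IllegalArgumentException(
                "Index name must be all lowercase: " + indexName);
        }
        return indexName;
    }

    public static void main(String[] args) {
        System.out.println(isLowercase("documentumrow"));
        System.out.println(isLowercase("DocumentumRoW"));
    }
}
```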
>>>>>>>>>>>>
>>>>>>>>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>>>>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>>>>>>>>> elasticsearch.log either...
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 30 January 2013 14:16, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>>>>>>>>> being posted.  The connector is seeing a non-200 HTTP response, and
>>>>>>>>>>>>>> throwing an exception as a result:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>       if (!checkResultCode(method.getStatusCode()))
>>>>>>>>>>>>>>         throw new ManifoldCFException(getResultDescription());
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>>>>>>>>> code is, but you did not include that key info.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>> Thanks for all your help Karl!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First, which version of ManifoldCF is this?  I need to know that
>>>>>>>>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>>>>>>>>> UI?  Does it say "Connection working", or something else, and if so,
>>>>>>>>>>>>>>>> what?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>>>>>>>>> at this point, but I can fix that quickly with your help. ;-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>>>>>>>>> and which are of type "Indexation" that say "success"?  If that is the
>>>>>>>>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>>>>>>>>> line on that server and curl that URL successfully.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>
>>>
>>>
>>> --
>>>
>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>
>
>
> --
>
> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
