manifoldcf-user mailing list archives

From Andrew Clegg <andrew.cl...@gmail.com>
Subject Re: Diagnosing "REJECTED" documents in job history
Date Fri, 01 Feb 2013 17:30:14 GMT
We have something called DAM instead of Webtop -- Digital Asset
Manager, I think? (Not a Documentum expert...)

In DAM they show as "format: pdf" but it doesn't explicitly say what
mimetype they are. I will escalate this to our Documentum support
people, in case it isn't sending a mimetype.

On 1 February 2013 16:02, Karl Wright <daddywri@gmail.com> wrote:
> You can't significantly change the behavior of the documentum
> connector by simply changing the configuration of the elastic search
> output connector.  Did anything else change that would account for the
> missing mime types?  Do you see the mime types when you look at the
> documents in Webtop?
>
> Karl
>
> On Fri, Feb 1, 2013 at 10:57 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>> Now I'm back to seeing all the documents showing as REJECTED at the
>> fetch stage in the job history. There's nothing in the logs to say why
>> though.
>>
>> I guess this means it's Documentum's fault for sending docs without
>> mime types then?
>>
>> Thanks again for all your help!
>>
>> On 1 February 2013 15:14, Karl Wright <daddywri@gmail.com> wrote:
>>> OK, I've checked in a fix to trunk.
>>>
>>> Please synch up and try again.
>>> Karl
>>>
>>> On Fri, Feb 1, 2013 at 10:10 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>> The problem is that there are some documents you are indexing that
>>>> have no mime type set at all.  The ElasticSearch connector is not
>>>> handling that case properly.  I've opened ticket CONNECTORS-637, and
>>>> will fix it shortly.
>>>>
>>>> Karl
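The stack trace quoted further down ends in TreeSet.contains, and a TreeSet with natural ordering throws a NullPointerException when asked whether it contains null. A minimal sketch of a null-safe check follows. The class and method names are hypothetical and this is not the actual ElasticSearchSpecs code or the CONNECTORS-637 fix. Rejecting a missing mime type (rather than accepting it) is an assumption made for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

public class MimeTypeCheckSketch {
    // Hypothetical helper: returns true only when a mime type is present
    // and is listed in the allowed set.
    static boolean isMimeTypeAllowed(Set<String> allowed, String mimeType) {
        // TreeSet.contains(null) throws NullPointerException with natural
        // ordering, so handle the "no mime type" case before the lookup.
        if (mimeType == null) {
            return false; // assumption: treat a missing mime type as not indexable
        }
        return allowed.contains(mimeType);
    }

    public static void main(String[] args) {
        Set<String> allowed = new TreeSet<>();
        allowed.add("application/pdf");
        System.out.println(isMimeTypeAllowed(allowed, "application/pdf")); // true
        System.out.println(isMimeTypeAllowed(allowed, null));              // false
    }
}
```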
>>>>
>>>> On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>> Hi Karl,
>>>>>
>>>>> The extended logging has helped me find the next problem :-)
>>>>>
>>>>> Now I'm seeing hundreds of exceptions like this in the manifold log:
>>>>>
>>>>>
>>>>> FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
>>>>> java.lang.NullPointerException
>>>>>         at java.util.TreeMap.getEntry(TreeMap.java:324)
>>>>>         at java.util.TreeMap.containsKey(TreeMap.java:209)
>>>>>         at java.util.TreeSet.contains(TreeSet.java:217)
>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)
>>>>>
>>>>>
>>>>> There'll be a whole batch, then a pause, then another batch. I suspect
>>>>> this is because MCF is retrying?
>>>>>
>>>>> My theory about this is that Documentum is returning the mime type as
>>>>> just "pdf" instead of "application/pdf" -- although I did add "pdf" as
>>>>> an allowed mime type in the ElasticSearch page of the job config, just
>>>>> to see if it would parse this ok.
>>>>>
>>>>> Do you know if there's any way to map from a source's content type to
>>>>> a destination's content type?
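The thread does not say whether ManifoldCF offers such a mapping. Purely as an illustration of the idea, the sketch below maps a short format name such as "pdf" to a full mime type. Everything in it is hypothetical: the class, the method, and every table entry other than the pdf / application/pdf pair mentioned above are assumptions, not Documentum or ManifoldCF behaviour.

```java
import java.util.Locale;
import java.util.Map;

public class FormatMappingSketch {
    // Illustrative table only. "pdf" -> "application/pdf" comes from the
    // thread; the other entries are assumed examples.
    private static final Map<String, String> FORMAT_TO_MIME = Map.of(
        "pdf", "application/pdf",
        "html", "text/html",
        "txt", "text/plain");

    // Returns a full mime type, or null when the source type is missing
    // or unknown.
    static String toMimeType(String sourceType) {
        if (sourceType == null) {
            return null;
        }
        String normalized = sourceType.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return null;
        }
        // Values that already look like "type/subtype" pass through unchanged.
        if (normalized.contains("/")) {
            return normalized;
        }
        return FORMAT_TO_MIME.get(normalized);
    }

    public static void main(String[] args) {
        System.out.println(toMimeType("pdf"));             // application/pdf
        System.out.println(toMimeType("application/pdf")); // application/pdf
        System.out.println(toMimeType(null));              // null
    }
}
```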
>>>>>
>>>>>
>>>>>
>>>>> On 31 January 2013 23:09, Karl Wright <daddywri@gmail.com> wrote:
>>>>>> I just chased down and fixed a problem in trunk.  ElasticSearch is now
>>>>>> returning a 201 code for successful indexing in some cases, and the
>>>>>> connector was not handling that as 'success'.
>>>>>>
>>>>>> Karl
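One way to avoid treating a 201 as a failure is to accept the whole 2xx range instead of comparing against 200 alone. The sketch below assumes that approach. The thread mentions a checkResultCode method but does not show its body, so the helper here is a hypothetical stand-in and not the connector's code.

```java
public class StatusCodeSketch {
    // Hypothetical stand-in for a result-code check: any 2xx status
    // (200 OK, 201 Created, ...) counts as success.
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    public static void main(String[] args) {
        System.out.println(isSuccess(200)); // true
        System.out.println(isSuccess(201)); // true
        System.out.println(isSuccess(400)); // false
    }
}
```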
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>> Please let me know if you see any problems.  I'll fix anything you
>>>>>>> find as quickly as I can.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>> Great, thanks, I'll give it a try.
>>>>>>>>
>>>>>>>> On 30 January 2013 18:52, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>> I just checked in a refactoring to trunk that should improve Elastic
>>>>>>>>> Search error reporting significantly.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>> I agree that the Elastic Search connector needs far better logging and
>>>>>>>>>> error handling.  CONNECTORS-629.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>>>>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>>>>>>>> it turns out ES index names must be all lowercase.
>>>>>>>>>>>
>>>>>>>>>>> Never knew that before.
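A pre-flight check on the configured index name would catch this before any document is posted. The sketch below covers only the lowercase rule described in this message. ElasticSearch may enforce further naming rules that it does not check, and the class and method names are hypothetical.

```java
import java.util.Locale;

public class IndexNameSketch {
    // Hypothetical helper: true when the name has no uppercase characters,
    // i.e. lowercasing it changes nothing.
    static boolean isAllLowercase(String indexName) {
        return indexName.equals(indexName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isAllLowercase("DocumentumRoW")); // false
        System.out.println(isAllLowercase("documentumrow")); // true
    }
}
```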
>>>>>>>>>>>
>>>>>>>>>>> Slightly annoyed that ES didn't log that...
>>>>>>>>>>>
>>>>>>>>>>> Thanks again for your help Karl :-)
>>>>>>>>>>>
>>>>>>>>>>> My only request on the MCF front would be that it would be nice for
>>>>>>>>>>> the output connector to log the actual status code and content of a
>>>>>>>>>>> non-successful HTTP response.
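A sketch of what such a message could look like is below. It builds a description from the status code and response body so that an exception or log line is never empty. The class, the method, the message wording and the truncation limit are all assumptions for illustration, and the sample body in main is a placeholder rather than real ElasticSearch output.

```java
public class FailureDescriptionSketch {
    // Assumed limit so that a very large response body does not flood the log.
    private static final int MAX_BODY_CHARS = 500;

    // Hypothetical helper: formats the status code and body of a failed
    // HTTP response into a single description string.
    static String describeFailure(int statusCode, String responseBody) {
        String body;
        if (responseBody == null || responseBody.isEmpty()) {
            body = "<empty body>";
        } else if (responseBody.length() > MAX_BODY_CHARS) {
            body = responseBody.substring(0, MAX_BODY_CHARS) + "...";
        } else {
            body = responseBody;
        }
        return "HTTP " + statusCode + ": " + body;
    }

    public static void main(String[] args) {
        // Placeholder body text, not an actual server response.
        System.out.println(describeFailure(400, "placeholder error text"));
        System.out.println(describeFailure(500, null));
    }
}
```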
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>>>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>>>>>>>> elasticsearch.log either...
>>>>>>>>>>>>
>>>>>>>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 30 January 2013 14:16, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>>>>>>>> being posted.  The connector is seeing a non-200 HTTP response, and
>>>>>>>>>>>>> throwing an exception as a result:
>>>>>>>>>>>>>
>>>>>>>>>>>>>       if (!checkResultCode(method.getStatusCode()))
>>>>>>>>>>>>>         throw new ManifoldCFException(getResultDescription());
>>>>>>>>>>>>>
>>>>>>>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>>>>>>>> code is, but you did not include that key info.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>> Thanks for all your help Karl!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> First, which version of ManifoldCF is this?  I need to know that
>>>>>>>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>>>>>>>> UI?  Does it say "Connection working", or something else, and if so,
>>>>>>>>>>>>>>> what?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>>>>>>>> at this point, but I can fix that quickly with your help. ;-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>>>>>>>> and which are of type "Indexation" that say "success"?  If that is the
>>>>>>>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>>>>>>>> line on that server and curl that URL successfully.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>
>>
>>
>> --
>>
>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
