manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Diagnosing "REJECTED" documents in job history
Date Fri, 01 Feb 2013 15:10:09 GMT
The problem is that there are some documents you are indexing that
have no mime type set at all.  The ElasticSearch connector is not
handling that case properly.  I've opened ticket CONNECTORS-637, and
will fix it shortly.

Karl
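
As an illustration of the failure mode described above (not the actual CONNECTORS-637 patch): a TreeSet that uses natural ordering throws a NullPointerException when contains() is passed null, which matches the TreeMap.getEntry frames in the quoted stack trace. A minimal sketch of a null-safe check follows. The class name, constructor, and the choice to treat a missing mime type as "not indexable" are all assumptions made for the example; the real connector may well decide the opposite.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the connector's mime type check; not ManifoldCF code.
class MimeTypeCheckSketch {
  // Allowed mime types, held in a TreeSet as the stack trace suggests.
  private final Set<String> allowedMimeTypes = new TreeSet<String>();

  MimeTypeCheckSketch(String... allowed) {
    for (String mimeType : allowed) {
      allowedMimeTypes.add(mimeType);
    }
  }

  // Returns true if the mime type is in the allowed set.
  // A null mime type is rejected up front, because TreeSet.contains(null)
  // throws NullPointerException under natural ordering.
  // Assumption: "no mime type" means "do not index".
  boolean checkMimeType(String mimeType) {
    if (mimeType == null) {
      return false;
    }
    return allowedMimeTypes.contains(mimeType);
  }

  public static void main(String[] args) {
    MimeTypeCheckSketch specs = new MimeTypeCheckSketch("application/pdf", "pdf");
    System.out.println(specs.checkMimeType("application/pdf"));
    System.out.println(specs.checkMimeType("text/html"));
    System.out.println(specs.checkMimeType(null));
  }
}
```

Whether a document without a mime type should be skipped or passed through is a policy choice; the only point of the sketch is that the null case has to be decided before the set lookup.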

On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
> Hi Karl,
>
> The extended logging has helped me find the next problem :-)
>
> Now I'm seeing hundreds of exceptions like this in the manifold log:
>
>
> FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
> java.lang.NullPointerException
>         at java.util.TreeMap.getEntry(TreeMap.java:324)
>         at java.util.TreeMap.containsKey(TreeMap.java:209)
>         at java.util.TreeSet.contains(TreeSet.java:217)
>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)
>
>
> There'll be a whole batch, then a pause, then another batch. I suspect
> this is because MCF is retrying?
>
> My theory about this is that Documentum is returning the mime type as
> just "pdf" instead of "application/pdf" -- although I did add "pdf" as
> an allowed mime type in the ElasticSearch page of the job config, just
> to see if it would parse this ok.
>
> Do you know if there's any way to map from a source's content type to
> a destination's content type?
>
>
>
> On 31 January 2013 23:09, Karl Wright <daddywri@gmail.com> wrote:
>> I just chased down and fixed a problem in trunk.  ElasticSearch is now
>> returning a 201 code for successful indexing in some cases, and the
>> connector was not handling that as 'success'.
>>
>> Karl
>>
>>
>> On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddywri@gmail.com> wrote:
>>> Please let me know if you see any problems.  I'll fix anything you
>>> find as quickly as I can.
>>>
>>> Karl
>>>
>>> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>> Great, thanks, I'll give it a try.
>>>>
>>>> On 30 January 2013 18:52, Karl Wright <daddywri@gmail.com> wrote:
>>>>> I just checked in a refactoring to trunk that should improve Elastic
>>>>> Search error reporting significantly.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>> I agree that the Elastic Search connector needs far better logging and
>>>>>> error handling.  CONNECTORS-629.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>>>> it turns out ES index names must be all lowercase.
>>>>>>>
>>>>>>> Never knew that before.
>>>>>>>
>>>>>>> Slightly annoyed that ES didn't log that...
>>>>>>>
>>>>>>> Thanks again for your help Karl :-)
>>>>>>>
>>>>>>> My only request on the MCF front would be that it would be nice for
>>>>>>> the output connector to log the actual status code and content of a
>>>>>>> non-successful HTTP response.
>>>>>>>
>>>>>>>
>>>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>>>> elasticsearch.log either...
>>>>>>>>
>>>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 30 January 2013 14:16, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>>>> being posted.  The connector is seeing a non-200 HTTP response, and
>>>>>>>>> throwing an exception as a result:
>>>>>>>>>
>>>>>>>>>       if (!checkResultCode(method.getStatusCode()))
>>>>>>>>>         throw new ManifoldCFException(getResultDescription());
>>>>>>>>>
>>>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>>>> code is, but you did not include that key info.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>> Thanks for all your help Karl!
>>>>>>>>>>
>>>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>>>
>>>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>>>
>>>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>>>
>>>>>>>>>>> First, which version of ManifoldCF is this?  I need to know that
>>>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>>>
>>>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>>>> UI?  Does it say "Connection working", or something else, and if so,
>>>>>>>>>>> what?
>>>>>>>>>>>
>>>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>>>> at this point, but I can fix that quickly with your help. ;-)
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>>>> and which are of type "Indexation" that say "success"?  If that is the
>>>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>>>
>>>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>>>
>>>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>>>
>>>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>>>
>>>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>>>
>>>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>>>
>>>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>>>
>>>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>>>> line on that server and curl that URL successfully.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>
>
>
> --
>
> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
