manifoldcf-user mailing list archives

From Andrew Clegg <andrew.cl...@gmail.com>
Subject Re: Diagnosing "REJECTED" documents in job history
Date Fri, 01 Feb 2013 14:44:02 GMT
PS if it helps narrow the problem down: I took the entry for just
"pdf" out of the allowed mime types for the job, leaving just the
standard set of mimetypes that the job config is pre-populated with,
and it still happens.
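
One possible reading of the stack trace quoted below, offered as a guess rather than a diagnosis: a TreeSet with natural ordering throws NullPointerException from TreeMap.getEntry when it is asked whether it contains null. That would fit a mime type that is missing altogether rather than one that is merely "pdf", and it would also fit the fact that changing the allowed list makes no difference. The sketch below is self-contained and does not use any ManifoldCF classes; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a connector's allowed-mime-type check.
final class MimeTypeCheckSketch {
    // Allowed types, as they might be configured on the job.
    private final Set<String> allowed = new TreeSet<>();

    MimeTypeCheckSketch(String... mimeTypes) {
        allowed.addAll(Arrays.asList(mimeTypes));
    }

    // Unguarded lookup: TreeSet.contains(null) throws NullPointerException
    // when the set uses natural ordering (no custom comparator).
    boolean checkUnguarded(String mimeType) {
        return allowed.contains(mimeType);
    }

    // Guarded lookup: a missing mime type is simply treated as not indexable.
    boolean checkGuarded(String mimeType) {
        return mimeType != null && allowed.contains(mimeType);
    }

    public static void main(String[] args) {
        MimeTypeCheckSketch check =
            new MimeTypeCheckSketch("application/pdf", "text/plain");
        System.out.println(check.checkGuarded("application/pdf"));
        System.out.println(check.checkGuarded(null));
        try {
            check.checkUnguarded(null);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException on null mime type");
        }
    }
}
```

If this reading is right, the guard only hides the symptom; the open question would still be why the repository connector hands over no mime type for these documents.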

On 1 February 2013 14:36, Andrew Clegg <andrew.clegg@gmail.com> wrote:
> Hi Karl,
>
> The extended logging has helped me find the next problem :-)
>
> Now I'm seeing hundreds of exceptions like this in the manifold log:
>
>
> FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
> java.lang.NullPointerException
>         at java.util.TreeMap.getEntry(TreeMap.java:324)
>         at java.util.TreeMap.containsKey(TreeMap.java:209)
>         at java.util.TreeSet.contains(TreeSet.java:217)
>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)
>
>
> There'll be a whole batch, then a pause, then another batch. I suspect
> this is because MCF is retrying?
>
> My theory about this is that Documentum is returning the mime type as
> just "pdf" instead of "application/pdf" -- although I did add "pdf" as
> an allowed mime type in the ElasticSearch page of the job config, just
> to see if it would parse this ok.
>
> Do you know if there's any way to map from a source's content type to
> a destination's content type?
>
>
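
Nothing in this thread confirms that such a mapping option exists in MCF. Purely to illustrate the idea in the question above, here is a minimal sketch of a lookup from a source-side format name to a full MIME type. The class, the method, and every table entry except the "pdf" case discussed above are assumptions made up for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: translate a repository's short format name
// into a type/subtype MIME string before the output-side check.
final class ContentTypeMapperSketch {
    private static final Map<String, String> FORMAT_TO_MIME = new HashMap<>();
    static {
        FORMAT_TO_MIME.put("pdf", "application/pdf"); // the case from this thread
        FORMAT_TO_MIME.put("html", "text/html");      // illustrative entry only
    }

    // Returns a MIME type, or null when the source type is missing or unknown.
    static String toMimeType(String sourceType) {
        if (sourceType == null) {
            return null;
        }
        String key = sourceType.trim().toLowerCase(Locale.ROOT);
        if (key.indexOf('/') >= 0) {
            return key; // already looks like type/subtype, pass it through
        }
        return FORMAT_TO_MIME.get(key);
    }

    public static void main(String[] args) {
        System.out.println(toMimeType("pdf"));
        System.out.println(toMimeType("application/pdf"));
        System.out.println(toMimeType("unknown-format"));
    }
}
```

Returning null for unknown formats keeps the decision about whether to index with the caller, which then needs a null check of its own.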
>
> On 31 January 2013 23:09, Karl Wright <daddywri@gmail.com> wrote:
>> I just chased down and fixed a problem in trunk.  ElasticSearch is now
>> returning a 201 code for successful indexing in some cases, and the
>> connector was not handling that as 'success'.
>>
>> Karl
>>
>>
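
The actual change in trunk is not shown in this thread. As a rough sketch of the idea only, a status check that accepts the whole 2xx range would treat both 200 and 201 as success, and a failure description could carry the status code and response body, which is what is asked for further down the thread. The class and method names are hypothetical and are not the connector's own.

```java
// Hypothetical sketch, not the connector's real code.
final class ResultCodeSketch {
    // Accept any 2xx status, so 200 (OK) and 201 (Created) both count.
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    // Build a message that includes the status code and the response body.
    static String describeFailure(int statusCode, String responseBody) {
        String body = (responseBody == null) ? "(no body)" : responseBody;
        return "HTTP " + statusCode + ": " + body;
    }

    public static void main(String[] args) {
        System.out.println(isSuccess(200));
        System.out.println(isSuccess(201));
        System.out.println(isSuccess(400));
        System.out.println(describeFailure(400, "bad request"));
    }
}
```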
>> On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddywri@gmail.com> wrote:
>>> Please let me know if you see any problems.  I'll fix anything you
>>> find as quickly as I can.
>>>
>>> Karl
>>>
>>> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>> Great, thanks, I'll give it a try.
>>>>
>>>> On 30 January 2013 18:52, Karl Wright <daddywri@gmail.com> wrote:
>>>>> I just checked in a refactoring to trunk that should improve Elastic
>>>>> Search error reporting significantly.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>> I agree that the Elastic Search connector needs far better logging
>>>>>> and error handling.  CONNECTORS-629.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>>>> it turns out ES index names must be all lowercase.
>>>>>>>
>>>>>>> Never knew that before.
>>>>>>>
>>>>>>> Slightly annoyed that ES didn't log that...
>>>>>>>
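
A pre-flight check on the configured index name would have caught this before any document was posted. The sketch below checks only the one rule mentioned above (all lowercase); ES may impose other naming rules that are not covered here, and the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical validation helper for a configured index name.
final class IndexNameSketch {
    // True when the name contains no uppercase characters.
    static boolean isAllLowercase(String indexName) {
        return indexName.equals(indexName.toLowerCase(Locale.ROOT));
    }

    // Fail fast with a readable message instead of a silent rejection.
    static String requireLowercase(String indexName) {
        if (!isAllLowercase(indexName)) {
            throw new IllegalArgumentException(
                "Index name must be all lowercase: " + indexName);
        }
        return indexName;
    }

    public static void main(String[] args) {
        System.out.println(isAllLowercase("documentumrow"));
        System.out.println(isAllLowercase("DocumentumRoW"));
    }
}
```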
>>>>>>> Thanks again for your help Karl :-)
>>>>>>>
>>>>>>> My only request on the MCF front would be that it would be nice for
>>>>>>> the output connector to log the actual status code and content of a
>>>>>>> non-successful HTTP response.
>>>>>>>
>>>>>>>
>>>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>>>> elasticsearch.log either...
>>>>>>>>
>>>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 30 January 2013 14:16, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>>>> being posted.  The connector is seeing a non-200 HTTP response, and
>>>>>>>>> throwing an exception as a result:
>>>>>>>>>
>>>>>>>>>       if (!checkResultCode(method.getStatusCode()))
>>>>>>>>>         throw new ManifoldCFException(getResultDescription());
>>>>>>>>>
>>>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>>>> code is, but you did not include that key info.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>> Thanks for all your help Karl!
>>>>>>>>>>
>>>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>>>
>>>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>>>
>>>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>>>
>>>>>>>>>>> First, which version of ManifoldCF is this?  I need to know that
>>>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>>>
>>>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>>>> UI?  Does it say "Connection working", or something else, and if so,
>>>>>>>>>>> what?
>>>>>>>>>>>
>>>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>>>> at this point, but I can fix that quickly with your help. ;-)
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>>>> and which are of type "Indexation" that say "success"?  If that is the
>>>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>>>
>>>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>>>
>>>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>>>
>>>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>>>
>>>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>>>
>>>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>>>
>>>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>>>
>>>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>>>> line on that server and curl that URL successfully.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
