manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Diagnosing "REJECTED" documents in job history
Date Sat, 02 Feb 2013 18:14:38 GMT
On Sat, Feb 2, 2013 at 10:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
> Thanks Karl -- I'll do a new build on Monday and go through all the
> setup again from scratch to make sure I haven't left anything out.
>
> Pretty sure I'm running against DFC as it wouldn't be able to get a
> list of documents otherwise, presumably?
>

If you had an existing, already-crawled job and then substituted the
stub, it is possible it would do something funky like this.  Just
checking...

Karl

> On 1 February 2013 18:03, Karl Wright <daddywri@gmail.com> wrote:
>> I changed the ElasticSearch connector yet again, so that if it sees a
>> null content type, it interprets it as "application/unknown".  At
>> least then you can make some progress until you can figure out why
>> there is no content type coming out of documentum.
>>
>> Karl
>>
>>
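For illustration, here is a minimal sketch of the null handling described
above.  It is not the actual connector code: the class MimeTypeCheck and
its method names are hypothetical, and only the "application/unknown"
fallback comes from this thread.  It also shows why the stack trace quoted
further down ends in TreeSet.contains: a TreeSet with natural ordering
throws NullPointerException when asked about a null element.

```java
import java.util.Set;

// Hypothetical helper, NOT the real ElasticSearchSpecs implementation.
final class MimeTypeCheck {
  // Fallback named in this thread for documents with no content type.
  static final String UNKNOWN = "application/unknown";

  private MimeTypeCheck() {}

  // Substitutes the fallback when the repository supplies no content type.
  static String normalize(String mimeType) {
    return mimeType == null ? UNKNOWN : mimeType;
  }

  // Normalizes first, so a sorted set never sees a null key.
  static boolean isIndexable(String mimeType, Set<String> allowed) {
    return allowed.contains(normalize(mimeType));
  }
}
```

With this shape, a document with no mime type is indexed only if
"application/unknown" is in the allowed list; otherwise it is rejected
without an exception.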
>> On Fri, Feb 1, 2013 at 12:44 PM, Karl Wright <daddywri@gmail.com> wrote:
>>> Are you sure that, after you updated, you are running the Documentum
>>> connector server process against DFC, and not with the ManifoldCF
>>> build stubs?
>>>
>>> The code in the connector is pretty simple; it just uses the
>>> getContentType() method from the IDfSysObject that represents the
>>> document.  That should be darned near foolproof.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Feb 1, 2013 at 12:30 PM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>> We have something called DAM instead of Webtop -- Digital Asset
>>>> Manager I think? (Not a Documentum expert...)
>>>>
>>>> In DAM they show as "format: pdf" but it doesn't explicitly say what
>>>> mimetype they are. I will escalate this to our Documentum support
>>>> people, in case it isn't sending a mimetype.
>>>>
>>>> On 1 February 2013 16:02, Karl Wright <daddywri@gmail.com> wrote:
>>>>> You can't significantly change the behavior of the documentum
>>>>> connector by simply changing the configuration of the elastic search
>>>>> output connector.  Did anything else change that would account for the
>>>>> missing mime types?  Do you see the mime types when you look at the
>>>>> documents in Webtop?
>>>>>
>>>>> Karl
>>>>>
>>>>> On Fri, Feb 1, 2013 at 10:57 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>> Now I'm back to seeing all the documents showing as REJECTED at the
>>>>>> fetch stage in the job history. There's nothing in the logs to say
>>>>>> why though.
>>>>>>
>>>>>> I guess this means it's Documentum's fault for sending docs without
>>>>>> mime types then?
>>>>>>
>>>>>> Thanks again for all your help!
>>>>>>
>>>>>> On 1 February 2013 15:14, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>> OK, I've checked in a fix to trunk.
>>>>>>>
>>>>>>> Please synch up and try again.
>>>>>>> Karl
>>>>>>>
>>>>>>> On Fri, Feb 1, 2013 at 10:10 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>> The problem is that there are some documents you are indexing that
>>>>>>>> have no mime type set at all.  The ElasticSearch connector is not
>>>>>>>> handling that case properly.  I've opened ticket CONNECTORS-637, and
>>>>>>>> will fix it shortly.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> The extended logging has helped me find the next problem :-)
>>>>>>>>>
>>>>>>>>> Now I'm seeing hundreds of exceptions like this in the manifold log:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
>>>>>>>>> java.lang.NullPointerException
>>>>>>>>>         at java.util.TreeMap.getEntry(TreeMap.java:324)
>>>>>>>>>         at java.util.TreeMap.containsKey(TreeMap.java:209)
>>>>>>>>>         at java.util.TreeSet.contains(TreeSet.java:217)
>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There'll be a whole batch, then a pause, then another batch. I suspect
>>>>>>>>> this is because MCF is retrying?
>>>>>>>>>
>>>>>>>>> My theory about this is that Documentum is returning the mime type as
>>>>>>>>> just "pdf" instead of "application/pdf" -- although I did add "pdf" as
>>>>>>>>> an allowed mime type in the ElasticSearch page of the job config, just
>>>>>>>>> to see if it would parse this ok.
>>>>>>>>>
>>>>>>>>> Do you know if there's any way to map from a source's content type to
>>>>>>>>> a destination's content type?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 31 January 2013 23:09, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>> I just chased down and fixed a problem in trunk.  ElasticSearch is now
>>>>>>>>>> returning a 201 code for successful indexing in some cases, and the
>>>>>>>>>> connector was not handling that as 'success'.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
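A minimal sketch of the kind of check involved, assuming the fix simply
treats 201 (Created) as success alongside 200.  The checkResultCode name
appears in the connector snippet quoted later in this thread; the class
wrapper and the method body here are assumptions, not the committed code.

```java
// Hypothetical stand-in for the connector's result-code check.
final class ResultCodeCheck {
  private ResultCodeCheck() {}

  // Treats both 200 (OK) and 201 (Created) as successful indexing.
  static boolean checkResultCode(int statusCode) {
    return statusCode == 200 || statusCode == 201;
  }
}
```

Whether other 2xx codes should also count as success is left open here;
the thread only mentions 200 and 201.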
>>>>>>>>>> On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>> Please let me know if you see any problems.  I'll fix anything you
>>>>>>>>>>> find as quickly as I can.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>> Great, thanks, I'll give it a try.
>>>>>>>>>>>>
>>>>>>>>>>>> On 30 January 2013 18:52, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>> I just checked in a refactoring to trunk that should improve Elastic
>>>>>>>>>>>>> Search error reporting significantly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>> I agree that the Elastic Search connector needs far better logging and
>>>>>>>>>>>>>> error handling.  CONNECTORS-629.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>>>>>>>>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>>>>>>>>>>>> it turns out ES index names must be all lowercase.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Never knew that before.
>>>>>>>>>>>>>>>
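A small sketch of a pre-flight check that would have caught this before
any document was posted.  The class and method names are hypothetical,
and the check covers only the all-lowercase rule mentioned in this
thread, not any other index naming rules ElasticSearch may enforce.

```java
import java.util.Locale;

// Hypothetical helper for validating a configured index name.
final class IndexNames {
  private IndexNames() {}

  // True when the name has no uppercase characters.
  static boolean isAllLowercase(String indexName) {
    return indexName.equals(indexName.toLowerCase(Locale.ROOT));
  }

  // Lowercases a configured name in a locale-independent way.
  static String toLowercase(String indexName) {
    return indexName.toLowerCase(Locale.ROOT);
  }
}
```

For the name used here, isAllLowercase("DocumentumRoW") is false and
toLowercase("DocumentumRoW") gives "documentumrow".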
>>>>>>>>>>>>>>> Slightly annoyed that ES didn't log that...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks again for your help Karl :-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My only request on the MCF front would be that it would be nice for
>>>>>>>>>>>>>>> the output connector to log the actual status code and content of a
>>>>>>>>>>>>>>> non-successful HTTP response.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
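One way the request above could look, as a sketch only.  The class and
method below are hypothetical and do not reflect how the connector builds
its messages; the point is just that the status code and response body
both end up in the exception text, so the log shows more than a bare
stack trace.

```java
// Hypothetical formatter for a failed HTTP response.
final class HttpFailure {
  private HttpFailure() {}

  // Builds a log-friendly description from the status code and body.
  static String describe(int statusCode, String responseBody) {
    String body = (responseBody == null || responseBody.isEmpty())
        ? "<empty body>"
        : responseBody;
    return "HTTP " + statusCode + ": " + body;
  }
}
```

The returned string could then be passed as the message of the exception
the connector throws on a non-successful response.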
>>>>>>>>>>>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>>>>>>>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>>>>>>>>>>>> elasticsearch.log either...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 30 January 2013 14:16, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>>>>>>>>>>>> being posted.  The connector is seeing a non-200 HTTP response, and
>>>>>>>>>>>>>>>>> throwing an exception as a result:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>       if (!checkResultCode(method.getStatusCode()))
>>>>>>>>>>>>>>>>>         throw new ManifoldCFException(getResultDescription());
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>>>>>>>>>>>> code is, but you did not include that key info.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> Thanks for all your help Karl!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> First, which version of ManifoldCF is this?  I need to know that
>>>>>>>>>>>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>>>>>>>>>>>> UI?  Does it say "Connection working", or something else, and if so,
>>>>>>>>>>>>>>>>>>> what?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>>>>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>>>>>>>>>>>> at this point, but I can fix that quickly with your help. ;-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>>>>>>>>>>>> and which are of type "Indexation" that say "success"?  If that is the
>>>>>>>>>>>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>>>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>>>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>>>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>>>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>>>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>>>>>>>>>>>> line on that server and curl that URL successfully.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
