manifoldcf-user mailing list archives

From Andrew Clegg <andrew.cl...@gmail.com>
Subject Re: Diagnosing "REJECTED" documents in job history
Date Mon, 04 Feb 2013 16:59:39 GMT
Sadly, I did a completely fresh build, with a new database, and I
still get REJECTED for all the documents found, with no log messages.

I also tried upgrading my DFC jars to those from Documentum 6.7 as one
of my colleagues pointed out that we use 6.6 which doesn't officially
support IDfSysObject.getContentType. Turns out that this method
returns the content type correctly if you use the 6.7 jars, even if
(like us) your Documentum installation is only 6.6 -- we verified this
with a quick Java test.

However, this doesn't seem to make a difference to our ManifoldCF problem.

I'm pretty stumped -- I think I might have to fire up ManifoldCF in a
debug JVM and set some breakpoints.


On 2 February 2013 18:14, Karl Wright <daddywri@gmail.com> wrote:
> On Sat, Feb 2, 2013 at 10:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>> Thanks Karl -- I'll do a new build on Monday and go through all the
>> setup again from scratch to make sure I haven't left anything out.
>>
>> Pretty sure I'm running against DFC as it wouldn't be able to get a
>> list of documents otherwise, presumably?
>>
>
> If you had an existing, already-crawled job it is potentially possible
> that if you then substituted the stub it might do something funky like
> this.  Just checking...
>
> Karl
>
>> On 1 February 2013 18:03, Karl Wright <daddywri@gmail.com> wrote:
>>> I changed the ElasticSearch connector yet again, so that if it sees a
>>> null content type, it interprets it as "application/unknown".  At
>>> least then you can make some progress until you can figure out why
>>> there is no content type coming out of documentum.
>>>
>>> Karl
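
A minimal sketch of the null handling described above, assuming the allowed
mime types are held in a TreeSet, as the stack trace further down suggests.
The class and method names are hypothetical and are not ManifoldCF's own. It
shows that an unguarded contains(null) on a naturally ordered TreeSet throws
NullPointerException, and how substituting "application/unknown" avoids that:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the connector's mime type check; not ManifoldCF code.
final class MimeTypeCheck {
  // Fallback value named in the thread for documents with no content type.
  static final String UNKNOWN_MIME_TYPE = "application/unknown";

  // Null-safe check: substitute the fallback before consulting the set,
  // because a naturally ordered TreeSet throws NullPointerException on contains(null).
  static boolean isIndexable(String mimeType, Set<String> allowedMimeTypes) {
    String effective = (mimeType == null) ? UNKNOWN_MIME_TYPE : mimeType;
    return allowedMimeTypes.contains(effective);
  }

  public static void main(String[] args) {
    Set<String> allowed =
        new TreeSet<>(Arrays.asList("application/pdf", UNKNOWN_MIME_TYPE));

    try {
      allowed.contains(null); // unguarded lookup with a missing mime type
    } catch (NullPointerException e) {
      System.out.println("Unguarded lookup threw NullPointerException");
    }

    System.out.println(isIndexable(null, allowed));              // true
    System.out.println(isIndexable("application/pdf", allowed)); // true
    System.out.println(isIndexable("pdf", allowed));             // false
  }
}
```

Under this sketch, a document with no content type is accepted only if
"application/unknown" is itself in the allowed list.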
>>>
>>>
>>> On Fri, Feb 1, 2013 at 12:44 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>> Are you sure that, after you updated, you are running the Documentum
>>>> connector server process against DFC, and not with the ManifoldCF
>>>> build stubs?
>>>>
>>>> The code in the connector is pretty simple; it just uses the
>>>> getContentType() method from the IDfSysObject that represents the
>>>> document.  That should be darned near foolproof.
>>>>
>>>> Karl
>>>>
>>>>
>>>>> On Fri, Feb 1, 2013 at 12:30 PM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>> We have something called DAM instead of Webtop -- Digital Asset
>>>>> Manager I think? (Not a Documentum expert...)
>>>>>
>>>>> In DAM they show as "format: pdf" but it doesn't explicitly say what
>>>>> mimetype they are. I will escalate this to our Documentum support
>>>>> people, in case it isn't sending a mimetype.
>>>>>
>>>>> On 1 February 2013 16:02, Karl Wright <daddywri@gmail.com> wrote:
>>>>>> You can't significantly change the behavior of the documentum
>>>>>> connector by simply changing the configuration of the elastic search
>>>>>> output connector.  Did anything else change that would account for the
>>>>>> missing mime types?  Do you see the mime types when you look at the
>>>>>> documents in Webtop?
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Fri, Feb 1, 2013 at 10:57 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>> Now I'm back to seeing all the documents showing as REJECTED at the
>>>>>>> fetch stage in the job history. There's nothing in the logs to say why
>>>>>>> though.
>>>>>>>
>>>>>>> I guess this means it's Documentum's fault for sending docs without
>>>>>>> mime types then?
>>>>>>>
>>>>>>> Thanks again for all your help!
>>>>>>>
>>>>>>> On 1 February 2013 15:14, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>> OK, I've checked in a fix to trunk.
>>>>>>>>
>>>>>>>> Please synch up and try again.
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Fri, Feb 1, 2013 at 10:10 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>> The problem is that there are some documents you are indexing that
>>>>>>>>> have no mime type set at all.  The ElasticSearch connector is not
>>>>>>>>> handling that case properly.  I've opened ticket CONNECTORS-637, and
>>>>>>>>> will fix it shortly.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>> Hi Karl,
>>>>>>>>>>
>>>>>>>>>> The extended logging has helped me find the next problem :-)
>>>>>>>>>>
>>>>>>>>>> Now I'm seeing hundreds of exceptions like this in the manifold log:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
>>>>>>>>>> java.lang.NullPointerException
>>>>>>>>>>         at java.util.TreeMap.getEntry(TreeMap.java:324)
>>>>>>>>>>         at java.util.TreeMap.containsKey(TreeMap.java:209)
>>>>>>>>>>         at java.util.TreeSet.contains(TreeSet.java:217)
>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> There'll be a whole batch, then a pause, then another batch. I suspect
>>>>>>>>>> this is because MCF is retrying?
>>>>>>>>>>
>>>>>>>>>> My theory about this is that Documentum is returning the mime type as
>>>>>>>>>> just "pdf" instead of "application/pdf" -- although I did add "pdf" as
>>>>>>>>>> an allowed mime type in the ElasticSearch page of the job config, just
>>>>>>>>>> to see if it would parse this ok.
>>>>>>>>>>
>>>>>>>>>> Do you know if there's any way to map from a source's content type to
>>>>>>>>>> a destination's content type?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 31 January 2013 23:09, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>> I just chased down and fixed a problem in trunk.  ElasticSearch is now
>>>>>>>>>>> returning a 201 code for successful indexing in some cases, and the
>>>>>>>>>>> connector was not handling that as 'success'.
>>>>>>>>>>>
>>>>>>>>>>> Karl
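
A minimal sketch of a status check that accepts 201 as well as 200. It
assumes any 2xx code should count as success; the actual fix in trunk may be
narrower, and the class and method names here are hypothetical:

```java
// Hypothetical helper, not ManifoldCF code: classifies an HTTP status code.
final class ResultCodeCheck {
  // Treat every 2xx status as success, so 201 Created passes as well as 200 OK.
  static boolean isSuccess(int statusCode) {
    return statusCode >= 200 && statusCode < 300;
  }

  public static void main(String[] args) {
    System.out.println(isSuccess(200)); // true
    System.out.println(isSuccess(201)); // true
    System.out.println(isSuccess(400)); // false
  }
}
```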
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>> Please let me know if you see any problems.  I'll fix anything you
>>>>>>>>>>>> find as quickly as I can.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>> Great, thanks, I'll give it a try.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 30 January 2013 18:52, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>> I just checked in a refactoring to trunk that should improve Elastic
>>>>>>>>>>>>>> Search error reporting significantly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>> I agree that the Elastic Search connector needs far better logging and
>>>>>>>>>>>>>>> error handling.  CONNECTORS-629.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>>>>>>>>>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>>>>>>>>>>>>> it turns out ES index names must be all lowercase.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Never knew that before.
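
A minimal sketch of a pre-flight check for the lowercase rule mentioned
above. It covers only letter case, not any other naming restrictions ES may
apply, and the class and method names are hypothetical:

```java
import java.util.Locale;

// Hypothetical helper, not part of ManifoldCF or ElasticSearch.
final class IndexNameCheck {
  // True when the name contains no uppercase characters.
  static boolean isAllLowercase(String indexName) {
    return indexName.equals(indexName.toLowerCase(Locale.ROOT));
  }

  // Lowercases the name using a locale-independent rule.
  static String normalize(String indexName) {
    return indexName.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    String configured = "DocumentumRoW"; // the index name from the thread
    System.out.println(isAllLowercase(configured)); // false
    System.out.println(normalize(configured));      // documentumrow
  }
}
```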
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Slightly annoyed that ES didn't log that...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks again for your help Karl :-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My only request on the MCF front would be that it would be nice for
>>>>>>>>>>>>>>>> the output connector to log the actual status code and content of a
>>>>>>>>>>>>>>>> non-successful HTTP response.
>>>>>>>>>>>>>>>>
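
A minimal sketch of the kind of log message requested above, carrying both
the status code and the response content. The class name, the message format
and the 500-character cap are all assumptions made for illustration, and the
response body in main is a placeholder rather than real ES output:

```java
import java.util.logging.Logger;

// Hypothetical helper, not ManifoldCF code: formats a failed HTTP response for the log.
final class HttpFailureLog {
  private static final Logger LOG = Logger.getLogger(HttpFailureLog.class.getName());
  // Assumed cap so that a large response body does not flood the log.
  private static final int MAX_BODY_CHARS = 500;

  // Builds a one-line description containing the status code and (truncated) body.
  static String describe(int statusCode, String responseBody) {
    String body = (responseBody == null) ? "<no body>" : responseBody;
    if (body.length() > MAX_BODY_CHARS) {
      body = body.substring(0, MAX_BODY_CHARS) + "...";
    }
    return "HTTP " + statusCode + ": " + body;
  }

  public static void main(String[] args) {
    // Placeholder body; a real caller would pass the server's response text.
    LOG.warning(describe(400, "example response body"));
  }
}
```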
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>>>>>>>>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>>>>>>>>>>>>> elasticsearch.log either...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 30 January 2013 14:16, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>>>>>>>>>>>>> being posted.  The connector is seeing a non-200 HTTP response, and
>>>>>>>>>>>>>>>>>> throwing an exception as a result:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>       if (!checkResultCode(method.getStatusCode()))
>>>>>>>>>>>>>>>>>>         throw new ManifoldCFException(getResultDescription());
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>>>>>>>>>>>>> code is, but you did not include that key info.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> Thanks for all your help Karl!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> First, which version of ManifoldCF is this?  I need to know that
>>>>>>>>>>>>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>>>>>>>>>>>>> UI?  Does it say "Connection working", or something else, and if so,
>>>>>>>>>>>>>>>>>>>> what?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>>>>>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>>>>>>>>>>>>> at this point, but I can fix that quickly with your help. ;-)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>>>>>>>>>>>>> and which are of type "Indexation" that say "success"?  If that is the
>>>>>>>>>>>>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>>>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>>>>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>>>>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>>>>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>>>>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>>>>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>>>>>>>>>>>>> line on that server and curl that URL successfully.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
