manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Diagnosing "REJECTED" documents in job history
Date Mon, 04 Feb 2013 17:03:19 GMT
After you change the DFC jars, the process you need to restart is the
MCF documentum-server-process.  Restarting just the agents process won't
change anything.  Let's hope that's all it is.

Karl

On Mon, Feb 4, 2013 at 11:59 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
> Sadly, I did a completely fresh build, with a new database, and I
> still get REJECTED for all the documents found, with no log messages.
>
> I also tried upgrading my DFC jars to those from Documentum 6.7 as one
> of my colleagues pointed out that we use 6.6 which doesn't officially
> support IDfSysObject.getContentType. Turns out that this method
> returns the content type correctly if you use the 6.7 jars, even if
> (like us) your Documentum installation is only 6.6 -- we verified this
> with a quick Java test.
>
> However, this doesn't seem to make a difference to our ManifoldCF problem.
>
> I'm pretty stumped -- I think I might have to fire up ManifoldCF in a
> debug JVM and set some breakpoints.
>
>
> On 2 February 2013 18:14, Karl Wright <daddywri@gmail.com> wrote:
>> On Sat, Feb 2, 2013 at 10:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>> Thanks Karl -- I'll do a new build on Monday and go through all the
>>> setup again from scratch to make sure I haven't left anything out.
>>>
>>> Pretty sure I'm running against DFC as it wouldn't be able to get a
>>> list of documents otherwise, presumably?
>>>
>>
>> If you had an existing, already-crawled job and then substituted the
>> stubs, it is possible that it would do something funky like this.
>> Just checking...
>>
>> Karl
>>
>>> On 1 February 2013 18:03, Karl Wright <daddywri@gmail.com> wrote:
>>>> I changed the ElasticSearch connector yet again, so that if it sees a
>>>> null content type, it interprets it as "application/unknown".  At
>>>> least then you can make some progress until you can figure out why
>>>> there is no content type coming out of documentum.
>>>>
>>>> Karl
>>>>
>>>>
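
For readers following the fix described above, here is a minimal sketch of
the null handling, not the actual connector code. The names MimeTypeFilter,
normalize and isIndexable are hypothetical; only the "application/unknown"
fallback comes from the thread. It also illustrates why the
NullPointerException quoted further down can arise: a TreeSet using natural
ordering rejects a null argument to contains().

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not the real ElasticSearchSpecs class.
class MimeTypeFilter {
    // Fallback named in the thread for documents with no content type.
    static final String UNKNOWN = "application/unknown";

    private final Set<String> allowed = new TreeSet<String>();

    MimeTypeFilter(String... allowedTypes) {
        allowed.addAll(Arrays.asList(allowedTypes));
    }

    // Replace a missing content type with the fallback value.
    static String normalize(String mimeType) {
        return mimeType == null ? UNKNOWN : mimeType;
    }

    // Null-safe membership test. Calling allowed.contains(null) directly
    // would throw NullPointerException with natural ordering.
    boolean isIndexable(String mimeType) {
        return allowed.contains(normalize(mimeType));
    }

    public static void main(String[] args) {
        MimeTypeFilter filter = new MimeTypeFilter("application/pdf", UNKNOWN);
        System.out.println(filter.isIndexable(null));
        System.out.println(filter.isIndexable("application/pdf"));
        System.out.println(filter.isIndexable("pdf"));
    }
}
```

Whether "application/unknown" should itself be in the allowed set is a
configuration choice; the sketch includes it so that untyped documents are
accepted rather than rejected.
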
>>>> On Fri, Feb 1, 2013 at 12:44 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>> Are you sure that, after you updated, you are running the Documentum
>>>>> connector server process against DFC, and not with the ManifoldCF
>>>>> build stubs?
>>>>>
>>>>> The code in the connector is pretty simple; it just uses the
>>>>> getContentType() method from the IDfSysObject that represents the
>>>>> document.  That should be darned near foolproof.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, Feb 1, 2013 at 12:30 PM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>> We have something called DAM instead of Webtop -- Digital Asset
>>>>>> Manager I think? (Not a Documentum expert...)
>>>>>>
>>>>>> In DAM they show as "format: pdf" but it doesn't explicitly say what
>>>>>> mimetype they are. I will escalate this to our Documentum support
>>>>>> people, in case it isn't sending a mimetype.
>>>>>>
>>>>>> On 1 February 2013 16:02, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>> You can't significantly change the behavior of the documentum
>>>>>>> connector by simply changing the configuration of the elastic search
>>>>>>> output connector.  Did anything else change that would account for the
>>>>>>> missing mime types?  Do you see the mime types when you look at the
>>>>>>> documents in Webtop?
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Fri, Feb 1, 2013 at 10:57 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>> Now I'm back to seeing all the documents showing as REJECTED at the
>>>>>>>> fetch stage in the job history. There's nothing in the logs to say why
>>>>>>>> though.
>>>>>>>>
>>>>>>>> I guess this means it's Documentum's fault for sending docs without
>>>>>>>> mime types then?
>>>>>>>>
>>>>>>>> Thanks again for all your help!
>>>>>>>>
>>>>>>>> On 1 February 2013 15:14, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>> OK, I've checked in a fix to trunk.
>>>>>>>>>
>>>>>>>>> Please synch up and try again.
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Fri, Feb 1, 2013 at 10:10 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>> The problem is that there are some documents you are indexing that
>>>>>>>>>> have no mime type set at all.  The ElasticSearch connector is not
>>>>>>>>>> handling that case properly.  I've opened ticket CONNECTORS-637, and
>>>>>>>>>> will fix it shortly.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>
>>>>>>>>>>> The extended logging has helped me find the next problem :-)
>>>>>>>>>>>
>>>>>>>>>>> Now I'm seeing hundreds of exceptions like this in the manifold log:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
>>>>>>>>>>> java.lang.NullPointerException
>>>>>>>>>>>         at java.util.TreeMap.getEntry(TreeMap.java:324)
>>>>>>>>>>>         at java.util.TreeMap.containsKey(TreeMap.java:209)
>>>>>>>>>>>         at java.util.TreeSet.contains(TreeSet.java:217)
>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> There'll be a whole batch, then a pause, then another batch. I suspect
>>>>>>>>>>> this is because MCF is retrying?
>>>>>>>>>>>
>>>>>>>>>>> My theory about this is that Documentum is returning the mime type as
>>>>>>>>>>> just "pdf" instead of "application/pdf" -- although I did add "pdf" as
>>>>>>>>>>> an allowed mime type in the ElasticSearch page of the job config, just
>>>>>>>>>>> to see if it would parse this ok.
>>>>>>>>>>>
>>>>>>>>>>> Do you know if there's any way to map from a source's content type to
>>>>>>>>>>> a destination's content type?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 31 January 2013 23:09, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>> I just chased down and fixed a problem in trunk.  ElasticSearch is now
>>>>>>>>>>>> returning a 201 code for successful indexing in some cases, and the
>>>>>>>>>>>> connector was not handling that as 'success'.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
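
A minimal sketch of the kind of status check involved, assuming the
connector only needs to treat 200 (OK) and 201 (Created) as success.
ResultCodeCheck, isSuccess and describeFailure are hypothetical names, not
the connector's real checkResultCode or getResultDescription.

```java
import java.net.HttpURLConnection;

// Hypothetical stand-in for the connector's result-code check.
class ResultCodeCheck {
    // Accept both 200 (OK) and 201 (Created) as successful indexing.
    static boolean isSuccess(int statusCode) {
        return statusCode == HttpURLConnection.HTTP_OK
            || statusCode == HttpURLConnection.HTTP_CREATED;
    }

    // Build a log message that keeps the status code and response body,
    // so a failure is not reported as a bare stack trace.
    static String describeFailure(int statusCode, String responseBody) {
        return "Unexpected HTTP status " + statusCode + ": " + responseBody;
    }

    public static void main(String[] args) {
        System.out.println(isSuccess(201));
        System.out.println(isSuccess(400));
        System.out.println(describeFailure(400, "example response body"));
    }
}
```
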
>>>>>>>>>>>> On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>> Please let me know if you see any problems.  I'll fix anything you
>>>>>>>>>>>>> find as quickly as I can.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>> Great, thanks, I'll give it a try.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30 January 2013 18:52, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>> I just checked in a refactoring to trunk that should improve Elastic
>>>>>>>>>>>>>>> Search error reporting significantly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>> I agree that the Elastic Search connector needs far better logging and
>>>>>>>>>>>>>>>> error handling.  CONNECTORS-629.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>>>>>>>>>>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>>>>>>>>>>>>>> it turns out ES index names must be all lowercase.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Never knew that before.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Slightly annoyed that ES didn't log that...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks again for your help Karl :-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My only request on the MCF front would be that it would be nice for
>>>>>>>>>>>>>>>>> the output connector to log the actual status code and content of a
>>>>>>>>>>>>>>>>> non-successful HTTP response.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
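
The lowercase rule mentioned above can be checked before a job is
configured. This is a minimal sketch under that one assumption only;
Elasticsearch has further index-naming restrictions that it does not cover,
and IndexNameCheck and its methods are hypothetical names.

```java
import java.util.Locale;

// Hypothetical pre-flight check for the index name used by an output connection.
class IndexNameCheck {
    // True when the name contains no uppercase characters.
    static boolean isAllLowercase(String indexName) {
        return indexName.equals(indexName.toLowerCase(Locale.ROOT));
    }

    // Lowercase the name using a locale-independent conversion.
    static String toLowercaseName(String indexName) {
        return indexName.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String name = "DocumentumRoW"; // the index name from the thread
        System.out.println(isAllLowercase(name));
        System.out.println(toLowercaseName(name));
    }
}
```
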
>>>>>>>>>>>>>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>>>>>>>>>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>>>>>>>>>>>>>> elasticsearch.log either...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 30 January 2013 14:16, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>>>>>>>>>>>>>> being posted.  The connector is seeing a non-200 HTTP response, and
>>>>>>>>>>>>>>>>>>> throwing an exception as a result:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>       if (!checkResultCode(method.getStatusCode()))
>>>>>>>>>>>>>>>>>>>         throw new ManifoldCFException(getResultDescription());
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>>>>>>>>>>>>>> code is, but you did not include that key info.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>> Thanks for all your help Karl!
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> First, which version of ManifoldCF is this?  I need to know that
>>>>>>>>>>>>>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>>>>>>>>>>>>>> UI?  Does it say "Connection working", or something else, and if so,
>>>>>>>>>>>>>>>>>>>>> what?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>>>>>>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>>>>>>>>>>>>>> at this point, but I can fix that quickly with your help. ;-)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>>>>>>>>>>>>>> and which are of type "Indexation" that say "success"?  If that is the
>>>>>>>>>>>>>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>>>>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>>>>>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>>>>>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>>>>>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>>>>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>>>>>>>>>>>>>        at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>>>>>>>>>>>>>        at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>>>>>>>>>>>>>        at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>>>>>>>>>>>>>        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>>>>>>>>>>>>>        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>>>>>>>>>>>>>        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>>>>>>>>>>>>>        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>>>>>>>>>>>>>        at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>>>>>>>>>>>>>        at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>>>>>>>>>>>>        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>>>>>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>>>>>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>>>>>>>>>>>>>> line on that server and curl that URL successfully.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
