manifoldcf-user mailing list archives

From Karl Wright <>
Subject Re: Running 2 jobs to update same document Index but different fields
Date Wed, 28 Mar 2012 12:45:13 GMT
The Solr handler that you are using should not matter here.

Can you look at the Simple History report, and do the following:

- Look for a document that is being indexed in both PDF and XML.
- Find the "ingestion" activity for that document for both PDF and XML
- Compare the IDs (which for the ingestion activity are the URLs of
the documents in Webtop)

If the URLs are in fact different, then you should be able to make
this work.  You need to look at how you configured your Solr instance,
and which fields you are specifying in your Solr output connection.
You want those Webtop urls to be indexed as the unique document
identifier in Solr, not some other ID.
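
To illustrate why the unique identifier matters, here is a minimal sketch
in Python. It is a toy model only, not Solr or ManifoldCF code: the
index_document helper, the "doc-42" ID, and the webtop.example URLs are
all hypothetical. It assumes the usual behavior that adding a document
whose unique key already exists replaces the whole stored document.

```python
# Toy model of an index keyed by a unique document identifier.
# Not Solr or ManifoldCF code; all names and URLs here are hypothetical.

def index_document(index, unique_id, fields):
    """Add a document; an existing document with the same ID is replaced whole."""
    index[unique_id] = dict(fields)
    return index

# Case 1: both jobs send the same ID, so the second job replaces the first.
shared = {}
index_document(shared, "doc-42", {"field1": "from xml"})
index_document(shared, "doc-42", {"field4": "from pdf"})

# Case 2: each job uses the document's own URL as the ID, so both survive.
by_url = {}
index_document(by_url, "http://webtop.example/doc-42.xml", {"field1": "from xml"})
index_document(by_url, "http://webtop.example/doc-42.pdf", {"field4": "from pdf"})

print(len(shared), sorted(shared["doc-42"]))
print(len(by_url))
```

In the first case only the PDF job's fields remain, which matches the
overwriting you are seeing. In the second case the XML and PDF entries
are kept as two separate documents.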


On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya
<> wrote:
> Today I ran the 2 jobs one after the other, but it seems that since we
> are using the /update/extract request handler, the field values for a
> common id get overwritten by the latest job. I want to update certain
> fields in the Lucene index for the doc rather than completely replace it
> with new values and lose the other field value entries.
> On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright <> wrote:
>> For Documentum, content length is in bytes, I believe.  It does not
>> set the length; it filters out all documents greater than the
>> specified length.  Leaving the field blank will perform no filtering.
>> Document types in Documentum are specified by mime type, so you'd want
>> to select all that apply.  The actual one used will depend on how your
>> particular instance of Documentum is configured, but if you pick them
>> all you should have no problem.
>> Karl
>> On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya
>> <> wrote:
>>> Thanks!! It seems from your explanation that I can update the same
>>> document's other field values. I asked about this because I have two
>>> different documents with a parent-child relationship which need to be
>>> indexed as one document in the Lucene index.
>>> As you must have understood by now, I am trying to do this for the
>>> Documentum CMS. I have seen the configuration screen for setting the
>>> content length and the second one for filtering by document type. So my
>>> question is: what unit does the content length accept (bits, bytes, KB,
>>> MB, etc.), and does this configuration set the length for full-text
>>> indexing of documents?
>>> Additionally, to scan only one kind of document, e.g. PDF, what should be
>>> added to filter those documents? Is it application/pdf or PDF?
>>> Regards
>>> Anupam
>>> On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright <> wrote:
>>>> The document key in Solr is the url of the document, as constructed by
>>>> the connector you are using.  If you are using the same document to
>>>> construct two different Solr documents, ManifoldCF by definition
>>>> cannot be aware of this.  But if these are different files from the
>>>> point of view of ManifoldCF they will have different URLs and be
>>>> treated differently.  The jobs can overlap in this case with no
>>>> difficulty.
>>>> Karl
>>>> On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
>>>> <> wrote:
>>>> > I want to configure two jobs to index into Solr using ManifoldCF with
>>>> > the /update/extract requestHandler:
>>>> > the 1st to synchronize only the XML files and the 2nd to synchronize
>>>> > the PDF files.
>>>> > If both these documents share a unique id, can I combine both indexes
>>>> > in 1 Solr schema without overwriting the details added by the previous job?
>>>> >
>>>> > suppose,
>>>> >       xmldoc indexes field0(id), field1, field2, field3
>>>> > &    pdfdoc indexes field0(id), field4, field5, field6.
>>>> >
>>>> > Output docindex ==> (xml+pdf doc), field0(id), field1, field2, field3,
>>>> > field4, field5, field6
>>>> >
>>>> > Regards
>>>> > Anupam
>>>> >
>>>> >
> --
> Thanks & Regards
> Anupam Bhattacharya
