manifoldcf-user mailing list archives

From Anupam Bhattacharya <anupam...@gmail.com>
Subject Re: Running 2 jobs to update same document Index but different fields
Date Wed, 28 Mar 2012 13:09:18 GMT
I saw the index being created by the 1st (PDF) indexing job, which worked
perfectly well for a particular id. Later, when I ran the 2nd (XML) indexing
job for the same id, I lost all fields indexed by the 1st job and was left
with only the fields created by this 2nd job.

I thought that it would combine field values for a specified doc id.

According to the Lucene developers, Lucene by design does not support
this.

Please see the following URL: https://issues.apache.org/jira/browse/LUCENE-3837
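
To illustrate the behaviour described above, here is a minimal sketch in plain Python. It is only a model, not the Solr, Lucene, or ManifoldCF API: the index is a dict keyed by document id, and the names index_replace, index_merge, "doc-1", and the field values are all hypothetical. It shows why the second job wipes the first job's fields when a document is replaced wholesale, and what a client-side merge would have to do instead. The merge assumes every field of the existing document can be read back, which may not hold for a real index.

```python
from typing import Any, Dict

Doc = Dict[str, Any]
Index = Dict[str, Doc]


def index_replace(index: Index, doc: Doc) -> None:
    """Model of adding a document whose id already exists:
    the stored document is replaced as a whole."""
    index[doc["id"]] = dict(doc)


def index_merge(index: Index, doc: Doc) -> None:
    """Hypothetical client-side merge: read the existing fields,
    overlay the new ones, then write the whole document back."""
    merged = dict(index.get(doc["id"], {}))
    merged.update(doc)
    index[doc["id"]] = merged


# Illustrative documents sharing one id (values are made up).
pdf_doc = {"id": "doc-1", "field4": "p4", "field5": "p5", "field6": "p6"}
xml_doc = {"id": "doc-1", "field1": "x1", "field2": "x2", "field3": "x3"}

if __name__ == "__main__":
    replaced: Index = {}
    index_replace(replaced, pdf_doc)   # 1st job (PDF)
    index_replace(replaced, xml_doc)   # 2nd job (XML) overwrites the PDF fields
    print(sorted(replaced["doc-1"]))

    merged: Index = {}
    index_merge(merged, pdf_doc)
    index_merge(merged, xml_doc)       # fields from both jobs survive
    print(sorted(merged["doc-1"]))
```

In this model the merge has to happen before the document reaches the index, which is the part that neither of the two independent jobs does for me.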


-Anupam

On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright <daddywri@gmail.com> wrote:

> The Solr handler that you are using should not matter here.
>
> Can you look at the Simple History report, and do the following:
>
> - Look for a document that is being indexed in both PDF and XML.
> - Find the "ingestion" activity for that document for both PDF and XML
> - Compare the IDs (which for the ingestion activity are the URLs of
> the documents in Webtop)
>
> If the URLs are in fact different, then you should be able to make
> this work.  You need to look at how you configured your Solr instance,
> and which fields you are specifying in your Solr output connection.
> You want those Webtop urls to be indexed as the unique document
> identifier in Solr, not some other ID.
>
> Thanks,
> Karl
>
>
> On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya
> <anupamb82@gmail.com> wrote:
> > Today I ran the 2 jobs one after the other, but it seems that, since we
> > are using the /update/extract request handler, the field values for a
> > common id get overwritten by the latest job. I want to update certain
> > fields in the Lucene index for the doc rather than completely replace it
> > with new values and lose the other field entries.
> >
> > On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright <daddywri@gmail.com> wrote:
> >> For Documentum, content length is in bytes, I believe.  It does not
> >> set the length, it filters out all documents greater than the
> >> specified length.  Leaving the field blank will perform no filtering.
> >>
> >> Document types in Documentum are specified by mime type, so you'd want
> >> to select all that apply.  The actual one used will depend on how your
> >> particular instance of Documentum is configured, but if you pick them
> >> all you should have no problem.
> >>
> >> Karl
> >>
> >>
> >> On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya
> >> <anupamb82@gmail.com> wrote:
> >>> Thanks! From your explanation, it seems that I can update the same
> >>> document's other field values. I asked about this because I have two
> >>> different documents with a parent-child relationship which need to be
> >>> indexed as one document in the Lucene index.
> >>>
> >>> As you must have understood by now, I am trying to do this for the
> >>> Documentum CMS. I have seen the configuration screen for setting the
> >>> content length and for filtering by document type. So my question is: in
> >>> what unit does the content length accept values (bits, bytes, KB, MB,
> >>> etc.), and does this configuration set the length used for full-text
> >>> indexing of documents?
> >>>
> >>> Additionally, to scan only one kind of document, e.g. PDF, what should be
> >>> added to filter those documents? Is it application/pdf or PDF?
> >>>
> >>> Regards
> >>> Anupam
> >>>
> >>>
> >>> On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright <daddywri@gmail.com> wrote:
> >>>>
> >>>> The document key in Solr is the url of the document, as constructed by
> >>>> the connector you are using.  If you are using the same document to
> >>>> construct two different Solr documents, ManifoldCF by definition
> >>>> cannot be aware of this.  But if these are different files from the
> >>>> point of view of ManifoldCF they will have different URLs and be
> >>>> treated differently.  The jobs can overlap in this case with no
> >>>> difficulty.
> >>>>
> >>>> Karl
> >>>>
> >>>> On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
> >>>> <anupamb82@gmail.com> wrote:
> >>>> > I want to configure two jobs to index into Solr using ManifoldCF with
> >>>> > the /update/extract request handler:
> >>>> > the 1st to synchronize only the XML files and the 2nd to synchronize
> >>>> > the PDF files.
> >>>> > If both of these documents share a unique id, can I combine the indexes
> >>>> > for both in one Solr schema without overwriting the details added by the
> >>>> > previous job?
> >>>> >
> >>>> > suppose,
> >>>> >       xmldoc indexes field0(id), field1, field2, field3
> >>>> > &    pdfdoc indexes field0(id), field4, field5, field6.
> >>>> >
> >>>> > Output docindex ==> (xml+pdf doc), field0(id), field1, field2, field3,
> >>>> > field4, field5, field6
> >>>> >
> >>>> > Regards
> >>>> > Anupam
> >>>> >
> >>>> >
> >>>
> >>>
> >>>
> >>>
> >
> >
> >
> > --
> > Thanks & Regards
> > Anupam Bhattacharya
>



-- 
Thanks & Regards
Anupam Bhattacharya
