manifoldcf-user mailing list archives

From Anupam Bhattacharya <anupam...@gmail.com>
Subject Re: Running 2 jobs to update same document Index but different fields
Date Wed, 28 Mar 2012 17:26:59 GMT
Should I write a new Documentum connector for our specific use case to move
forward?
I guess your book will help me understand the connector framework in
ManifoldCF.

On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright <daddywri@gmail.com> wrote:

> Right, Lucene never did allow you to modify a document's indexes, only
> replace them.  What I'm trying to tell you is that there is no reason
> to have the same document ID for both documents.  ManifoldCF will
> support treating the XML document and PDF document as different
> documents in Solr.  But if you want them to in fact be the same
> document, just combined in some way, neither ManifoldCF nor Lucene
> will support that at this time.
>
> Karl
>
>
> On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya
> <anupamb82@gmail.com> wrote:
> > I saw the index being created by the 1st PDF indexing job, which worked
> > perfectly well for a particular id. Later, when I ran the 2nd XML indexing
> > job for the same id, I lost all the fields indexed by the 1st job and was
> > left with only the field indexes created by this 2nd job.
> >
> > I thought that it would combine field values for a specified doc id.
> >
> > According to the Lucene developers, Lucene doesn't support this by
> > design.
> >
> > Please see the following URL:
> > https://issues.apache.org/jira/browse/LUCENE-3837
> >
> > -Anupam
> >
> >
> > On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright <daddywri@gmail.com> wrote:
> >>
> >> The Solr handler that you are using should not matter here.
> >>
> >> Can you look at the Simple History report, and do the following:
> >>
> >> - Look for a document that is being indexed in both PDF and XML.
> >> - Find the "ingestion" activity for that document for both PDF and XML
> >> - Compare the IDs (which for the ingestion activity are the URLs of
> >> the documents in Webtop)
> >>
> >> If the URLs are in fact different, then you should be able to make
> >> this work.  You need to look at how you configured your Solr instance,
> >> and which fields you are specifying in your Solr output connection.
> >> You want those Webtop URLs to be indexed as the unique document
> >> identifier in Solr, not some other ID.
> >>
> >> Thanks,
> >> Karl
> >>
> >>
> >> On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya
> >> <anupamb82@gmail.com> wrote:
> >> > Today I ran the 2 jobs one after the other, but it seems that, since we
> >> > are using the /update/extract request handler, the field values for a
> >> > common id get overwritten by the latest job. I want to update certain
> >> > fields in the Lucene index for the doc rather than replace it completely
> >> > with new values and lose the other field entries.
> >> >
> >> > On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright <daddywri@gmail.com>
> >> > wrote:
> >> >> For Documentum, content length is in bytes, I believe.  It does not
> >> >> set the length; it filters out all documents greater than the
> >> >> specified length.  Leaving the field blank will perform no filtering.
> >> >>
> >> >> Document types in Documentum are specified by mime type, so you'd
> >> >> want to select all that apply.  The actual one used will depend on how
> >> >> your particular instance of Documentum is configured, but if you pick
> >> >> them all you should have no problem.
> >> >>
> >> >> Karl
> >> >>
> >> >>
> >> >> On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya
> >> >> <anupamb82@gmail.com> wrote:
> >> >>> Thanks!! From your explanation, it seems that I can update other
> >> >>> field values of the same document. I asked because I have two
> >> >>> different documents with a parent-child relationship which need to be
> >> >>> indexed as one document in the Lucene index.
> >> >>>
> >> >>> As you must have understood by now, I am trying to do this for the
> >> >>> Documentum CMS. I have seen the configuration screens for setting the
> >> >>> content length and for filtering by document type. So my question is:
> >> >>> in what unit does the content length accept values (bits, bytes, KB,
> >> >>> MB, etc.), and does this setting set the length used for full-text
> >> >>> indexing of documents?
> >> >>>
> >> >>> Additionally, to scan only one kind of document, e.g. PDF, what should
> >> >>> be added to filter those documents? Is it application/pdf or PDF?
> >> >>>
> >> >>> Regards
> >> >>> Anupam
> >> >>>
> >> >>>
> >> >>> On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright <daddywri@gmail.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> The document key in Solr is the URL of the document, as constructed
> >> >>>> by the connector you are using.  If you are using the same document
> >> >>>> to construct two different Solr documents, ManifoldCF by definition
> >> >>>> cannot be aware of this.  But if these are different files from the
> >> >>>> point of view of ManifoldCF, they will have different URLs and be
> >> >>>> treated differently.  The jobs can overlap in this case with no
> >> >>>> difficulty.
> >> >>>>
> >> >>>> Karl
> >> >>>>
> >> >>>> On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
> >> >>>> <anupamb82@gmail.com> wrote:
> >> >>>> > I want to configure two jobs to index into Solr using ManifoldCF
> >> >>>> > with the /update/extract requestHandler:
> >> >>>> > the 1st to synchronize only XML files and the 2nd to synchronize
> >> >>>> > the PDF files.
> >> >>>> > If both of these documents share a unique id, can I combine the
> >> >>>> > indexes for both in 1 Solr schema without overwriting the details
> >> >>>> > added by the previous job?
> >> >>>> >
> >> >>>> > suppose,
> >> >>>> >       xmldoc indexes field0(id), field1, field2, field3
> >> >>>> > &    pdfdoc indexes field0(id), field4, field5, field6.
> >> >>>> >
> >> >>>> > Output docindex ==> (xml+pdf doc), field0(id), field1, field2,
> >> >>>> > field3, field4, field5, field6
> >> >>>> >
> >> >>>> > Regards
> >> >>>> > Anupam
> >> >>>> >
> >> >>>> >
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks & Regards
> >> > Anupam Bhattacharya
> >
> >
> >
> >
> > --
> > Thanks & Regards
> > Anupam Bhattacharya
> >
> >
>



-- 
Thanks & Regards
Anupam Bhattacharya
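
The replace-versus-merge behavior discussed in this thread can be sketched with a toy model. This is not ManifoldCF, Solr, or Lucene code: ToyIndex, the document IDs, and the field values below are all hypothetical, and the only assumption carried over from the thread is that adding a document under an existing unique ID replaces the whole document. The third case (merging fields before indexing) is one possible approach a custom connector might take, not something the thread confirms as supported.

```python
from typing import Dict

Fields = Dict[str, str]


class ToyIndex:
    """Hypothetical in-memory stand-in for an index keyed by a unique ID."""

    def __init__(self) -> None:
        self.docs: Dict[str, Fields] = {}

    def add(self, doc_id: str, fields: Fields) -> None:
        # Adding under an existing ID replaces the whole document;
        # fields from the earlier version are not carried over.
        self.docs[doc_id] = dict(fields)


# Illustrative field sets, named after the example in the thread.
xml_fields = {"field1": "a", "field2": "b", "field3": "c"}
pdf_fields = {"field4": "d", "field5": "e", "field6": "f"}

# Case 1: both jobs use the same ID, so the second add wins.
same_id = ToyIndex()
same_id.add("doc-1", pdf_fields)
same_id.add("doc-1", xml_fields)

# Case 2: each source file keeps its own ID (for example its own URL),
# so both documents survive side by side.
distinct_ids = ToyIndex()
distinct_ids.add("doc-1.pdf", pdf_fields)
distinct_ids.add("doc-1.xml", xml_fields)

# Case 3 (assumption): merge the fields before indexing, then add once.
merged = ToyIndex()
merged.add("doc-1", {**xml_fields, **pdf_fields})
```

In case 1 only the XML fields remain under "doc-1", which matches what was observed after running the second job. Case 2 corresponds to letting each file keep its own URL as the key, and case 3 would require combining the two sources before the document ever reaches the index.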
