manifoldcf-user mailing list archives

From Anupam Bhattacharya <anupam...@gmail.com>
Subject Re: Running 2 jobs to update same document Index but different fields
Date Wed, 28 Mar 2012 18:03:40 GMT
I would have been happy if I only had to index the PDF and XML separately.
But in my use case, the XML is the main document: it contains the
bibliographic information (which needs to be presented as the search result)
and a reference to a child/supporting document, which is the actual PDF file.
I need to index the PDF text, and if a search matches the PDF content, then
the parent XML bibliographic information needs to be presented.
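
A minimal sketch of that parent/child layout, assuming the two files are
indexed as separate documents with distinct ids and the PDF carries a
reference back to its XML parent. The field names (id, parent_id, title,
text) and the sample values are hypothetical, and the lookup is done in
memory only to show the intended behaviour, not how Solr implements it:

```python
# Hypothetical documents: the XML record is the parent, the PDF is the child.
xml_doc = {"id": "xml-001", "title": "Sample bibliographic record"}
pdf_doc = {"id": "pdf-001", "parent_id": "xml-001",
           "text": "full text extracted from the PDF"}

index = [xml_doc, pdf_doc]


def parents_matching(term, docs):
    """Return the parent records of every child whose text contains term."""
    by_id = {d["id"]: d for d in docs}
    # Collect the parent ids of all children whose text matches the term.
    parent_ids = {
        d["parent_id"]
        for d in docs
        if "parent_id" in d and term.lower() in d.get("text", "").lower()
    }
    # Present the parent (XML) records, never the matching children.
    return [by_id[p] for p in sorted(parent_ids) if p in by_id]


if __name__ == "__main__":
    print(parents_matching("extracted", index))
```

The match happens on the PDF text, but only the XML record is returned.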

I am trying to call the Solr search engine with one single query to show the
corresponding XML details for a search term present in the PDF. I checked
and found that a Join feature is introduced from Solr 4.x
(http://wiki.apache.org/solr/Join), but it works like an inner query.
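
As a rough sketch of what such a join request could look like, the snippet
below builds the query string for Solr's {!join from=... to=...} local-params
syntax. The field names (parent_id, id, text, title) are assumptions about
the schema, not something taken from an actual configuration:

```python
from urllib.parse import parse_qs, urlencode


def build_join_query(term, from_field="parent_id", to_field="id",
                     text_field="text"):
    """Build URL-encoded parameters for a Solr join query.

    The inner query matches child (PDF) documents on text_field; the join
    maps their from_field values onto to_field, so the parent (XML)
    documents are what the search returns.
    """
    q = "{!join from=%s to=%s}%s:%s" % (from_field, to_field, text_field, term)
    # "fl" limits the returned fields; "title" is a hypothetical field name.
    return urlencode({"q": q, "fl": "id,title", "wt": "json"})


if __name__ == "__main__":
    params = build_join_query("manifoldcf")
    # Decode again to show the query in readable form.
    print(parse_qs(params)["q"][0])
```

The resulting string would be appended to the select URL of a Solr core,
whose host and path depend on the installation.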

Again, the main requirement is that the PDF should be searchable, but only
its master details from the XML should be presented, from which the actual
PDF can then be requested.

-Anupam

On Wed, Mar 28, 2012 at 11:06 PM, Karl Wright <daddywri@gmail.com> wrote:

> This doesn't sound like a problem a connector can solve.  The problem
> sounds like severe misuse of Solr/Lucene to me.  You are using the
> wrong document key and Lucene does not let you modify a document index
> once it is created, and no matter what you do to ManifoldCF it can't
> get around that restriction.  So it sounds like you need to
> fundamentally rethink your design.
>
> If all you want to do is index XML and PDF as separate documents, just
> change your Solr output connection specification to change the
> selected "id" field appropriately.  Then, BOTH documents will be
> indexed by Solr, each with different metadata as you originally
> specified.  I'm frankly having a really hard time seeing why this is
> so hard.
>
> Karl
>
>
> On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya
> <anupamb82@gmail.com> wrote:
> > Should I write a new Documentum connector for our specific use case to
> > go forward?
> > I guess your book will be helpful for understanding the connector
> > framework in ManifoldCF.
> >
> > On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright <daddywri@gmail.com> wrote:
> >>
> >> Right, LUCENE never did allow you to modify a document's indexes, only
> >> replace them.  What I'm trying to tell you is that there is no reason
> >> to have the same document ID for both documents.  ManifoldCF will
> >> support treating the XML document and PDF document as different
> >> documents in Solr.  But if you want them to in fact be the same
> >> document, just combined in some way, neither ManifoldCF nor Lucene
> >> will support that at this time.
> >>
> >> Karl
> >>
> >>
> >> On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya
> >> <anupamb82@gmail.com> wrote:
> >> > I saw the index being created by the 1st (PDF) indexing job, which
> >> > worked perfectly well for a particular id. Later, when I ran the 2nd
> >> > (XML) indexing job for the same id, I lost all the fields indexed by
> >> > the 1st job and was left with only the field indexes created by this
> >> > 2nd job.
> >> >
> >> > I thought that it would combine the field values for a specified doc id.
> >> >
> >> > The Lucene developers say that, by design, Lucene doesn't support
> >> > this.
> >> >
> >> > Please see the following URL:
> >> > https://issues.apache.org/jira/browse/LUCENE-3837
> >> >
> >> > -Anupam
> >> >
> >> >
> >> > On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright <daddywri@gmail.com> wrote:
> >> >>
> >> >> The Solr handler that you are using should not matter here.
> >> >>
> >> >> Can you look at the Simple History report, and do the following:
> >> >>
> >> >> - Look for a document that is being indexed in both PDF and XML.
> >> >> - Find the "ingestion" activity for that document for both PDF and XML.
> >> >> - Compare the IDs (which for the ingestion activity are the URLs of
> >> >> the documents in Webtop).
> >> >>
> >> >> If the URLs are in fact different, then you should be able to make
> >> >> this work.  You need to look at how you configured your Solr instance,
> >> >> and which fields you are specifying in your Solr output connection.
> >> >> You want those Webtop URLs to be indexed as the unique document
> >> >> identifier in Solr, not some other ID.
> >> >>
> >> >> Thanks,
> >> >> Karl
> >> >>
> >> >>
> >> >> On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya
> >> >> <anupamb82@gmail.com> wrote:
> >> >> > Today I ran the 2 jobs one after the other, but it seems that since
> >> >> > we are using the /update/extract request handler, the field values
> >> >> > for a common id get overwritten by the latest job. I want to update
> >> >> > certain fields in the Lucene index for the doc rather than replace
> >> >> > it completely with new values and lose the other field values.
> >> >> >
> >> >> > On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright <daddywri@gmail.com>
> >> >> > wrote:
> >> >> >> For Documentum, content length is in bytes, I believe.  It does not
> >> >> >> set the length, it filters out all documents greater than the
> >> >> >> specified length.  Leaving the field blank will perform no
> >> >> >> filtering.
> >> >> >>
> >> >> >> Document types in Documentum are specified by mime type, so you'd
> >> >> >> want to select all that apply.  The actual one used will depend on
> >> >> >> how your particular instance of Documentum is configured, but if you
> >> >> >> pick them all you should have no problem.
> >> >> >>
> >> >> >> Karl
> >> >> >>
> >> >> >>
> >> >> >> On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya
> >> >> >> <anupamb82@gmail.com> wrote:
> >> >> >>> Thanks!! It seems from your explanation that I can update the
> >> >> >>> other field values of the same document. I asked about this
> >> >> >>> because I have two different documents with a parent-child
> >> >> >>> relationship which need to be indexed as one document in the
> >> >> >>> Lucene index.
> >> >> >>>
> >> >> >>> As you must have understood by now, I am trying to do this for
> >> >> >>> Documentum CMS. I have seen the configuration screen for setting
> >> >> >>> the content length and for filtering by document type. So my
> >> >> >>> question is: in what unit does the content length accept values
> >> >> >>> (bits, bytes, KB, MB, etc.), and does this setting set the length
> >> >> >>> used for full-text indexing of documents?
> >> >> >>>
> >> >> >>> Additionally, to scan only one kind of document, e.g. PDF, what
> >> >> >>> should be added to filter those documents? Is it application/pdf
> >> >> >>> or PDF?
> >> >> >>>
> >> >> >>> Regards
> >> >> >>> Anupam
> >> >> >>>
> >> >> >>>
> >> >> >>> On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright <daddywri@gmail.com>
> >> >> >>> wrote:
> >> >> >>>>
> >> >> >>>> The document key in Solr is the url of the document, as
> >> >> >>>> constructed by the connector you are using.  If you are using the
> >> >> >>>> same document to construct two different Solr documents,
> >> >> >>>> ManifoldCF by definition cannot be aware of this.  But if these
> >> >> >>>> are different files from the point of view of ManifoldCF they
> >> >> >>>> will have different URLs and be treated differently.  The jobs
> >> >> >>>> can overlap in this case with no difficulty.
> >> >> >>>>
> >> >> >>>> Karl
> >> >> >>>>
> >> >> >>>> On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
> >> >> >>>> <anupamb82@gmail.com> wrote:
> >> >> >>>> > I want to configure two jobs to index into Solr using
> >> >> >>>> > ManifoldCF with the /update/extract request handler:
> >> >> >>>> > the 1st to synchronize only the XML files and the 2nd to
> >> >> >>>> > synchronize the PDF files.
> >> >> >>>> > If both these documents share a unique id, can I combine the
> >> >> >>>> > indexes for both in one Solr schema without overwriting the
> >> >> >>>> > details added by the previous job?
> >> >> >>>> >
> >> >> >>>> > Suppose
> >> >> >>>> >       xmldoc indexes field0(id), field1, field2, field3
> >> >> >>>> > &    pdfdoc indexes field0(id), field4, field5, field6.
> >> >> >>>> >
> >> >> >>>> > Output docindex ==> (xml+pdf doc), field0(id), field1, field2,
> >> >> >>>> > field3, field4, field5, field6
> >> >> >>>> >
> >> >> >>>> > Regards
> >> >> >>>> > Anupam
> >> >> >>>> >
> >> >> >>>> >
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Thanks & Regards
> >> >> > Anupam Bhattacharya
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks & Regards
> >> > Anupam Bhattacharya
> >> >
> >> >
> >
> >
> >
> >
> > --
> > Thanks & Regards
> > Anupam Bhattacharya
> >
> >
>



-- 
Thanks & Regards
Anupam Bhattacharya
