manifoldcf-user mailing list archives

From Anupam Bhattacharya <anupam...@gmail.com>
Subject Re: Running 2 jobs to update same document Index but different fields
Date Thu, 29 Mar 2012 09:44:29 GMT
I ran a single job covering both the XML and PDF document types, and again I
lost all the indexed documents when the job finished.

The REJECTED result was for one document only, which may have confused you.
Fetching works for many documents, with SUCCESS status.

While the job was running I could see both the PDF and XML documents indexed
in the Solr admin screen. But the moment the job finished, everything was gone.

Start Time  Activity  Identifier  Result Code  Bytes  Time  Result Description
03-29-2012 14:41:29.053 document deletion (Solr_Test_QA)
http://example.domain.com:8088/webtop/component/drl?versio...
nLabel=CURRENT&objectId=09d905e78004f63f
200 0 1
03-29-2012 14:33:40.741 document ingest (Solr_Test_QA)
http://example.domain.com:8088/webtop/component/drl?versio...
nLabel=CURRENT&objectId=09d905e78004f63f
200 1491156
03-29-2012 14:33:38.758 fetch 09d905e78004f63f
Success 14911780
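
One way to confirm whether a given document is still in the index after the job ends is to look it up in Solr by its unique key. The following is a minimal sketch only: the Solr location (`http://localhost:8983/solr`), the unique key field name (`id`), and the example document id are assumptions for illustration, not values taken from this thread.

```python
from urllib.parse import urlencode


def solr_id_lookup_url(solr_base, doc_id, id_field="id"):
    """Build a /select URL that asks Solr for the one document with this unique key."""
    # Escape backslashes and double quotes, then wrap the id in quotes so that
    # characters such as ':' and '?' in a URL-style id are not read as query syntax.
    escaped = doc_id.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'{id_field}:"{escaped}"', "rows": 1, "wt": "json"}
    return solr_base.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical Solr location and document id, for illustration only.
lookup_url = solr_id_lookup_url(
    "http://localhost:8983/solr",
    "http://example.invalid/drl?objectId=0001",
)
print(lookup_url)
```

Fetching that URL while the job runs and again after it finishes would show whether `numFound` drops from 1 to 0 for the same id.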

On Thu, Mar 29, 2012 at 2:01 PM, Karl Wright <daddywri@gmail.com> wrote:

>
> The "REJECTED" result is because the document has the wrong mime type or
> is too long, according to your length restriction.  Do you have just one
> job, or do you still have two?  If you have two jobs covering the same
> overall documents with different document criteria, this is the kind of
> thing that happens when you run one job after the other; the documents
> belonging to the first are deleted.  You will only need one job if you try the plan I
> was talking about, but it should include the PDFs as well as the XML
> documents.
>
> If you only have one job, then I can't explain it unless you changed the
> document criteria and ran the job a second time.
>
> Karl
>
>
>
>
> On Thu, Mar 29, 2012 at 3:39 AM, Anupam Bhattacharya <anupamb82@gmail.com> wrote:
>
>> Okay. I tried using the id that is formed by the ManifoldCF Documentum
>> connector. I ran the job, and while it was running I could see from the
>> Solr admin screen that documents were getting indexed. But just after the
>> job ended, all the indexed documents were deleted.
>>
>> Snippet from Simple History is given below.
>>
>> Why is this document deletion activity added, deleting everything I
>> indexed, when I keep the unique id as "id" in Solr's schema.xml file?
>>
>> Start Time  Activity  Identifier  Result Code  Bytes  Time  Result Description
>> 03-29-2012 13:00:26.837 document deletion (Solr_TEST_QA)
>> http://example.domain.com:8088/webtop/component/drl?versio...
>> nLabel=CURRENT&objectId=09d905e78000676d
>> 200 0 110
>> 03-29-2012 12:55:37.869 fetch 09d905e78000676d
>> REJECTED 86823 4184
>> 03-29-2012 12:55:34.934 document ingest (Solr_TEST_QA)
>> http://example.domain.com:8088/webtop/component/drl?versio...
>> nLabel=CURRENT&objectId=09d905e78000676d
>> 200 8158 235
>>
>> On Thu, Mar 29, 2012 at 12:41 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> "So do you find this design appropriate and feasible ?"  It sounds
>>> like you are still trying to merge records in Solr but this time using
>>> Solr Cell to somehow do this.  Since SolrCell is a pipeline, I don't
>>> think you will find it easy to keep data from one job aligned with
>>> data from another.  That's why I suggested just allowing both kinds of
>>> documents to be indexed as-is, and just making sure that you include a
>>> metadata reference to the main document in each.
>>>
>>> Karl
>>>
>>>
>>> On Wed, Mar 28, 2012 at 2:43 PM, Anupam Bhattacharya
>>> <anupamb82@gmail.com> wrote:
>>> > The second option seems more useful, as it will allow me to add any
>>> > business logic.
>>> > So, similar to Solr Cell (/update/extract), my new RequestHandler will be
>>> > added in solrconfig.xml and will do all the manipulations.
>>> > Later, I need to get all field values into a temp variable by first
>>> > searching by id in the Lucene index, and then add these values to the
>>> > incoming list of new field values.
>>> >
>>> > So do you find this design appropriate and feasible ?
>>> >
>>> > Anupam
>>> >
>>> > On Wed, Mar 28, 2012 at 11:46 PM, Karl Wright <daddywri@gmail.com> wrote:
>>> >>
>>> >> Thanks - now I understand what you are trying to do more clearly.
>>> >>
>>> >> The Documentum connector is going to pick up the XML document and the
>>> >> PDF document as separate entities.  Thus, they'd also be indexed in
>>> >> Solr separately.  So if we use that as a starting point, let's see
>>> >> where it might lead.
>>> >>
>>> >> First, you'd want each PDF document to have metadata that refers back
>>> >> to the XML parent document.  I'm not sure how easy it is to set up
>>> >> such a metadata reference in Documentum, but I vaguely recall there
>>> >> was indeed some such field.  So let's presume you can get that.  Then,
>>> >> you'd want to make sure your Solr schema included an "XML document"
>>> >> field, which had the URL of the parent XML document (or, for XML
>>> >> documents, the document's own URL) as content.  That would be the ID
>>> >> you'd use to present a result item to a user.
>>> >>
>>> >> Does this sound reasonable so far?
>>> >>
>>> >> The only other piece you might need is manipulation of either the
>>> >> PDF's metadata, or the XML document's metadata, or both.  For that,
>>> >> I'd use Solr Cell to perform whatever mappings and manipulations made
>>> >> sense before the documents actually get indexed.
>>> >>
>>> >> Karl
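
The parent-reference design described above can be sketched in a few lines. This is an illustration only, using an in-memory list in place of Solr; the field names `xml_document`, `text`, and `title`, and the example ids, are assumptions rather than anything defined by ManifoldCF or Solr.

```python
def search_parents(docs, term, parent_field="xml_document"):
    """Return the parent XML record for every document whose text contains term."""
    by_id = {d["id"]: d for d in docs}
    seen, results = set(), []
    for d in docs:
        if term.lower() in d.get("text", "").lower():
            parent_id = d[parent_field]
            # Report each parent once, even if several children match.
            if parent_id in by_id and parent_id not in seen:
                seen.add(parent_id)
                results.append(by_id[parent_id])
    return results


# Hypothetical documents: XML records point at themselves, PDFs at their parent.
docs = [
    {"id": "url/xml-1", "xml_document": "url/xml-1", "title": "Paper A",
     "text": "bibliographic record for paper A"},
    {"id": "url/pdf-1", "xml_document": "url/xml-1",
     "text": "full text that mentions crawlers"},
    {"id": "url/xml-2", "xml_document": "url/xml-2", "title": "Paper B",
     "text": "bibliographic record for paper B"},
]

titles = [d["title"] for d in search_parents(docs, "crawlers")]
print(titles)
```

Both kinds of documents stay separate in the index; only the presentation step maps a PDF hit back to its XML parent.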
>>> >>
>>> >> On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya
>>> >> <anupamb82@gmail.com> wrote:
>>> >> > I would have been happy if I only had to index PDF and XML separately.
>>> >> > But in my use case, the XML is the main document, containing
>>> >> > bibliographic information (which needs to be presented as the search
>>> >> > result), and it contains a reference to a child/supporting document,
>>> >> > which is the actual PDF file. I need to index the PDF text, and if a
>>> >> > search matches the PDF content, then the parent XML bibliographic
>>> >> > information needs to be presented.
>>> >> >
>>> >> > I am trying to call the Solr search engine with one single query to
>>> >> > show the corresponding XML details for a search term present in the
>>> >> > PDF. I checked, and from Solr 4.x the Solr Join plugin is introduced
>>> >> > (http://wiki.apache.org/solr/Join), but it works like an inner query.
>>> >> >
>>> >> > Again, the main requirement is that the PDF should be searchable, but
>>> >> > only its master details from the XML should be presented, so that the
>>> >> > user can then request the actual PDF.
>>> >> >
>>> >> > -Anupam
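
For reference, a join query of the kind mentioned above might look like the following sketch. The `{!join from=... to=...}` local-params syntax is the one documented for Solr 4.x; the field names (`xml_document` on the PDF documents, `text` for the extracted content, `id` as the unique key) are assumptions for illustration, and the helper only handles a single-word search term.

```python
from urllib.parse import urlencode


def join_query_params(term, text_field="text",
                      parent_ref_field="xml_document", id_field="id"):
    """Build Solr request parameters that return parents of matching children."""
    # {!join from=A to=B}q : run q, collect the values of field A from the
    # matches, then return the documents whose field B holds one of those values.
    q = "{!join from=%s to=%s}%s:%s" % (parent_ref_field, id_field, text_field, term)
    return {"q": q, "wt": "json"}


params = join_query_params("crawler")
query_string = urlencode(params)
print(params["q"])
print(query_string)
```

With these assumed fields, the inner query matches PDF text and the join returns the XML documents whose `id` equals the PDFs' `xml_document` value.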
>>> >> >
>>> >> > On Wed, Mar 28, 2012 at 11:06 PM, Karl Wright <daddywri@gmail.com> wrote:
>>> >> >>
>>> >> >> This doesn't sound like a problem a connector can solve.  The problem
>>> >> >> sounds like severe misuse of Solr/Lucene to me.  You are using the
>>> >> >> wrong document key and Lucene does not let you modify a document index
>>> >> >> once it is created, and no matter what you do to ManifoldCF it can't
>>> >> >> get around that restriction.  So it sounds like you need to
>>> >> >> fundamentally rethink your design.
>>> >> >>
>>> >> >> If all you want to do is index XML and PDF as separate documents, just
>>> >> >> change your Solr output connection specification to change the
>>> >> >> selected "id" field appropriately.  Then, BOTH documents will be
>>> >> >> indexed by Solr, each with different metadata as you originally
>>> >> >> specified.  I'm frankly having a really hard time seeing why this is
>>> >> >> so hard.
>>> >> >>
>>> >> >> Karl
>>> >> >>
>>> >> >>
>>> >> >> On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya
>>> >> >> <anupamb82@gmail.com> wrote:
>>> >> >> > Should I write a new Documentum connector for our specific use case
>>> >> >> > to go forward?
>>> >> >> > I guess your book will be helpful for understanding the connector
>>> >> >> > framework in ManifoldCF.
>>> >> >> >
>>> >> >> > On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright <daddywri@gmail.com> wrote:
>>> >> >> >>
>>> >> >> >> Right, LUCENE never did allow you to modify a document's indexes, only
>>> >> >> >> replace them.  What I'm trying to tell you is that there is no reason
>>> >> >> >> to have the same document ID for both documents.  ManifoldCF will
>>> >> >> >> support treating the XML document and PDF document as different
>>> >> >> >> documents in Solr.  But if you want them to in fact be the same
>>> >> >> >> document, just combined in some way, neither ManifoldCF nor Lucene
>>> >> >> >> will support that at this time.
>>> >> >> >>
>>> >> >> >> Karl
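
The replace-only behaviour described above can be modelled with a small sketch. This is a toy stand-in, not Lucene or Solr code; the class name, ids, and field names are hypothetical.

```python
class ReplaceOnlyIndex:
    """Toy index keyed by document id: adding a document replaces any earlier one."""

    def __init__(self):
        self._docs = {}

    def add(self, doc):
        # The whole stored document is swapped out; fields are never merged.
        self._docs[doc["id"]] = dict(doc)

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def __len__(self):
        return len(self._docs)


# Two jobs writing under one shared id: the second write wins outright.
shared = ReplaceOnlyIndex()
shared.add({"id": "shared-1", "field4": "pdf text"})   # job 1 (PDF)
shared.add({"id": "shared-1", "field1": "xml title"})  # job 2 (XML)
after_shared = shared.get("shared-1")

# The same two writes under distinct ids: both documents survive.
distinct = ReplaceOnlyIndex()
distinct.add({"id": "url/pdf-1", "field4": "pdf text"})
distinct.add({"id": "url/xml-1", "field1": "xml title"})

print(after_shared, len(distinct))
```

With a shared id, the PDF fields are gone after the XML job runs; with distinct ids, nothing is lost.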
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya
>>> >> >> >> <anupamb82@gmail.com> wrote:
>>> >> >> >> > I saw the index getting created by the 1st (PDF) indexing job, which
>>> >> >> >> > worked perfectly well for a particular id. Later, when I ran the 2nd
>>> >> >> >> > (XML) indexing job for the same id, I lost all fields indexed by the
>>> >> >> >> > 1st job and was left with only the fields indexed by the 2nd job.
>>> >> >> >> >
>>> >> >> >> > I thought that it would combine field values for a specified doc id.
>>> >> >> >> >
>>> >> >> >> > The Lucene developers mention that, by design, Lucene doesn't support
>>> >> >> >> > this.
>>> >> >> >> >
>>> >> >> >> > Pls. see following url ::
>>> >> >> >> > https://issues.apache.org/jira/browse/LUCENE-3837
>>> >> >> >> >
>>> >> >> >> > -Anupam
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright <daddywri@gmail.com> wrote:
>>> >> >> >> >>
>>> >> >> >> >> The Solr handler that you are using should not matter here.
>>> >> >> >> >>
>>> >> >> >> >> Can you look at the Simple History report, and do the following:
>>> >> >> >> >>
>>> >> >> >> >> - Look for a document that is being indexed in both PDF and XML.
>>> >> >> >> >> - Find the "ingestion" activity for that document for both PDF and XML.
>>> >> >> >> >> - Compare the IDs (which for the ingestion activity are the URLs of
>>> >> >> >> >>   the documents in Webtop).
>>> >> >> >> >>
>>> >> >> >> >> If the URLs are in fact different, then you should be able to make
>>> >> >> >> >> this work.  You need to look at how you configured your Solr instance,
>>> >> >> >> >> and which fields you are specifying in your Solr output connection.
>>> >> >> >> >> You want those Webtop URLs to be indexed as the unique document
>>> >> >> >> >> identifier in Solr, not some other ID.
>>> >> >> >> >>
>>> >> >> >> >> Thanks,
>>> >> >> >> >> Karl
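
The comparison steps above could be sketched roughly as follows, assuming the Simple History rows have already been copied into dictionaries by hand. The record layout, the `activity` and `identifier` keys, and the example values are hypothetical; this is not a ManifoldCF API.

```python
def ingestion_identifiers(records):
    """Return the identifiers of all 'document ingest' activities, in order."""
    return [r["identifier"] for r in records
            if r["activity"].startswith("document ingest")]


def same_identifier(pdf_records, xml_records):
    """True if any PDF ingestion shares its identifier with an XML ingestion."""
    return bool(set(ingestion_identifiers(pdf_records))
                & set(ingestion_identifiers(xml_records)))


# Hypothetical rows for one PDF and one XML document.
pdf_rows = [
    {"activity": "fetch", "identifier": "0001"},
    {"activity": "document ingest (Solr)", "identifier": "http://example.invalid/drl?objectId=0001"},
]
xml_rows = [
    {"activity": "fetch", "identifier": "0002"},
    {"activity": "document ingest (Solr)", "identifier": "http://example.invalid/drl?objectId=0002"},
]

collision = same_identifier(pdf_rows, xml_rows)
print(collision)
```

If the identifiers differ, as in this made-up data, the two documents can coexist in Solr provided those URLs are used as the unique key.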
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya
>>> >> >> >> >> <anupamb82@gmail.com> wrote:
>>> >> >> >> >> > Today I ran the 2 jobs one after the other, but it seems that, since
>>> >> >> >> >> > we are using the /update/extract request handler, the field values
>>> >> >> >> >> > for a common id get overwritten by the latest job. I want to update
>>> >> >> >> >> > certain fields in the Lucene index for the doc, rather than completely
>>> >> >> >> >> > replacing it with new values and losing the other field values.
>>> >> >> >> >> >
>>> >> >> >> >> > On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright <daddywri@gmail.com> wrote:
>>> >> >> >> >> >> For Documentum, content length is in bytes, I believe.  It does not
>>> >> >> >> >> >> set the length, it filters out all documents greater than the
>>> >> >> >> >> >> specified length.  Leaving the field blank will perform no filtering.
>>> >> >> >> >> >>
>>> >> >> >> >> >> Document types in Documentum are specified by mime type, so you'd want
>>> >> >> >> >> >> to select all that apply.  The actual one used will depend on how your
>>> >> >> >> >> >> particular instance of Documentum is configured, but if you pick them
>>> >> >> >> >> >> all you should have no problem.
>>> >> >> >> >> >>
>>> >> >> >> >> >> Karl
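
The filtering rules described above can be written out as a small model. This is a sketch of the stated behaviour only, not the connector's actual code; the function name, parameters, and the `text/xml` mime type are assumptions.

```python
def passes_filter(mime_type, length_bytes, allowed_mime_types, max_length_bytes=None):
    """Decide whether a document would be kept under the described rules."""
    # Only documents whose mime type was selected are considered at all.
    if mime_type not in allowed_mime_types:
        return False
    # A blank (None) maximum means no length filtering; otherwise documents
    # greater than the limit are filtered out, not truncated.
    if max_length_bytes is not None and length_bytes > max_length_bytes:
        return False
    return True


allowed = {"application/pdf", "text/xml"}  # hypothetical selection

print(passes_filter("application/pdf", 86823, allowed, max_length_bytes=50000))
print(passes_filter("application/pdf", 86823, allowed))
```

The first call models a document rejected for being over the limit; the second shows the same document accepted when the limit is left blank.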
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >> On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya
>>> >> >> >> >> >> <anupamb82@gmail.com> wrote:
>>> >> >> >> >> >>> Thanks!! It seems from your explanation that I can update other field
>>> >> >> >> >> >>> values of the same document. I asked about this because I have two
>>> >> >> >> >> >>> different documents with a parent-child relationship, which need to be
>>> >> >> >> >> >>> indexed as one document in the Lucene index.
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> As you must have understood by now, I am trying to do this for the
>>> >> >> >> >> >>> Documentum CMS. I have seen the configuration screens for setting the
>>> >> >> >> >> >>> content length and for filtering by document type. So my question is:
>>> >> >> >> >> >>> in what unit does the content length accept values (bits, bytes, KB,
>>> >> >> >> >> >>> MB, etc.), and does this setting set the length of documents for
>>> >> >> >> >> >>> full-text indexing?
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> Additionally, to scan only one kind of document, e.g. PDF, what should
>>> >> >> >> >> >>> be added to filter those documents? Is it application/pdf or PDF?
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> Regards
>>> >> >> >> >> >>> Anupam
>>> >> >> >> >> >>>
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright <daddywri@gmail.com> wrote:
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>> The document key in Solr is the url of the document, as constructed by
>>> >> >> >> >> >>>> the connector you are using.  If you are using the same document to
>>> >> >> >> >> >>>> construct two different Solr documents, ManifoldCF by definition
>>> >> >> >> >> >>>> cannot be aware of this.  But if these are different files from the
>>> >> >> >> >> >>>> point of view of ManifoldCF they will have different URLs and be
>>> >> >> >> >> >>>> treated differently.  The jobs can overlap in this case with no
>>> >> >> >> >> >>>> difficulty.
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>> Karl
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>> On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
>>> >> >> >> >> >>>> <anupamb82@gmail.com> wrote:
>>> >> >> >> >> >>>> > I want to configure two jobs to index into Solr using ManifoldCF, using
>>> >> >> >> >> >>>> > the /update/extract request handler:
>>> >> >> >> >> >>>> > the 1st to synchronize only the XML files and the 2nd to synchronize
>>> >> >> >> >> >>>> > the PDF files.
>>> >> >> >> >> >>>> > If both these documents share a unique id, can I combine the indexes
>>> >> >> >> >> >>>> > for both in one Solr schema without overwriting the details added by
>>> >> >> >> >> >>>> > the previous job?
>>> >> >> >> >> >>>> >
>>> >> >> >> >> >>>> > Suppose
>>> >> >> >> >> >>>> >       xmldoc indexes field0(id), field1, field2, field3
>>> >> >> >> >> >>>> > &    pdfdoc indexes field0(id), field4, field5, field6.
>>> >> >> >> >> >>>> >
>>> >> >> >> >> >>>> > Output docindex ==> (xml+pdf doc), field0(id), field1, field2, field3,
>>> >> >> >> >> >>>> > field4, field5, field6
>>> >> >> >> >> >>>> >
>>> >> >> >> >> >>>> > Regards
>>> >> >> >> >> >>>> > Anupam


-- 
Thanks & Regards
Anupam Bhattacharya
