From "Brandon Jockman" <>
Subject Re: best practice for indexing multiple equiv fieldnames
Date Wed, 08 May 2002 14:08:42 GMT

As Alexander suggested, I would also recommend breaking your XML documents
into multiple Lucene Documents, so that each element, processing instruction,
or other node of interest has its own Lucene Document. You can identify the
set of Lucene Documents that represent the same original XML document by
storing a per-document identifier on each of them.
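
A minimal sketch of the splitting idea, written in Python with only the
standard library so it runs without Lucene. Plain dicts stand in for Lucene
Documents, and the names element_records, doc_id, and path are hypothetical,
chosen for illustration only:

```python
import xml.etree.ElementTree as ET

def element_records(xml_text, doc_id):
    """Return one flat record per element, all tagged with the same doc_id."""
    root = ET.fromstring(xml_text)
    records = []

    def walk(elem, path):
        records.append({
            "doc_id": doc_id,        # ties the record back to its XML file
            "tagname": elem.tag,
            "path": path,            # child-index path from the root element
            "text": (elem.text or "").strip(),
        })
        for i, child in enumerate(elem):
            walk(child, path + "/" + str(i))

    walk(root, "0")
    return records

SAMPLE = """<blocks>
  <block id="123">some text 1</block>
  <block id="456">some text 2</block>
</blocks>"""

records = element_records(SAMPLE, "file-001")
```

Each record would become one Lucene Document in a real index, with doc_id
stored as a Field so that related Documents can be regrouped later.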

If you want to see an example of this, check out the XML-Indexing
contribution (c/o W. Eliot Kimber). This solution preserves many XML
structural relationships and makes them searchable. That includes elements
with the same name and the same parent, which remain separately
identifiable. A DOM tree location is stored with each Lucene Document, so
after a search you can navigate to the specific element in the source
document. It also indexes attributes, etc.

Specific to your requirements, I would recommend specializing the indexing
to your document type/schema so that only the elements you want to search on
are indexed. This will minimize the number of Lucene Documents created for
each real document. You will also need to store the parent's id attribute
value with each child element, selecting the children by node name. You
could then index the parent id as a Field on the child element Documents and
make it part of your queries. It should only take a few minutes to make this
change.
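
To illustrate the parent-id idea on the sample data from the original
question, here is a rough Python sketch (standard library only, not taken
from the XML-Indexing contribution). Dicts stand in for Lucene Documents, a
case-insensitive substring match stands in for a real Lucene query, and
INDEXED_TAGS, child_records, and search are hypothetical names:

```python
import xml.etree.ElementTree as ET

INDEXED_TAGS = {"author", "job"}  # assumption: the only elements worth searching

def child_records(xml_text, doc_id):
    """One record per indexed child element, carrying its parent block id."""
    root = ET.fromstring(xml_text)
    records = []
    for block in root.iter("block"):
        for child in block:
            if child.tag in INDEXED_TAGS:
                records.append({
                    "doc_id": doc_id,
                    "tagname": child.tag,
                    "parent_id": block.get("id"),  # id attribute of the parent
                    "text": (child.text or "").strip(),
                })
    return records

def search(records, tagname, term):
    """Stand-in for a query: substring match within one element name."""
    term = term.lower()
    return [r for r in records
            if r["tagname"] == tagname and term in r["text"].lower()]

SAMPLE = """<blocks>
  <block id="123"><author>Joe Blow</author><job>hack</job></block>
  <block id="456"><author>Jane Doe</author><job>President</job></block>
</blocks>"""

records = child_records(SAMPLE, "file-001")
hits = search(records, "job", "president")
```

Every hit carries its own parent_id, so a match on "job" can be traced to
block 456 rather than block 123 without any ambiguity.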

The code in the XML-Indexing contribution should be pretty self-explanatory,
but feel free to email me if you have any specific questions.

One problem you will encounter with this approach is searching for a single
XML document that contains multiple element names combined with a logical
'AND'.

For example:
tagname:book AND tagname:magazine

These two tags will be in separate Lucene Documents, so no single Document
matches both clauses. You must therefore either make a small alteration to
the query parser or add a post-search step that combines the separate result
sets based on the document id.
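
The post-search option can be sketched as a set intersection over document
ids. This is a Python illustration only; the record layout and the names
doc_ids_matching and and_across_documents are assumptions, and in practice
each per-tag result set would come from its own Lucene search:

```python
def doc_ids_matching(records, tagname):
    """Document ids of all per-element records with the given tag name."""
    return {r["doc_id"] for r in records if r["tagname"] == tagname}

def and_across_documents(records, tagnames):
    """Document ids whose XML file contains every tag name in tagnames."""
    id_sets = [doc_ids_matching(records, t) for t in tagnames]
    return set.intersection(*id_sets) if id_sets else set()

# Three per-element records drawn from two XML files.
records = [
    {"doc_id": "file-001", "tagname": "book"},
    {"doc_id": "file-001", "tagname": "magazine"},
    {"doc_id": "file-002", "tagname": "book"},
]

both = and_across_documents(records, ["book", "magazine"])
```

Only file-001 contributes a record to both result sets, so it is the only
document id that survives the intersection.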

Hope this helps,


Brandon Jockman,
Consultant, ISOGEN International, LLC.

----- Original Message -----
From: "Alexander Belskis" <>
To: "'Lucene Users List'" <>
Sent: Monday, May 06, 2002 3:47 AM
Subject: RE: best practice for indexing multiple equiv fieldnames

> Dude, Landon-
> How are you doing?  To the novice question I have what might be a novice
> answer... but hope it helps.
> I don't think that the "Lucene documents" you create and add to the index
> need to have the same structure as the "XML documents" you read.  Instead
> of creating one Lucene document for each XML document, perhaps things will
> be easier for you if you create multiple Lucene documents for each XML
> document you parse (one Lucene document for each block).
> best,
> Belskis
> --
> Alexander Belskis
> SchlumbergerSema - International Telematics Applications
> Biotechnologies & Healthcare
> c/Albasanz, 12 - 28037 Madrid (Spain)
> Tel. (+34) 91 440 8800 (Ext. 7629)
> -----Original Message-----
> From: Landon Cox []
> Sent: Wednesday, May 1, 2002 1:52
> To: Lucene Users List
> Subject: best practice for indexing multiple equiv fieldnames
> >
> >I'm planning to use Lucene to index scads of XML files whose data model
> >includes replicated blocks of tags.  Translation: a novice question
> >follows.
> >
> >My files have a common XML pattern (for illustrative purposes):
> >
> ><blocks>
> >   <block id="123">some text 1</block>
> >   <block id="456">some text 2</block>
> >   <block id="789">some text 3</block>
> ></blocks>
> >
> >Each block has a unique id, but the tagname is identical.  The actual
> >model has nested tags within these blocks, i.e. metadata with the same
> >tagnames within each block.  So, in the real data model, there are
> >multiple identical tagnames that are associated with a specific parent.
> >Something more like this:
> >
> ><blocks>
> >   <block id="123">
> >      <author>Joe Blow</author>
> >      <job>hack</job>
> >   </block>
> >   <block id="456">
> >      <author>Jane Doe</author>
> >      <job>President</job>
> >   </block>
> ></blocks>
> >
> >In the latter case, I need to be able to search by author or job, for
> >example, and get the tag's text contents as well as the parent block id.
> >
> >Adding a field name of "block" or "author" or "job" multiple times to the
> >same Lucene Document, according to the Lucene javadoc, has the effect of
> >appending the text for search purposes.  I take that to mean, in order to
> >use a 'hit' I would need to somehow uniquely identify the field from which
> >the content came even though the content was appended for search purposes.
> >
> >If I searched an 'author' field name and got a hit, I would not be able to
> >disambiguate which block id the actual hit belonged to.  Or if I searched
> >on "job", how would I know a hit belonged to block id 456 instead of block
> >id 123 parent?
> >
> >What is the Lucene approach for indexing a single document that has the
> >same field name appearing in multiple places and then using the hit to
> >find the exact association of block id in the above example?
> >
> >Hope this question makes sense.  I'm sure I'm missing something
> >obvious/simple in how the API would work in this case.  Thanks,
> >
> >Landon Cox
