lucene-java-user mailing list archives

From "Landon Cox" <>
Subject RE: best practice for indexing multiple equiv fieldnames
Date Wed, 08 May 2002 14:41:13 GMT

Thanks to Alexander and Brandon for replying.

I've made a lot of progress since first posting.  I did end up taking the
approach of one Lucene document per XML element.  That seemed to be the most
flexible if not the most intuitive approach initially.

In the process I also attached various bits of auxiliary information to
the Lucene doc, needed to find my way back to the element, including the
path to the file itself as well as the XML attributes on that tag.  That
makes indexing and searching XML attributes as easy as the tag content
itself, so that was cool.
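
A rough sketch of that per-element flattening, in plain Java.  It uses the
JDK's DOM parser and a simple field map as stand-ins for JDOM and a Lucene
document, and the field names (path, tag, content, @attribute) are made up
for illustration, not taken from the real index:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class ElementRecords {

    // Parse an XML string with the JDK's DOM parser (a stand-in for JDOM).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // One field map per element; each map would become one Lucene document.
    static List<Map<String, String>> flatten(String filePath, Document doc) {
        List<Map<String, String>> records = new ArrayList<>();
        collect(filePath, doc.getDocumentElement(), records);
        return records;
    }

    private static void collect(String filePath, Element el,
                                List<Map<String, String>> out) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("path", filePath);          // the way back to the source file
        fields.put("tag", el.getTagName());
        fields.put("content", ownText(el));
        NamedNodeMap attrs = el.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            // Hypothetical convention: prefix attribute fields with '@'.
            fields.put("@" + attr.getNodeName(), attr.getNodeValue());
        }
        out.add(fields);
        // Recurse so that every descendant element gets its own record.
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                collect(filePath, (Element) c, out);
            }
        }
    }

    // Text directly inside the element, excluding text of child elements.
    private static String ownText(Element el) {
        StringBuilder sb = new StringBuilder();
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Text) {
                sb.append(c.getNodeValue());
            }
        }
        return sb.toString().trim();
    }
}
```

In the real indexer each map would be turned into a Lucene document, with
the path field stored so it can be read back from a hit.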

Because, as Brandon and Alexander pointed out, the elements are in separate
Lucene documents, there's a second aggregation phase that's required to
filter all the Lucene doc hits down to the unique list of actual XML files.

Even the post-process works very fast as far as I can tell - I stuff a hash
table with hits where the key is the hit's path to the file (stored when
indexed), then I iterate through the hash.  Multiple hits will have the same
XML path so the hash key effectively creates a list of unique files which
qualify as a 'final' hit.  Granted, it's not sorted at that point, but
nonetheless it boils down hits to the unique list very quickly.
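
The hash-table step looks roughly like this.  This is only a sketch: the
class and method names are hypothetical, the input is assumed to be the
stored path of each hit already pulled out as a string, and the per-file
hit count is an extra that falls out of the map for free.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HitAggregator {

    // Collapse element-level hits to one entry per file path.
    // The value counts how many element hits landed in that file.
    static Map<String, Integer> uniqueFiles(List<String> hitPaths) {
        // LinkedHashMap (a choice made for this sketch) keeps files in the
        // order they were first seen; a plain hash table gives no order.
        Map<String, Integer> byFile = new LinkedHashMap<>();
        for (String path : hitPaths) {
            byFile.merge(path, 1, Integer::sum);
        }
        return byFile;
    }
}
```

The key set of the returned map is the unique list of files.  Since each
hit costs one hash lookup, the whole pass is linear in the number of hits.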

For my application and data model, the typical expansion was 1 XML file to
about 50 elements (Lucene docs).  It was reassuring to see some of the scale
numbers posted to the list in the last month re: indexing in the 15-20
million doc range, so the expansion in my case is not much of a concern,
other than that the initial indexing of that much info will take quite a
long time.

I also have implemented pre-filtering on some tags and attribute names that
are simply internal info to the XML data model, never to be searched, so
those don't get in at all.
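
The pre-filter can be as simple as two skip lists consulted before a tag or
attribute is indexed.  A minimal sketch, with hypothetical class and method
names; the actual lists depend entirely on the data model:

```java
import java.util.HashSet;
import java.util.Set;

final class IndexFilter {

    private final Set<String> skippedTags;
    private final Set<String> skippedAttributes;

    IndexFilter(Set<String> skippedTags, Set<String> skippedAttributes) {
        // Defensive copies so later changes by the caller have no effect.
        this.skippedTags = new HashSet<>(skippedTags);
        this.skippedAttributes = new HashSet<>(skippedAttributes);
    }

    // True if an element with this tag name should get a Lucene document.
    boolean shouldIndexTag(String tagName) {
        return !skippedTags.contains(tagName);
    }

    // True if this attribute should be added as a field.
    boolean shouldIndexAttribute(String attributeName) {
        return !skippedAttributes.contains(attributeName);
    }
}
```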

Not sure how you're addressing the element à la "A DOM tree location is
stored", but it made sense to me to store an XPath query string so JDOM
could address the element directly if need be once the hit is selected in
the application.  For that, when the Lucene document is created during
indexing, it's a tree walk up from the element to build the XPath query and
then the query string is attached to the Lucene document unindexed.   Would
be nice for JDOM to return the XPath to an Element object, but couldn't find
anything like that...perhaps I'm overlooking something in JDOM.
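
For reference, the tree walk is short.  The sketch below uses the JDK's
org.w3c.dom and javax.xml.xpath classes rather than JDOM, purely so it runs
without extra jars; the class and method names are hypothetical, and it
assumes documents without namespaces.  Each step carries a positional
predicate so that repeated sibling tags still resolve to one element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XPathBuilder {

    // Parse an XML string with the JDK's DOM parser (a stand-in for JDOM).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Walk from the element up to the root, prepending one "/name[n]"
    // step per level.  The loop stops at the Document node.
    static String xpathOf(Element el) {
        StringBuilder path = new StringBuilder();
        for (Node n = el; n instanceof Element; n = n.getParentNode()) {
            path.insert(0, "/" + n.getNodeName() + "[" + position(n) + "]");
        }
        return path.toString();
    }

    // 1-based position among preceding siblings with the same tag name.
    private static int position(Node el) {
        int pos = 1;
        for (Node s = el.getPreviousSibling(); s != null;
                s = s.getPreviousSibling()) {
            if (s instanceof Element
                    && s.getNodeName().equals(el.getNodeName())) {
                pos++;
            }
        }
        return pos;
    }

    // Resolve a stored path back to its element (null if nothing matches).
    static Element resolve(Document doc, String xpath) {
        try {
            return (Element) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath: " + xpath, e);
        }
    }
}
```

For the document <a><b/><c/><b><d/></b></a>, xpathOf on the d element gives
/a[1]/b[2]/d[1], and resolve with that string returns the same element.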

In any case, I've progressed quite far and really appreciate Lucene the more
I get into it.  I appreciate the responses and general approach advice as
well - it got me started in a productive direction.  Thanks,

Landon Cox
