lucene-java-user mailing list archives

From "Steven D. Majewski"
Subject Re: Lucene search question
Date Tue, 13 Nov 2007 16:59:34 GMT

On Nov 13, 2007, at 7:21 AM, Cláudio Fernandes wrote:

> Hello all,
> I don't know if this is a somewhat naive question, but here we go:
> Does Lucene support indexing by sections? Like having a text document
> with three sections divided by XML tags, indexed in a way that lets us
> search by word and section. Does Lucene itself support this kind of
> indexing, or should it be used with other engines like Cocoon?
> Thanks in advance for your time,

Depends on what you mean by sections.
If your document divides up simply into fixed fields:
      <title>...</title>, <author>...</author>, <body>...</body>
or:   <part1>...</part1>, <part2>...</part2>, <part3>...</part3>
then you can make those into fields of your Lucene index.
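
To make the fixed-fields case concrete, here is a rough sketch of the
extraction step only. It uses just the JDK's XML parser and makes no
Lucene calls; the class name SectionFields, the method extractFields and
the sample markup are made up for illustration. Each name/text pair it
returns is what you would then add as one field of a single Lucene
document.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Lucene, only prepares field name/value pairs.
class SectionFields {

    // Maps each direct child element of the root to its trimmed text content.
    // If a tag name repeats, the later occurrence overwrites the earlier one.
    static Map<String, String> extractFields(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                fields.put(n.getNodeName(), n.getTextContent().trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Moby Dick</title><author>Herman Melville</author>"
                   + "<body>Call me Ishmael.</body></doc>";
        // One line per would-be index field.
        extractFields(xml).forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

A search "by word and section" then becomes a query for the word
restricted to the field named after that section.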

But if there isn't a fixed number of sections, then fields probably
won't work. Lucene doesn't itself handle nesting or inclusion, so finding
text within some arbitrary div, or finding the div holding the text,
is not so straightforward. However, Lucene has a flexible notion
of what a 'document' is. (Basically, it's whatever unit you feed
it as a document.) So if this is what you need, you might be able
to make each <div> into a "document" rather than each file.

  If you were indexing a large TEI text and wanted to return the
chapter where the text was found, you could make each chapter a document,
and each document would have indexed fields to store the common header
info as well as the name of the file containing the chapter.
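
  As a rough illustration of the one-document-per-chapter idea, the
sketch below splits a TEI-like file into one unit per <div>, each
carrying the shared header title and the source file name. It uses only
the JDK's XML parser and no Lucene calls; ChapterUnits, splitIntoUnits,
the field names and the "n" attribute used as a chapter label are all
assumptions for the example. Each unit is what you would hand to Lucene
as a separate document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: builds one field map per <div>, ready to be indexed separately.
class ChapterUnits {

    static List<Map<String, String>> splitIntoUnits(String fileName, String xml) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse " + fileName, e);
        }
        // Common header info, copied into every unit (first <title> found, if any).
        NodeList titles = dom.getElementsByTagName("title");
        String title = titles.getLength() > 0 ? titles.item(0).getTextContent().trim() : "";

        List<Map<String, String>> units = new ArrayList<>();
        // Note: this matches nested <div>s too, so nested divisions yield overlapping units.
        NodeList divs = dom.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            Map<String, String> unit = new LinkedHashMap<>();
            unit.put("file", fileName);
            unit.put("title", title);
            unit.put("chapter", div.getAttribute("n")); // empty string if the attribute is absent
            unit.put("text", div.getTextContent().trim());
            units.add(unit);
        }
        return units;
    }

    public static void main(String[] args) {
        String xml = "<TEI><teiHeader><title>Moby Dick</title></teiHeader><text>"
                   + "<div n=\"1\">Call me Ishmael.</div>"
                   + "<div n=\"2\">I stuffed a shirt or two into my old carpet-bag.</div>"
                   + "</text></TEI>";
        for (Map<String, String> unit : splitIntoUnits("moby.xml", xml)) {
            System.out.println(unit);
        }
    }
}
```

  A hit on the "text" field then comes back with the chapter label and
file name attached, which is enough to locate the chapter.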

  Lucene is great at finding documents, but not quite as good at finding
things IN documents. The index contains pointers to the terms, but they
are pointers to a token in the parsed token stream, so to find a
character index into a file, you have to (I believe) run the text through
the tokenizer.
(But the Lucene API gives you access to everything, even if it's not
   simple or easy. I think there are some new features in the latest
   version that can make this sort of thing easier, but I haven't yet
   figured out how to use them.)
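
  To show the difference between a token position and a character
offset, here is a toy sketch. The regex tokenizer is only a stand-in for
a real analyzer, and TokenOffsets, positionsOf and offsetsOfPosition are
invented names, not Lucene API. The point is that getting from "token
number N" back to a character range means walking the token stream over
the original text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: a regex stands in for a real analyzer/tokenizer.
class TokenOffsets {

    // A token is a run of letters or digits; everything else is a separator.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    // Token positions (0-based) at which the term occurs, ignoring case.
    static List<Integer> positionsOf(String text, String term) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        for (int pos = 0; m.find(); pos++) {
            if (m.group().equalsIgnoreCase(term)) {
                positions.add(pos);
            }
        }
        return positions;
    }

    // Re-tokenizes the text to turn a token position into {start, end} character
    // offsets (end exclusive). Returns null if the text has fewer tokens.
    static int[] offsetsOfPosition(String text, int position) {
        Matcher m = TOKEN.matcher(text);
        for (int pos = 0; m.find(); pos++) {
            if (pos == position) {
                return new int[] { m.start(), m.end() };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago";
        for (int pos : positionsOf(text, "ishmael")) {
            int[] span = offsetsOfPosition(text, pos);
            System.out.println("token " + pos + " spans chars " + span[0] + ".." + span[1]
                    + ": " + text.substring(span[0], span[1]));
        }
    }
}
```

  The second pass over the text is the cost: a position by itself says
nothing about where the token sits in the file.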

-- Steve Majewski 

(Not much of a Lucene expert, but I've spent some time figuring out
   the difference between document indexers like Lucene and text
   indexers like xpat/opentext.)
