lucene-java-user mailing list archives

From "Roy Klein" <kl...@sitescape.com>
Subject RE: Indexing multiple instances of the same field for each document
Date Sun, 29 Feb 2004 03:05:42 GMT
Erik,

Indexing a single field in chunks solves a design problem I'm working
on. It's not the only way to do it, but it would certainly be the most
straightforward.  However, if this method makes phrase searching
unusable, I'll have to go another route.

Here's a brief example of the type of thing I'm trying to do:

I have a file that contains the words:

The quick brown fox jumped over the lazy dog.

I run that file through a utility that produces the following XML
document:
<document>
  <field name="wordposition1">
    <word>The</word>
  </field>
  <field name="wordposition2">
    <word>quick</word>
    <word>fast</word>
    <word>speedy</word>
  </field>
  <field name="wordposition3">
    <word>brown</word>
    <word>tan</word>
    <word>dark</word>
  </field>
  .
  .
  .

I parse that document (via the Digester) and add all the words from
each of the fields to one Lucene field: "contents".  The tricky part is
that I want each word position in the Lucene index to contain all the
words at that position.  I.e. word location 1 in the index contains
"The", word location 2: "quick", "fast", and "speedy", word location 3:
"brown", "tan", and "dark", etc.
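
To make the intended layout concrete, here is a minimal sketch in plain
Java. It makes no Lucene calls, and the class and method names are made
up for illustration. Each index position holds the set of words allowed
at that position, and a phrase matches when its words are found at
consecutive positions. Lucene's real phrase matching works on its own
index structures, so treat this only as a model of the behaviour I'm
after.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not Lucene code: element i of the list holds every
// word that is acceptable at word position i + 1 of the original text.
class StackedPositions {

    // Build the per-position word sets (lower-cased for matching).
    static List<Set<String>> index(String[][] wordsPerPosition) {
        List<Set<String>> positions = new ArrayList<>();
        for (String[] words : wordsPerPosition) {
            Set<String> atPosition = new HashSet<>();
            for (String word : words) {
                atPosition.add(word.toLowerCase());
            }
            positions.add(atPosition);
        }
        return positions;
    }

    // A phrase matches if its terms occur at consecutive positions,
    // starting anywhere in the document.
    static boolean matchesPhrase(List<Set<String>> positions, String phrase) {
        String[] terms = phrase.toLowerCase().trim().split("\\s+");
        for (int start = 0; start + terms.length <= positions.size(); start++) {
            boolean allFound = true;
            for (int i = 0; i < terms.length; i++) {
                if (!positions.get(start + i).contains(terms[i])) {
                    allFound = false;
                    break;
                }
            }
            if (allFound) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> doc = index(new String[][] {
            {"The"},
            {"quick", "fast", "speedy"},
            {"brown", "tan", "dark"}
        });
        String[] phrases = {"fast tan", "quick brown", "fast brown", "brown quick"};
        for (String phrase : phrases) {
            System.out.println(phrase + " -> " + matchesPhrase(doc, phrase));
        }
    }
}
```

In this model the first three phrases match and "brown quick" does not,
because the words must appear in position order.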

That way, all the following phrase queries will match this document:
	"fast tan"
	"quick brown"
	"fast brown"

I wrote a "TermAnalyzer" that adds all the words from a field into the
index at the same position (via setPositionIncrement(0)).  That way I
can simply add each set of words to the "contents" field, and they all
accumulate in that one field.  However, since Lucene is reversing the
order of the field instances I add, I can't match phrases.
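
The effect of the position increments, and of the ordering problem, can
be sketched in plain Java. This is not my TermAnalyzer and it makes no
Lucene calls; the names are hypothetical. It assumes the first word of
each chunk advances the position by 1 and every further word in the
chunk uses an increment of 0, so it lands on the same position.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of position increments, not Lucene code.
class PositionIncrementSketch {

    // Returns word -> position (positions start at 1). If a word occurs
    // in more than one chunk, only its last position is kept.
    static Map<String, Integer> assignPositions(String[][] chunks) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        int position = 0;
        for (String[] chunk : chunks) {
            for (int i = 0; i < chunk.length; i++) {
                // First word of a chunk moves on; the rest stay put.
                int increment = (i == 0) ? 1 : 0;
                position += increment;
                positions.put(chunk[i], position);
            }
        }
        return positions;
    }

    // Copy of the chunks in reverse order, mimicking field instances
    // that reach the index last-added-first.
    static String[][] reversed(String[][] chunks) {
        String[][] result = new String[chunks.length][];
        for (int i = 0; i < chunks.length; i++) {
            result[i] = chunks[chunks.length - 1 - i];
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] chunks = {
            {"The"},
            {"quick", "fast", "speedy"},
            {"brown", "tan", "dark"}
        };
        // {The=1, quick=2, fast=2, speedy=2, brown=3, tan=3, dark=3}
        System.out.println(assignPositions(chunks));
        // {brown=1, tan=1, dark=1, quick=2, fast=2, speedy=2, The=3}
        System.out.println(assignPositions(reversed(chunks)));
    }
}
```

In the forward order "brown" sits one position after "quick", so the
phrase "quick brown" can match. In the reversed order "brown" comes
before "quick", which is why the phrase queries fail.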


    Roy

(I just looked at the Document class, seems like it shouldn't be that
difficult to make the DocumentFieldList add new fields onto the end
instead of the beginning of the list.  I'll try to change it, and submit
a fix once I get it working.)


-----Original Message-----
From: Erik Hatcher [mailto:erik@ehatchersolutions.com] 
Sent: Friday, February 27, 2004 10:28 PM
To: Lucene Users List
Subject: Re: Indexing multiple instances of the same field for each
document


>I don't personally see why you would index text in chunks like this 
>rather than aggregating it all into a Reader or String, so certainly 
>this is an uncommon usage pattern.
>
>	Erik


