lucene-dev mailing list archives

From Chuck Williams
Subject Re: Breaking up text in fields or aggregate fields idea or field inheritance
Date Thu, 25 May 2006 16:49:39 GMT

I think you will find that multiple fields are beneficial.  However, a
simple answer to your question, and one that is needed even for your
examples of multiple values in a single field, is to use a position
increment gap.  See Analyzer.getPositionIncrementGap().

When you add multiple values to the same field, as in your example with
worker:john douglas and worker:davis raymond (two values for worker, each
with two tokens), the values get appended in the index, so it appears to
the index as if it were worker:john douglas davis raymond.  However, with
getPositionIncrementGap() you can make it look like worker:john douglas
<gap> davis raymond.  The <gap> will prevent "douglas davis" from
matching.  As long as the gap is larger than any slop you allow, you can
still support near queries (e.g., "john douglas"~3 to match "john spencer
douglas") without matching across values.
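
To make the effect of the gap concrete, here is a minimal plain-Java
sketch that simulates how token positions could be assigned when several
values are added to one field.  It is not Lucene code: the class
PositionGapDemo and its methods indexValues, matchesPhrase and
matchesNear are hypothetical names made up for illustration, the
tokenizer is just a lowercase whitespace split, and matchesNear is a
simplified ordered two-term check rather than Lucene's full slop
semantics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulation only: mimics position assignment for a multi-valued field.
// None of these names come from Lucene.
class PositionGapDemo {

    // Returns token -> positions for one field whose values are added in order.
    // 'gap' plays the role of the position increment gap between values.
    static Map<String, List<Integer>> indexValues(List<String> values, int gap) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int position = -1;      // position of the last token emitted so far
        boolean first = true;   // no gap is added before the first value
        for (String value : values) {
            String text = value.trim().toLowerCase();
            if (text.isEmpty()) continue;
            if (!first) position += gap;
            first = false;
            for (String token : text.split("\\s+")) {
                position++;
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            }
        }
        return postings;
    }

    // Exact phrase match: every term must sit at consecutive positions.
    static boolean matchesPhrase(Map<String, List<Integer>> postings, String phrase) {
        String[] terms = phrase.trim().toLowerCase().split("\\s+");
        List<Integer> starts = postings.get(terms[0]);
        if (starts == null) return false;
        for (int start : starts) {
            boolean all = true;
            for (int i = 1; i < terms.length; i++) {
                List<Integer> p = postings.get(terms[i]);
                if (p == null || !p.contains(start + i)) { all = false; break; }
            }
            if (all) return true;
        }
        return false;
    }

    // Simplified near match: 'a' before 'b' with at most 'slop' positions between.
    static boolean matchesNear(Map<String, List<Integer>> postings,
                               String a, String b, int slop) {
        List<Integer> pa = postings.get(a);
        List<Integer> pb = postings.get(b);
        if (pa == null || pb == null) return false;
        for (int x : pa) {
            for (int y : pb) {
                if (y > x && y - x - 1 <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> workers = Arrays.asList("john douglas", "davis raymond");
        Map<String, List<Integer>> noGap = indexValues(workers, 0);
        Map<String, List<Integer>> gapped = indexValues(workers, 100);
        System.out.println(matchesPhrase(noGap, "douglas davis"));        // true
        System.out.println(matchesPhrase(gapped, "douglas davis"));       // false
        System.out.println(matchesPhrase(gapped, "john douglas"));        // true
        System.out.println(matchesNear(gapped, "douglas", "davis", 3));   // false
    }
}
```

In this sketch a gap of 100 puts "davis" at position 102 instead of 2,
so neither the exact phrase nor a slop of 3 can bridge the two values,
while matches inside a single value are unaffected.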

Note that stored field values and indexed token positions are separate
things.  If you store the worker field, you will get back your separate
field values just as you indexed them.  From the index's perspective,
however, their tokens are appended into one sequence of positions.

Good luck,


JMA wrote on 05/24/2006 11:00 PM:
> Greetings,
> I am struggling with the following.  Say I want to use Lucene to search a
> corporate phone book where I have workers from a database:
> Workers
> -------
> John Douglas
> Davis Raymond
> My first thought was to create a field called workers and put all the names
> in it:
> worker: john douglas davis raymond
> This works ok except now a search for "douglas davis" returns a hit, when no
> such person exists. So to fix, create a workers field for every person:
> worker: john douglas
> worker: davis raymond
> Ok all set, because I can just prepend "worker:" to any search.  But now say
> I want to add a new category called 'manager':
> Managers
> --------
> Mark Smith
> Pearson Jones
> I can do the same thing, but now I have two field types, and the search
> input is getting more complex.  All I want to do is have one big field with
> a separator where I need it:
> content: john douglas \n davis raymond \n mark smith \n pearson jones
> But when I try this, the "\n" character is treated as a space, so phrase
> searches find people that do not exist.  Now I think I can make a custom
> filter to fix this, but is there an easy way to do this?  Is there a
> punctuation character that 'splits' text to avoid phrase search hits?  Is
> there an 'aggregate field' or 'field inheritance' function, such as
> content=<worker>+<manager>?
> Thanks in advance,
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
