lucene-dev mailing list archives

From "JMA" <mrj...@comcast.net>
Subject RE: Breaking up text in fields or aggregate fields idea or field inheritance
Date Sat, 27 May 2006 04:46:27 GMT

Thank you, Chuck,
Analyzer.getPositionIncrementGap() appears to be what I need.
JMA

-----Original Message-----
From: Chuck Williams [mailto:chuck@manawiz.com]
Sent: Thursday, May 25, 2006 12:50 PM
To: java-dev@lucene.apache.org
Subject: Re: Breaking up text in fields or aggregate fields idea or
field inheritance


JMA,

I think you will find that multiple fields are beneficial.  However, a
simple answer to your question, and one that is needed even for your
examples of multiple values in a single field, is to use a position
increment gap.  See Analyzer.getPositionIncrementGap().

When you use multiple values in the same field as in your example with
worker:john douglas and worker:davis raymond (two values for worker each
with two tokens), the values get appended in the index, so it appears to
the index as if it was worker:john douglas davis raymond.  However, with
getPositionIncrementGap() you can make it look like worker:john douglas
<gap> davis raymond.  The <gap> will prevent "douglas davis" from
matching.  By using an appropriately sized gap you can still support
near queries (e.g., "john douglas"~3 to match "john spencer douglas").
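
To make the position arithmetic concrete, here is a small stand-alone
sketch. It is not Lucene code: the class PositionGapSketch, its methods,
the whitespace/lowercase tokenization, and the simplified in-order
two-term phrase check are all assumptions made for illustration. It only
models the idea that each token advances the position by one and that a
gap is added between successive values of the same field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PositionGapSketch {

    /** One indexed term and the position it occupies within the field. */
    public static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Assigns positions to the tokens of several values of one field.
     * Every token advances the position by 1; before each value after the
     * first, the position is additionally advanced by `gap`.
     */
    public static List<Token> assignPositions(List<String> values, int gap) {
        List<Token> tokens = new ArrayList<>();
        int position = -1;
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                position += gap; // models the position increment gap
            }
            for (String term : values.get(i).trim().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                position++;
                tokens.add(new Token(term, position));
            }
        }
        return tokens;
    }

    /**
     * Simplified two-term phrase check: `second` must follow `first` with at
     * most `slop` positions skipped in between. Real sloppy phrase matching
     * is more general; this covers only the in-order case.
     */
    public static boolean phraseMatches(List<Token> tokens, String first, String second, int slop) {
        for (Token a : tokens) {
            if (!a.term.equals(first)) {
                continue;
            }
            for (Token b : tokens) {
                int skipped = b.position - a.position - 1;
                if (b.term.equals(second) && skipped >= 0 && skipped <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> workers = Arrays.asList("john douglas", "davis raymond");

        // No gap: "douglas" and "davis" are adjacent, so the false hit appears.
        System.out.println(phraseMatches(assignPositions(workers, 0), "douglas", "davis", 0));  // true

        // Gap of 10: even a slop of 3 cannot bridge the two values.
        System.out.println(phraseMatches(assignPositions(workers, 10), "douglas", "davis", 3)); // false

        // Within one value, a near query still works.
        List<String> one = Arrays.asList("john spencer douglas");
        System.out.println(phraseMatches(assignPositions(one, 10), "john", "douglas", 3));      // true
    }
}
```

In this model a phrase can cross from one value into the next only when
the slop is at least as large as the gap, so the gap should be chosen
larger than the largest slop you intend to allow.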

There is a difference between the stored fields and the positions of
tokens in the index.  If you store the workers field, you will get back
your separate field values just as you indexed them.  However, they are
appended from an index perspective.
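
The stored/indexed distinction can be pictured with another toy model.
Again, this is only a sketch under assumptions: the class
StoredVersusIndexedSketch and its methods are hypothetical, and the
lowercasing stands in for whatever analyzer is actually in use. Stored
values are kept exactly as they were added, while the indexed terms of
all values land in one continuous sequence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class StoredVersusIndexedSketch {

    // What you get back when the field is retrieved: one entry per value.
    private final List<String> storedValues = new ArrayList<>();

    // What phrase matching sees: one term sequence for the whole field.
    private final List<String> indexedTerms = new ArrayList<>();

    /** Adds one value of the field, recording it in both views. */
    public void addValue(String value) {
        storedValues.add(value);
        for (String term : value.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                indexedTerms.add(term);
            }
        }
    }

    public List<String> storedValues() {
        return storedValues;
    }

    public List<String> indexedTerms() {
        return indexedTerms;
    }

    public static void main(String[] args) {
        StoredVersusIndexedSketch field = new StoredVersusIndexedSketch();
        field.addValue("John Douglas");
        field.addValue("Davis Raymond");

        System.out.println(field.storedValues()); // [John Douglas, Davis Raymond]
        System.out.println(field.indexedTerms()); // [john, douglas, davis, raymond]
    }
}
```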

Good luck,

Chuck


JMA wrote on 05/24/2006 11:00 PM:
> Greetings,
>
> I am struggling with the following.  Say I want to use Lucene to search a
> corporate phone book where I have workers from a database:
>
> Workers
> -------
> John Douglas
> Davis Raymond
>
> My first thought was to create a field called workers and put all the names
> in it:
>
> worker: john douglas davis raymond
>
> This works ok except now a search for "douglas davis" returns a hit, when no
> such person exists. So to fix, create a workers field for every person:
>
> worker: john douglas
> worker: davis raymond
>
> Ok all set, because I can just prepend "worker:" to any search.  But now say
> I want to add a new category called 'manager':
>
> Managers
> --------
> Mark Smith
> Pearson Jones
>
> I can do the same thing, but now I have two field types, and the search
> input is getting more complex.  All I want to do is have one big field with
> a separator where I need it:
>
> content: john douglas \n davis raymond \n mark smith \n pearson jones
>
> But when I try this, the "\n" character is treated as a space, so phrase
> searches find people that do not exist.  Now I think I can make a custom
> filter to fix this, but is there an easy way to do this?  Is there a
> punctuation character that 'splits' text to avoid phrase search hits?  Is
> there an 'aggregate field' or 'field inheritance' function, such as
> content=<worker>+<manager>?
>
> Thanks in advance,
> JMA
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

