lucene-java-user mailing list archives

From Erick Erickson <>
Subject Re: Problem: Indexing and searching repeating groups of fields
Date Wed, 13 Jan 2010 21:17:59 GMT
One approach would be to do this with multi-valued fields. The
idea is to index each Experience group (your E entries) as a
separate value of the *same* Lucene field, with a position
increment gap (see Analyzer.getPositionIncrementGap) > 1.

For this example, assume getPositionIncrementGap returns 100.

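Then, for your documents you have something like:
// Store/Index options below are placeholders; pick whatever you need.
doc.add(new Field("experience", "java,5", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("experience", "C,2", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("experience", "PHP,3", Field.Store.NO, Field.Index.ANALYZED));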

Then you do proximity searches with a slop smaller than the gap
(here, < 100), so a match can never span two entries.

The trick is that the tokens above end up at (roughly) these
positions:
1 - java
2 - 5
102 - c
103 - 2
203 - php
204 - 3

Of course you need an analyzer that breaks your values up into
the right tokens (here, splitting on the comma) and that
overrides getPositionIncrementGap to return the gap.

Now a query (SpanNearQuery? Sloppy PhraseQuery? Your choice) of
the form experience:"java 5"~90 AND experience:"c 2"~90 should
return only Ra.
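
To make the position arithmetic concrete, here is a minimal
plain-Java sketch. It is not Lucene code: the class and method
names (PositionGapSketch, positions, near) are made up for
illustration, the comma-splitting tokenizer is an assumption, and
the exact offsets Lucene assigns may differ by one or so. It only
models the idea that each value starts "gap" positions after the
previous one, and that a proximity clause matches when two tokens
are close enough.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Plain-Java simulation of a position increment gap; this is not Lucene code. */
final class PositionGapSketch {

    /** Maps each token to its positions, adding `gap` between consecutive values. */
    static Map<String, List<Integer>> positions(List<String> values, int gap) {
        Map<String, List<Integer>> result = new HashMap<>();
        int next = 0;
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                next += gap; // jump ahead before every value after the first
            }
            // Assumed tokenization: lowercase, then split on commas.
            for (String token : values.get(i).toLowerCase(Locale.ROOT).split(",")) {
                result.computeIfAbsent(token.trim(), k -> new ArrayList<>()).add(next);
                next++;
            }
        }
        return result;
    }

    /** True if `second` follows `first` with at most `slop` positions in between. */
    static boolean near(Map<String, List<Integer>> doc, String first, String second, int slop) {
        for (int p : doc.getOrDefault(first, Collections.<Integer>emptyList())) {
            for (int q : doc.getOrDefault(second, Collections.<Integer>emptyList())) {
                int between = q - p - 1;
                if (between >= 0 && between <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Simulates "java 5"~90 AND "c 2"~90 against one document. */
    static boolean matchesJava5AndC2(List<String> experience, int gap) {
        Map<String, List<Integer>> doc = positions(experience, gap);
        return near(doc, "java", "5", 90) && near(doc, "c", "2", 90);
    }

    public static void main(String[] args) {
        List<String> ra = Arrays.asList("Java,5", "C,2", "PHP,3");
        List<String> rb = Arrays.asList("Java,2", "C,5", "VB,1");
        System.out.println("Ra matches: " + matchesJava5AndC2(ra, 100));
        System.out.println("Rb matches: " + matchesJava5AndC2(rb, 100));
    }
}
```

With a gap of 100, Ra matches and Rb does not. With a gap of 0,
"java" and "5" in Rb would sit close enough to satisfy the first
clause, which is the cross-entry match the gap is there to
prevent. Note the sketch only checks in-order pairs; a real
sloppy phrase query also tolerates some reordering.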


On Wed, Jan 13, 2010 at 3:59 PM, TJ Kolev <> wrote:

> Greetings,
> Let's assume I have to index and search "resume" documents. Two fields are
> defined: Language and Years. The fields are associated together in a group
> called Experience. A resume document may have 0 or more Experience groups:
> Ra{ E1(Java,5); E2(C,2); E3(PHP,3);}
> Rb{ E1(Java,2); E2(C,5); E3(VB,1);}
> How do I index such documents, and how do I search, so I can formulate a
> query like this "Resumes which have (Java,5) and (C,2)" and get back Ra. I
> know I can index multiple fields of the same name, and do "(Language:Java
> AND Years:5) AND (Language:C AND Years:2)", but in addition to Ra that
> would also return Rb, which I don't want. The problem here is that the
> "grouping" is lost. I can create fields with compound names (E1Language,
> E1Years, E2Language, E2Years, etc), but that helps me none, as I don't
> know which group to search. I'd also like to query for "(Language:Java AND
> Years:5) OR (Language:C AND Years:2)"
> This is a simplified example. Real documents may have 30 - 40 groups, each
> one with several fields. Putting all the fields in a group in one index
> field won't work as the numeric/date ones should be available for range
> searches.
> So far the way I see it is to do my own post processing on the results. The
> issue is that text fields will need to be untokenized, or otherwise it
> would be difficult to work on the result, and determine what matches.
> Thank you.
> tjk :)
