lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Problem: Indexing and searching repeating groups of fields
Date Fri, 15 Jan 2010 17:31:07 GMT
Well, a variant on the easy solution might. What would
happen if you indexed the un-split pairs in the same
field? I.e.

"java:5", "c:3", "php:2" all indexed as *single* tokens
in the *same* field?

But I think you should look at Digy's suggestion again.
6-10 fields is absolutely no problem at all. Documents
in Lucene can have any number of fields without
any relation to any other document. So having
"sparse" documents has no penalty (unlike a RDBMS).

In particular, with Digy's scheme you can answer questions
like "has more than 3 years of C experience" whereas
with my scheme it would be harder.


FWIW
Erick

On Fri, Jan 15, 2010 at 11:19 AM, TJ Kolev <tjkolev@gmail.com> wrote:

> Hi!
>
> I don't think the easy solution will work for me, because I'll have more
> than two fields in a group - perhaps 6 - 10.
>
> However using span queries looks very promising. I'll investigate that.
>
> I see setPositionIncrement() only on the Token object. Is there a way to
> set
> this when adding a field to the document, so that the first token get its
> position pushed away. I would prefer not to modify my analyzer if possible.
>
> Thank you.
> tjk :)
>
> On Wed, Jan 13, 2010 at 3:52 PM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > Ooooh, isn't that easier. You just prompted me to think
> > that you don't even have to do that, just index the pairs as single
> > tokens (KeywordAnalyzer? but watch out for no case folding)...
> >
> > On Wed, Jan 13, 2010 at 4:30 PM, Digy <digydigy@gmail.com> wrote:
> >
> > > How about using languages as fieldnames?
> > > Doc1(Ra):
> > >        Java:5
> > >        C:2
> > >        PHP:3
> > >
> > > Doc2(Rb)
> > >        Java:2
> > >        C:5
> > >        VB:1
> > >
> > > Query:Java:5 AND C:2
> > >
> > > DIGY
> > >
> > > -----Original Message-----
> > > From: TJ Kolev [mailto:tjkolev@gmail.com]
> > > Sent: Wednesday, January 13, 2010 11:00 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Problem: Indexing and searching repeating groups of fields
> > >
> > > Greetings,
> > >
> > > Let's assume I have to index and search "resume" documents. Two fields
> > are
> > > defined: Language and Years. The fields are associated together in a
> > group
> > > called Experience. A resume document may have 0 or more Experience
> > groups:
> > >
> > > Ra{ E1(Java,5); E2(C,2); E3(PHP,3);}
> > > Rb{ E1(Java,2); E2(C,5); E3(VB,1);}
> > >
> > > How do I index such documents, and how do I search, so I can formulate
> a
> > > query like this "Resumes which have (Java,5) and (C,2)" and get back
> Ra.
> > I
> > > know I can index multiple fields of the same name, and do
> "(Language:Java
> > > AND Years:5) AND (Language:C AND Years:2)", but in addition to Ra that
> > > would
> > > also return Rb, which I don't want. The problem here is that the
> > "grouping"
> > > is lost. I can create fields with compound names (E1Language, E1Years,
> > > E2Language, E2Years, etc), but that helps me none, as I don't know
> which
> > > group to search. I'd also like to query for "(Language:Java AND
> Years:5)
> > OR
> > > (Language:C AND Years:2)"
> > >
> > > This is a simplified example. Real documents may have 30 - 40 groups,
> > each
> > > one with several fields. Putting all the fields in a group in one index
> > > field won't work as the numeric/date ones should be available for range
> > > searchers.
> > >
> > > So far the way I see it is to do my own post processing on the results.
> > The
> > > issue is that text fields will need to be untokenized, or otherwise it
> > > would
> > > be difficult to work on the result, and determine what matches.
> > >
> > > Thank you.
> > > tjk :)
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message