lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From TJ Kolev <tjko...@gmail.com>
Subject Re: Problem: Indexing and searching repeating groups of fields
Date Fri, 22 Jan 2010 18:23:10 GMT
The issue is that the real world document has more than 2 fields. Me giving
an example of two was a bit misleading. I can't really "pair" them. Here's a
better example:

Resume_a
  Exp_1
    Language:Java, Years:5, Certification:Sun, Area:Web
  Exp_2
    Language:C, Years:3, Certification:None, Area:Desktop

And yes, I need to be able to query for ranges on the Years field. So for
now, I'll do as much in Lucene, and then post-filter on my own any extra
documents in the search result.

Thank you.
tjk :)

On Fri, Jan 15, 2010 at 11:31 AM, Erick Erickson <erickerickson@gmail.com>wrote:

> Well, a variant on the easy solution might. What would
> happen if you indexed the un-split pairs in the same
> field? I.e.
>
> "java:5", "c:3", "php:2" all indexed as *single* tokens
> in the *same* field?
>
> But I think you should look at Digy's suggestion again.
> 6-10 fields is absolutely no problem at all. Documents
> in Lucene can have any number of fields without
> any relation to any other document. So having
> "sparse" documents has no penalty (unlike a RDBMS).
>
> In particular, with Digy's scheme you can answer questions
> like "has more than 3 years of C experience" whereas
> with my scheme it would be harder.
>
>
> FWIW
> Erick
>
> On Fri, Jan 15, 2010 at 11:19 AM, TJ Kolev <tjkolev@gmail.com> wrote:
>
> > Hi!
> >
> > I don't think the easy solution will work for me, because I'll have more
> > than two fields in a group - perhaps 6 - 10.
> >
> > However using span queries looks very promising. I'll investigate that.
> >
> > I see setPositionIncrement() only on the Token object. Is there a way to
> > set
> > this when adding a field to the document, so that the first token get its
> > position pushed away. I would prefer not to modify my analyzer if
> possible.
> >
> > Thank you.
> > tjk :)
> >
> > On Wed, Jan 13, 2010 at 3:52 PM, Erick Erickson <erickerickson@gmail.com
> > >wrote:
> >
> > > Ooooh, isn't that easier. You just prompted me to think
> > > that you don't even have to do that, just index the pairs as single
> > > tokens (KeywordAnalyzer? but watch out for no case folding)...
> > >
> > > On Wed, Jan 13, 2010 at 4:30 PM, Digy <digydigy@gmail.com> wrote:
> > >
> > > > How about using languages as fieldnames?
> > > > Doc1(Ra):
> > > >        Java:5
> > > >        C:2
> > > >        PHP:3
> > > >
> > > > Doc2(Rb)
> > > >        Java:2
> > > >        C:5
> > > >        VB:1
> > > >
> > > > Query:Java:5 AND C:2
> > > >
> > > > DIGY
> > > >
> > > > -----Original Message-----
> > > > From: TJ Kolev [mailto:tjkolev@gmail.com]
> > > > Sent: Wednesday, January 13, 2010 11:00 PM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Problem: Indexing and searching repeating groups of fields
> > > >
> > > > Greetings,
> > > >
> > > > Let's assume I have to index and search "resume" documents. Two
> fields
> > > are
> > > > defined: Language and Years. The fields are associated together in a
> > > group
> > > > called Experience. A resume document may have 0 or more Experience
> > > groups:
> > > >
> > > > Ra{ E1(Java,5); E2(C,2); E3(PHP,3);}
> > > > Rb{ E1(Java,2); E2(C,5); E3(VB,1);}
> > > >
> > > > How do I index such documents, and how do I search, so I can
> formulate
> > a
> > > > query like this "Resumes which have (Java,5) and (C,2)" and get back
> > Ra.
> > > I
> > > > know I can index multiple fields of the same name, and do
> > "(Language:Java
> > > > AND Years:5) AND (Language:C AND Years:2)", but in addition to Ra
> that
> > > > would
> > > > also return Rb, which I don't want. The problem here is that the
> > > "grouping"
> > > > is lost. I can create fields with compound names (E1Language,
> E1Years,
> > > > E2Language, E2Years, etc), but that helps me none, as I don't know
> > which
> > > > group to search. I'd also like to query for "(Language:Java AND
> > Years:5)
> > > OR
> > > > (Language:C AND Years:2)"
> > > >
> > > > This is a simplified example. Real documents may have 30 - 40 groups,
> > > each
> > > > one with several fields. Putting all the fields in a group in one
> index
> > > > field won't work as the numeric/date ones should be available for
> range
> > > > searchers.
> > > >
> > > > So far the way I see it is to do my own post processing on the
> results.
> > > The
> > > > issue is that text fields will need to be untokenized, or otherwise
> it
> > > > would
> > > > be difficult to work on the result, and determine what matches.
> > > >
> > > > Thank you.
> > > > tjk :)
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message