lucene-dev mailing list archives

From Ype Kingma <ykin...@xs4all.nl>
Subject Re: too many hits - OutOfMemoryError; Low frequency terms
Date Fri, 30 May 2003 05:38:58 GMT
Doug,

On Thursday 29 May 2003 09:35, Doug Cutting wrote:
> Ype Kingma wrote:
> > The source of the problem is with the wildcards, so wouldn't it be better
> > to enforce a maximum number of expanded terms on these types of queries?
> > That would allow finer control than on 'top level'.
>
> That would provide more flexibility, but also more complexity.  There
> are three types of query that expand into BooleanQuery: FuzzyQuery,
> PrefixQuery, and WildcardQuery.  FuzzyQuery and WildcardQuery share a
> base class (MultiTermQuery), so they could be controlled by a single
> parameter, or one could make the parameter specific to each, or both.
> PrefixQuery would need a separate parameter.
>
> My inclination is to first add the top-level parameter to BooleanQuery
> that limits all of these.  Then, if finer-grained control is desired, we
> could add more parameters.
>
> > Also it would be possible to interact when the number of expanded
> > terms grows out of control: i.e. does the user really want
> > all these expanded terms, or would the user prefer to select
> > some of the expanded terms?
>
> That's an interesting thought.  What criteria would you use for
> selection?  One might limit the expansion to the more frequent terms.
> Do folks think that would be useful?  Is someone interested in
> implementing it?

I think the actual interaction needed for term selection by users
should be left out of Lucene. That leaves an API for selecting a subset
from a set of terms, which should be straightforward.
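
To make the idea concrete, here is a minimal sketch of what such a subset
selection hook could look like. Everything in it is hypothetical: the names
TermSubsetSketch, TermSelector and acceptedOnly do not exist in Lucene, and
plain strings stand in for real index terms. The application (or its user
interface) supplies the accepted terms; the hook only filters the expansion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these types are part of Lucene.
final class TermSubsetSketch {

    /** Callback that picks the terms to keep from an expansion. */
    interface TermSelector {
        List<String> select(List<String> expandedTerms);
    }

    /** Selector backed by the set of terms the application or user accepted. */
    static TermSelector acceptedOnly(final Set<String> accepted) {
        return new TermSelector() {
            public List<String> select(List<String> expandedTerms) {
                List<String> kept = new ArrayList<String>();
                for (String term : expandedTerms) {
                    if (accepted.contains(term)) {
                        kept.add(term); // preserves the expansion order
                    }
                }
                return kept;
            }
        };
    }

    public static void main(String[] args) {
        // Made-up expansion of a pattern such as "luc*".
        List<String> expanded = Arrays.asList("lucene", "lucent", "lucid", "luck");
        Set<String> accepted = new HashSet<String>(Arrays.asList("lucene", "lucid"));
        System.out.println(acceptedOnly(accepted).select(expanded));
    }
}
```

Keeping the selector as a separate interface leaves the interaction itself
outside the library, which is the split suggested above.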

Limiting to the more frequent terms depends very much on the user's
intention. E.g. when one needs high recall, it's not advisable.

> My hunch is that most queries that expand to large numbers of terms are
> not useful queries.  They're also very slow, and many (most?) users
> might not wait for results anyway.  I think it's better to get an error
> message up front indicating that the query is too vague.
>
> > I realize such interaction features are not needed for the average
> > user, so the only thing I'd like to have is that Lucene allows for
> > adding such features without needing to move Lucene functionality
> > through its class hierarchy.
>
> Lucene allows for adding whatever features folks wish to contribute!  So
> if you have a concrete idea for a term expansion API, or, better yet,
> an implementation, please send it.

The implicit good news for me is that you don't think of such features
as infeasible in combination with Lucene.
As I said, I haven't even looked at the actual details of term expansion.

I happen to be familiar with a query language in which user selection
of expanded terms is possible. It also requires prefix queries to have at
least 3 characters before the first truncation in order to limit the term
expansion.
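
A rule like that can be checked before any term expansion happens. The
sketch below is an illustration only: the class and method names are
hypothetical, and it assumes that '*' and '?' are the truncation
characters, as in Lucene's wildcard syntax. Patterns without any
truncation are accepted as they are.

```java
// Hypothetical sketch of a minimum-prefix-length check; not a Lucene API.
final class PrefixLengthSketch {

    /** Minimum number of literal characters before the first truncation. */
    static final int MIN_PREFIX_LENGTH = 3;

    /** Index of the first '*' or '?' in the pattern, or -1 if there is none. */
    static int firstWildcard(String pattern) {
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '*' || c == '?') {
                return i;
            }
        }
        return -1;
    }

    /** True if the pattern has no truncation or a long enough literal prefix. */
    static boolean isAcceptable(String pattern) {
        int idx = firstWildcard(pattern);
        return idx < 0 || idx >= MIN_PREFIX_LENGTH;
    }

    public static void main(String[] args) {
        String[] patterns = { "luc*", "lu*", "*ene", "lucene" };
        for (String p : patterns) {
            System.out.println(p + " acceptable: " + isAcceptable(p));
        }
    }
}
```

Rejecting a pattern at this point costs nothing, while expanding it first
and counting the terms afterwards already does most of the work.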


Your mention of selecting terms with high frequency brings me to
another point.
Terms that inadvertently have a low document frequency (spelling
errors, for example) get a higher term relevancy in query execution
than they actually deserve.
This problem surfaces when term expansion produces such terms.
Is there a way in Lucene to give all expanded terms the same relevancy?
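
To illustrate the effect, the sketch below uses a logarithmic inverse
document frequency of the form ln(numDocs / (docFreq + 1)) + 1. That
formula is an assumption chosen to resemble Lucene's default scoring, and
the document counts are made up. The sharedIdf method shows one possible
way to give all expanded terms the same weight, namely by computing a
single value from the largest document frequency in the group. It is a
hypothetical helper, not something Lucene provides.

```java
// Hypothetical sketch; the idf formula and all names are assumptions.
final class SharedIdfSketch {

    /** Assumed idf: rarer terms get a larger value. */
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** One idf for a whole group of terms, based on the most frequent one. */
    static double sharedIdf(int[] docFreqs, int numDocs) {
        int max = 0;
        for (int df : docFreqs) {
            max = Math.max(max, df);
        }
        return idf(max, numDocs);
    }

    public static void main(String[] args) {
        int numDocs = 10000;     // made-up index size
        int misspelledDf = 1;    // e.g. a spelling error found in one document
        int correctDf = 500;     // the correctly spelled term

        // The misspelling gets the larger per-term weight.
        System.out.println("idf misspelled: " + idf(misspelledDf, numDocs));
        System.out.println("idf correct:    " + idf(correctDf, numDocs));

        // With a shared value both terms are weighted alike.
        System.out.println("shared idf:     "
                + sharedIdf(new int[] { misspelledDf, correctDf }, numDocs));
    }
}
```

Basing the shared value on the largest document frequency keeps a rare
misspelling from dominating the score; other choices are possible.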

The problem also surfaces when two synonyms
have a very different document frequency and these synonyms
are used together in a query. In this situation one can compensate by
using appropriate query term weights, but a special OR operator
for synonyms might be preferable.

Kind regards,
Ype Kingma



