lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Justin <cry...@yahoo.com>
Subject Re: InverseWildcardQuery
Date Fri, 30 Jul 2010 20:31:22 GMT
> make both a stemmed field and an unstemmed field

While this approach is easy and would work, it means increasing the size of the 
index and reindexing every document. However, the information is already 
available in the existing field and runtime analysis is certainly faster than 
more disk I/O.

> Changing the index at query time

No... I'm sure I left out enough details to be confusing. The additional fields 
in our current workaround need to be managed periodically over the course of a 
document's lifetime, not at query time. To be honest, I'm not involved on the 
workaround. I'm just trying to avoid going down the wrong path.

> Query.html#rewrite

So it sounds like I should attempt to write my own InverseWildcardQuery class 
which overrides this method. Perhaps I can use WildcardQuery as an example? 
WildcardQuery has changed parent classes over recent versions and and we're 
unfortunately using the older at the moment.

I guess I had hoped others have handled this type of query before. Again, we're 
just trying to find all documents that have at least one field that  doesn't 
match the wildcard query.





----- Original Message ----
From: Steven A Rowe <sarowe@syr.edu>
To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
Sent: Fri, July 30, 2010 3:01:14 PM
Subject: RE: InverseWildcardQuery

> > you want what Lucene already does, but that's clearly not true
> 
> Hmmm, let's pretend that "contents" field in my example wasn't analyzed at 
>index
> time. The unstemmed form of terms will be indexed. But if I query with a 
>stemmed
> form or use QueryParser with the SnowballAnalyzer, I'm not going to get a 
>match.
> I could fix this situation by analyzing the indexed field at search time to
> match the query. I don't know that Lucene provides this opportunity and, as I
> said, maybe that's crazy.

I don't know of any facility in Lucene to re-analyze indexed content at search 
time.  This is an IR anti-pattern - it defeats the purpose of constructing the 
inverted index.

People generally decide prior to indexing what kind of queries they need to 
support, and then perform all required document analysis at index time.  To use 
your stemming example, if you need to be able to match against both stemmed and 
unstemmed forms, you would make both a stemmed field and an unstemmed field, and 
then construct corresponding sub-queries against each as required.

> > What do you mean when you say "rewriting"
> 
> "Our current path to solving our problem requires additional fields which need
> rewritten". I meant actually altering the document in the index. My desire has
> been to write a new Query class implementation whereas you mentioned query
> rewriting (isn't that accomplished as in my example by passing an Analyzer to
> QueryParser?)

Changing the index at query time, unless you have a tiny set of documents, 
sounds like a big mistake to me.  Again, extremely expensive.

Query rewriting: 
<http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Query.html#rewrite%28org.apache.lucene.index.IndexReader%29>


> > not discrete...  limited number of prefixes
> 
> So my document may have "myfield:A*foo*", "myfield:B*foo*",
> "myfield:A*dog*", and "myfield:D*cat*".
> 
> Or, to phrase differently, "myfield:[PREFIX][PATTERN]" may appear any
> number of times where PREFIX comes from the set { A, B, C, D, E, ... }.
> 
> This complexity is really a tangent of my question in order to avoid poor
> performance from WildcardQuery.

I still think you could make one field for each PREFIX, and then do whatever 
query you want on each of those fields.

If you need to support "contains" functionality (e.g. "A:*foo*"), you might want 
to look into ngram analysis at indexing time (and maybe also at query time, 
depending on the source of your queries):

<http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/analysis/ngram/NGramTokenFilter.html>


You would have to know the minimum and maximum length of the contained string 
you would want to query for.

Steve


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message