lucene-java-user mailing list archives

From "Erick Erickson"
Subject Re: I just don't get wildcards at all.
Date Sun, 09 Apr 2006 22:42:27 GMT
I think I'm almost there. Thanks, y'all have been a great help. Here are, I
hope, my last questions, and then I'll be all done beating on this....

Let's claim that all my clauses contain wildcards. What I *think* that means
is that I can't very well use a filter "the normal way" since searchers
require a query. And I don't want a query with a wildcard term.

So here's what I worked out today....

Postulate 3 clauses, all with wildcards. I'm returning the top 250 matches.
NOTE: the point of my tests is to see if I can break Lucene. So far I've
only been able to make it go slow. Very cool.

There are 1M docs. The index is 3G. I'm wildcarding over the field (not
stored) that, when stored, accounted for 70% of the size of the index (the
index was 10G when storing this field).

It's easy enough (and I'm still stunned at how fast it happens) to construct
a filter that aggregates the three clauses using WildcardTermEnum. I found
the MatchAllDocsQuery, and tried using that and passing it the filter I
constructed to the searcher, something like searcher.search(new MatchAllDocsQuery(), mynewfilter);
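
For anyone following along, here is a rough sketch of the filter-building idea: walk the terms, keep the ones that match the wildcard, and set a bit for every doc containing a matching term. It deliberately uses only java.util and is not Lucene code. The class name WildcardBits, the in-memory postings map, and the regex translation are all made up for illustration; in Lucene the term walk would be done by WildcardTermEnum against the real index.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical stand-in for a term dictionary: term -> ids of docs containing it.
final class WildcardBits {

    /** Translates a wildcard pattern ('*' = any run of chars, '?' = exactly one char) to a regex. */
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    /** Sets one bit per doc that contains at least one term matching the wildcard. */
    static BitSet matching(TreeMap<String, int[]> postings, String wildcard, int maxDoc) {
        Pattern pattern = toRegex(wildcard);
        BitSet bits = new BitSet(maxDoc);
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            if (pattern.matcher(entry.getKey()).matches()) {
                for (int doc : entry.getValue()) {
                    bits.set(doc);
                }
            }
        }
        return bits;
    }
}
```

One bitset per wildcard clause comes out of this, and the clauses can then be combined with the usual BitSet operations.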

This is painfully slow. So I got clever and just iterated through the bitset
in mynewfilter, pulling out the chunk of docs I wanted by putting the
following in a loop.

doc = indexreader.document(next set bit in the bitset);
<extract the relevant info and package it up>

This runs about 40 times faster.
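
The loop described above could look roughly like the sketch below. It only shows the bit-walking part with java.util.BitSet; the class name BitSetPager and the offset/limit parameters are assumptions for illustration, and the actual document fetch (indexreader.document(id)) appears only as a comment since it needs a live index.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper: pulls one "chunk" of doc ids out of a filter's bitset.
final class BitSetPager {

    /** Returns up to 'limit' set-bit positions, skipping the first 'offset' set bits. */
    static List<Integer> page(BitSet bits, int offset, int limit) {
        List<Integer> ids = new ArrayList<>();
        int skipped = 0;
        for (int id = bits.nextSetBit(0);
             id >= 0 && ids.size() < limit;
             id = bits.nextSetBit(id + 1)) {
            if (skipped < offset) {
                skipped++;      // still before the requested chunk
                continue;
            }
            ids.add(id);
            // With a real index this is where the doc would be loaded, e.g.
            // Document doc = indexreader.document(id);
        }
        return ids;
    }
}
```

Calling page(bits, 0, 250) would give the first 250 matching doc ids in index order, with no scoring involved.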

So here are my questions:
1> Did I misuse/misunderstand MatchAllDocs? What's it for anyway if not
2> Since all the terms have wildcards, I don't get ranking etc. anyway,
right? So I'm not losing anything by messing with the bitset myself, right?
3> I should create a BooleanQuery (or equivalent) on any terms that do NOT
have wildcards and pass the filter to the searcher in order to get some
rankings/relevance. And one expects that to perform substantially better
than using MatchAllDocs. Yes? No?
4> In my specific case, I don't believe caching filters helps me because the
chances of any of my search terms being the same across requests are small.
Given that, is there anything but convenience to using a ChainedFilter? In
my crude testing, I just declared another bitset, populated it and then
anded/ored/andnoted it to the bitset returned from my filter. Don't worry,
I'm going to chain them, I'm just checking my understanding.
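
The hand-rolled and/or/andnot combination mentioned in question 4 can be sketched with plain java.util.BitSet as below. The class name BitSetCombiner is made up for illustration and this is not how ChainedFilter is implemented; it only shows the same three set operations, done on copies so the input bitsets stay untouched.

```java
import java.util.BitSet;

// Hypothetical helper: combines two per-clause bitsets without mutating either one.
final class BitSetCombiner {

    /** Docs present in both a and b. */
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    /** Docs present in a, in b, or in both. */
    static BitSet or(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    /** Docs present in a but not in b. */
    static BitSet andNot(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.andNot(b);
        return result;
    }
}
```

Cloning costs one extra allocation per operation; mutating in place, as BitSet's own and/or/andNot do, is cheaper when the inputs are throwaway.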

Thanks again for all your patience. I'm more impressed than ever. My target
qps is 2. I'm hitting 11. And that's not even claiming the other 3 machines
that I can have if I want <GGGGG>.

Erick Erickson
