Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Content-Type: text/plain;
  charset="iso-8859-1"
From: Tatu Saloranta <tatu@hypermall.net>
Reply-To: tatu@hypermall.net
Organization: Linux-users missalie
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Subject: Re: Iterators for collecting Terms from Queries
Date: Fri, 14 Mar 2003 22:41:38 -0700
User-Agent: KMail/1.4.3
References: <HOJHPAGGICOEIAAA@mailcity.com>
In-Reply-To: <HOJHPAGGICOEIAAA@mailcity.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Message-Id: <200303142241.38464.tatu@hypermall.net>

On Friday 14 March 2003 10:17, none none wrote:
> hi Tatu,

First of all, thanks for the feedback. It's good to discuss various
implementation strategies. I think our approaches have different goals and
trade-offs, and I do admit mine is a bit over-engineered in a way (more on
that later on).

> i didn't really look at all the code, but at a first looks nice, i like the
> idea, but i have something to say. When we run a search, we collect all the
> terms already (in a previous email i mentioned something about that, see
> "rewrite" method), your idea is very elegant from a programming-style point
> of view, but i believe it slow down performance compared to mine. My idea

Yes, it's not optimized for performance. For most use cases I'm not sure this
is a big issue; both memory and performance overheads should be fairly small.
The exceptions would be prefix/wildcard queries that expand to a large number
of Terms, or a system handling a massive number of connections.
I would think that executing the search, reindexing (for real-time
updateable systems) and the actual term highlighting are more CPU-sensitive
than term collection.

That being said, yes, a more straightforward solution for collecting is more
efficient than my iterator-based solution (for actual terms; collecting base
terms is trivially fast in both cases). The only case where iterators might
be faster is if you only want to get some of the terms (for example, skip
Terms for a Range query); then the iterator only needs to fetch and store a
subset of Terms, whereas with just a single flag all Terms are always
collected.

> doesn't add any extra class, or the most it can is just one, a lot of
> changes are done in the lucene core, so the main difference can be seen as
> follow: My case
> 1) set a boolean value inside Query class to true: collectTerms(true);

(this is just a minor implementation suggestion)
I think perhaps this flag could be passed to the Query when executing it,
rather than stored in the Query object? It's not really a property of the
Query object but a property of the execution of the search (whether to keep
track of Terms so they can be requested from the Query, or returned along
with the search results).
This would require changes to the Query classes, however.
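
To make the suggestion above concrete, here is a rough sketch of passing the
flag per search call. Everything in it is hypothetical: PrefixSearchSketch,
SketchResult and the in-memory term dictionary are stand-ins I made up for
illustration, not Lucene classes, and a real change would have to go through
the Query/Searcher code.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical result holder: match count plus the terms, if requested.
final class SketchResult {
    final int matchedTerms;
    final List<String> terms;

    SketchResult(int matchedTerms, List<String> terms) {
        this.matchedTerms = matchedTerms;
        this.terms = terms;
    }
}

// Hypothetical searcher over a sorted in-memory term dictionary.
final class PrefixSearchSketch {
    private final SortedSet<String> dictionary;

    PrefixSearchSketch(Collection<String> indexedTerms) {
        this.dictionary = new TreeSet<>(indexedTerms);
    }

    // collectTerms belongs to this one execution, not to any query object.
    SketchResult search(String prefix, boolean collectTerms) {
        List<String> collected =
            collectTerms ? new ArrayList<String>() : Collections.<String>emptyList();
        int matched = 0;
        // Walk the dictionary from the prefix on; stop at the first non-match.
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matched++;
            if (collectTerms) {
                collected.add(term); // memory is spent only when asked for
            }
        }
        return new SketchResult(matched, collected);
    }
}
```

With this shape the same query can be run with and without term collection,
and callers that don't need the terms pay nothing for them.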

> 2) run the search
> 3) the searcher (reader actually) will call the method "rewrite" or some
> other methods, inside this method we check if the user want collect the
> terms testing the public boolean collectTerms. This is to avoid consumption
> of un-necessary memory by a user that doesn't need to collect the terms. 4)

Makes sense.

> terms are now in memory in different query classes, depends on the "user
> query", e.g.: a boolean query of 2 multitermquery. so the user can collect
> them the way he wants and use them. e.g.: i collect them and store in an
> array of Clauses, someone may just want to put in an array.

One problem I tried to solve was that the user shouldn't have to know the
structure of the Query classes (that's what the visitor pattern in general
solves), while still allowing access to some useful properties, such as the
optional/required/prohibited flag that's only available in BooleanClause, not
in the queries (the iterator keeps track of those flags and allows them to be
accessed as if they were properties of the queries themselves).
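
As an illustration of the idea, here is a simplified sketch. It is a plain
recursive walk rather than my actual iterator code, and all the Sketch*
classes, TermInContext and TermWalker are hypothetical stand-ins, not the
real Lucene classes. Each term is reported together with the flags of the
clause that immediately encloses it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the query classes; not the real Lucene API.
abstract class SketchQuery {}

final class SketchTermQuery extends SketchQuery {
    final String field;
    final String text;

    SketchTermQuery(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

final class SketchClause {
    final SketchQuery query;
    final boolean required;
    final boolean prohibited;

    SketchClause(SketchQuery query, boolean required, boolean prohibited) {
        this.query = query;
        this.required = required;
        this.prohibited = prohibited;
    }
}

final class SketchBooleanQuery extends SketchQuery {
    final List<SketchClause> clauses = new ArrayList<>();

    SketchBooleanQuery add(SketchQuery q, boolean required, boolean prohibited) {
        clauses.add(new SketchClause(q, required, prohibited));
        return this;
    }
}

// A term plus the flags of the clause that immediately encloses it.
final class TermInContext {
    final String field;
    final String text;
    final boolean required;
    final boolean prohibited;

    TermInContext(String field, String text, boolean required, boolean prohibited) {
        this.field = field;
        this.text = text;
        this.required = required;
        this.prohibited = prohibited;
    }
}

final class TermWalker {
    // The caller never inspects the query tree; it only sees terms with flags.
    static List<TermInContext> collect(SketchQuery root) {
        List<TermInContext> out = new ArrayList<>();
        walk(root, false, false, out);
        return out;
    }

    private static void walk(SketchQuery q, boolean required, boolean prohibited,
                             List<TermInContext> out) {
        if (q instanceof SketchTermQuery) {
            SketchTermQuery t = (SketchTermQuery) q;
            out.add(new TermInContext(t.field, t.text, required, prohibited));
        } else if (q instanceof SketchBooleanQuery) {
            // Hand each clause's flags down to the query it wraps.
            for (SketchClause c : ((SketchBooleanQuery) q).clauses) {
                walk(c.query, c.required, c.prohibited, out);
            }
        }
    }
}
```

A top-level term that is not inside any clause is reported as optional here
(neither required nor prohibited); that is just a choice made for the sketch.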

Note however that your method could be changed to do a similar recursive
traversal (if it doesn't already do that; I may have misunderstood your
explanation) for simple cases, so that the caller wouldn't have to know the
structure if it only needs the terms, not the context (i.e. need not know
which Term came from which query; sometimes this is needed, esp. with phrase
queries).

> 1) run the search

[you can also collect Terms before running the search if necessary, since in
any case they are calculated twice, as you point out]

> 2) the searcher collect all the terms because it needs due to produce
> rsearch results. 3) use your term collector to collect the terms. ATTN:
> this will do something that has been done already by the searcher! so, i
> think it is a waste of resources and time, and as result performance slows
> down.

Like I said above, while you are right that it does have overhead (computing
terms twice), I'm not sure how significant that would be in general, compared
to search, scoring etc.
It would be good to do some simple tests to see if I'm wrong here and Term
collection is actually a big part of execution time.

> I want underline that the time to put an object in an array and get it back
> is still the same, the difference is call the reader twice instead of one.

Yes, that is correct.

> I am not sure how much is the difference between the two cases, but for
> logic i think there has to be, even more when we dial with prefixquery or
> rangequery (that's where mainly we need the collector actually!). It may
> sounds weird, but i lost all the data on my pc, this monday, so i can't
> compare them, also i have to implement my idea again..

Ack. Sorry to hear that.

> Let me know what you think,
> Ciao.

One other thing I was thinking about was refactoring the Range and Prefix
queries to be MultiTermQuery-based. I think that should benefit both
solutions.

Plus, it seems to me that PhrasePrefixQuery perhaps should just be rewritten.
It acts very differently from the other queries, requiring the caller to
expand terms when it's being built. It seems like it should work more like
plain PrefixQuery and do the expansion only when being executed. Otherwise one
has to build a new Query for each search execution if the index has changed.
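
To show what I mean by expanding only at execution time, here is a small
sketch. LazyPhrasePrefix is a hypothetical class, and a sorted set of strings
stands in for the index's term dictionary; this is not how PhrasePrefixQuery
is implemented today. The object stores only the fixed leading words and the
trailing prefix, and expands the prefix against whatever terms it is given
when it runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical query object: keeps the unexpanded prefix, not the terms.
final class LazyPhrasePrefix {
    private final List<String> leadingWords;
    private final String lastPrefix;

    LazyPhrasePrefix(List<String> leadingWords, String lastPrefix) {
        this.leadingWords = new ArrayList<>(leadingWords);
        this.lastPrefix = lastPrefix;
    }

    // Expand at execution time against the index terms as they are right now.
    List<String> phrases(SortedSet<String> indexTerms) {
        String head = String.join(" ", leadingWords);
        List<String> out = new ArrayList<>();
        for (String term : indexTerms.tailSet(lastPrefix)) {
            if (!term.startsWith(lastPrefix)) {
                break; // sorted order: no later term can match the prefix
            }
            out.add(head.isEmpty() ? term : head + " " + term);
        }
        return out;
    }
}
```

The same object can then be reused after the index changes, since nothing
from the old term dictionary is baked into it.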

-+ Tatu +-

