From: Tatu Saloranta
Reply-To: tatu@hypermall.net
Organization: Linux-users missalie
To: "Lucene Developers List"
Subject: Re: Iterators for collecting Terms from Queries
Date: Fri, 14 Mar 2003 22:41:38 -0700
Message-Id: <200303142241.38464.tatu@hypermall.net>

On Friday 14 March 2003 10:17, none none wrote:
> hi Tatu,

First of all, thanks for the feedback. It's good to discuss various
implementation strategies.
I think there are different goals and trade-offs in our approaches. And
I do admit mine is a bit over-engineered in a way (more on that later
on).

> i didn't really look at all the code, but at a first looks nice, i like the
> idea, but i have something to say. When we run a search, we collect all the
> terms already (in a previous email i mentioned something about that, see
> "rewrite" method), your idea is very elegant from a programming-style point
> of view, but i believe it slow down performance compared to mine. My idea

Yes, it's not optimized for performance. For most use cases I'm not sure
this is a big issue; both memory and performance overheads should be
fairly small. The exceptions would be prefix/wildcard queries that
expand to a large number of Terms, or a system serving a massive number
of connections. I would think that executing the search, reindexing (for
real-time updateable systems) and the actual term highlighting are more
CPU-intensive than term collection.

That being said, yes, a more straightforward solution for collecting is
more efficient than my iterator-based solution (for actual terms;
collecting base terms is trivially fast in both cases). The only case
where iterators might be faster is if you only want to get some of the
terms (for example, skipping the Terms of a Range query); the iterator
then only needs to fetch and store a subset of the Terms (whereas with
just a single flag, all Terms are always collected).

> doesn't add any extra class, or the most it can is just one, a lot of
> changes are done in the lucene core, so the main difference can be seen as
> follow: My case
> 1) set a boolean value inside Query class to true: collectTerms(true);

(This is just a minor implementation suggestion.) I think perhaps this
flag could be passed to the Query when executing it, rather than stored
in the Query object?
This is because it's not really a property of the Query object but a
property of the execution of the search (whether to keep track of Terms
so they can be requested from the Query, or returned along with the
search results). This would require changes to the Query classes,
however.

> 2) run the search
> 3) the searcher (reader actually) will call the method "rewrite" or some
> other methods, inside this method we check if the user want collect the
> terms testing the public boolean collectTerms. This is to avoid consumption
> of un-necessary memory by a user that doesn't need to collect the terms. 4)

Makes sense.

> terms are now in memory in different query classes, depends on the "user
> query", e.g.: a boolean query of 2 multitermquery. so the user can collect
> them the way he wants and use them. e.g.: i collect them and store in an
> array of Clauses, someone may just want to put in an array.

One problem I tried to solve was that the user shouldn't have to know
the structure of the Query classes (that's what the visitor pattern in
general solves), while still having access to some useful properties,
such as the optional/required/prohibited flag that's only available in
BooleanClause, not in the queries (the iterator keeps track of those
flags and allows them to be accessed as if they were properties of the
queries themselves).

Note, however, that your method could be changed to do a similar
recursive traversal (if it doesn't already do that; I may have
misunderstood your explanation?) for simple cases, so that the caller
wouldn't have to know the structure if it only needs the terms, not the
context (i.e. it need not know which Term came from which query;
sometimes this is needed, especially with phrase queries).

> 1) run the search

[You can also collect Terms before running the search if necessary,
since in any case they are calculated twice, like you point out.]

> 2) the searcher collect all the terms because it needs due to produce
> rsearch results. 3) use your term collector to collect the terms.
> ATTN: this will do something that has been done already by the searcher! so, i
> think it is a waste of resources and time, and as result performance slows
> down.

Like I said above, while you are right that it does have overhead
(computing terms twice), I'm not sure how significant that would be in
general, compared to search, scoring etc. It would be good to do some
simple tests to see if I'm wrong here and Term collection is actually a
big part of the execution time.

> I want underline that the time to put an object in an array and get it back
> is still the same, the difference is call the reader twice instead of one.

Yes, that is correct.

> I am not sure how much is the difference between the two cases, but for
> logic i think there has to be, even more when we dial with prefixquery or
> rangequery (that's where mainly we need the collector actually!). It may
> sounds weird, but i lost all the data on my pc, this monday, so i can't
> compare them, also i have to implement my idea again..

Ack. Sorry to hear that.

> Let me know what you think,
> Ciao.

One other thing I was thinking about was refactoring the Range and
Prefix queries to be MultiTermQuery-based. I think that should benefit
both solutions.

Plus, it seems to me that PhrasePrefixQuery perhaps should just be
rewritten. It acts very differently from other queries, requiring the
caller to expand terms when it's being built. It seems like it perhaps
should work more like plain PrefixQuery, and do the expansion only when
being executed. Otherwise one has to build a new Query for each search
execution if the index has changed.

-+ Tatu +-
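
For illustration, here is a minimal sketch of the kind of recursive
traversal discussed above: a collector walks the query tree so the
caller need not know its structure, and carries along the
required/prohibited flags of the enclosing clause. The Query, TermQuery,
BooleanQuery, Clause and CollectedTerm types below are simplified,
hypothetical stand-ins written only for this example; they are not
Lucene's classes and not the code from either proposal. As a further
simplifying assumption, a term just takes the flags of its nearest
enclosing clause, and prefix, range and phrase queries are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for illustration; NOT Lucene's classes.
class TermCollectorSketch {

    /** Marker interface for all query types in this sketch. */
    interface Query {}

    /** Leaf query that matches a single term in a single field. */
    static final class TermQuery implements Query {
        final String field;
        final String text;
        TermQuery(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    /** One clause of a boolean query; only the clause knows the flags. */
    static final class Clause {
        final Query query;
        final boolean required;
        final boolean prohibited;
        Clause(Query query, boolean required, boolean prohibited) {
            this.query = query;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    /** Composite query made of flagged clauses. */
    static final class BooleanQuery implements Query {
        final List<Clause> clauses = new ArrayList<>();
        BooleanQuery add(Query query, boolean required, boolean prohibited) {
            clauses.add(new Clause(query, required, prohibited));
            return this;
        }
    }

    /** A term plus the flags inherited from its nearest enclosing clause. */
    static final class CollectedTerm {
        final String field;
        final String text;
        final boolean required;
        final boolean prohibited;
        CollectedTerm(String field, String text, boolean required, boolean prohibited) {
            this.field = field;
            this.text = text;
            this.required = required;
            this.prohibited = prohibited;
        }
        @Override
        public String toString() {
            String prefix = prohibited ? "-" : (required ? "+" : "");
            return prefix + field + ":" + text;
        }
    }

    /** Collects all terms under root without the caller inspecting the tree. */
    static List<CollectedTerm> collect(Query root) {
        List<CollectedTerm> out = new ArrayList<>();
        walk(root, false, false, out);
        return out;
    }

    private static void walk(Query q, boolean required, boolean prohibited,
                             List<CollectedTerm> out) {
        if (q instanceof TermQuery) {
            TermQuery t = (TermQuery) q;
            out.add(new CollectedTerm(t.field, t.text, required, prohibited));
        } else if (q instanceof BooleanQuery) {
            // Recurse into each clause, handing its flags down to the sub-query.
            for (Clause c : ((BooleanQuery) q).clauses) {
                walk(c.query, c.required, c.prohibited, out);
            }
        }
        // Other query types would need their own branches; omitted here.
    }
}
```

A caller would build a query tree, call TermCollectorSketch.collect(root),
and read the flags from each CollectedTerm. Whether the terms come from a
second pass like this or are recorded during the search itself is exactly
the trade-off debated in this thread.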