From lucene-dev-return-3285-qmlist-jakarta-archive-lucene-dev=nagoya.apache.org@jakarta.apache.org Thu Mar 20 01:46:54 2003
Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
List-Id: "Lucene Developers List"
To: lucene-dev@jakarta.apache.org
Date: Wed, 19 Mar 2003 17:46:43 -0800
From: "none none"
Reply-To: korfut@lycos.com
Subject: Re: Iterators for collecting Terms from Queries
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=_-=_-KDOJHEPBDIMBAEAA"

--=_-=_-KDOJHEPBDIMBAEAA
Content-Language: en
Content-Type: text/plain; charset=us-ascii
Content-Language: en
Content-Length: 4855
Content-Transfer-Encoding: 7bit

>Also, I started thinking that perhaps combining parts of two approaches would
>make lots of sense, improving performance of my solution, and generalizing
>your solution a bit? (ie. there'd be more support from core Lucene for
>implementing highlighters)
>
>I think having a term query collector (and matching iterator) makes sense.
>This way all Queries could be easily collected, along with some flags that
>BooleanClause has (optional etc). This is fairly easy to do, and doesn't have
>too many performance problems. Plus, caller need then not worry about actual
>Query tree structure, even if new Queries are added, it's Query's
>responsibility to add that one traversal method implementation.
>I also don't think this adds too much clutter to general code base.
>
>However, after queries are collected, it would be possible to access collected
>Terms using method you implemented, ie. having a method to access Terms
>collected during query execution. Caller also can choose to do additional
>query type dependant handling if/as necessary at this point (to access slop
>amongst other things?)
>So essentially one could traverse all Queries easily, and for each one ask for
>all the actual terms, without having to worry about exact query type, unless
>it wants to.
>
>Now, for some extra convenience, it would be easy to add simple iterators over
>actual terms. Since method for accessing collected Terms would be in base
>class, there would be no need to have half a dozen or more iterator classes I
>had to add to encapsulate collection process. But that would be optional
>thing to have.
>
>Finally, a method similar to accessing collected actual terms, but for
>accessing base term(s) would be useful.
>Since there can be up to 2 base terms
>(for range query), I'm not sure of method signature, but implementation
>should be easy to add (perhaps use signature similar to many JDK API methods,
>where an optional Collection is passed, into which store Term(s); if null is
>passed, a new Collection like ArrayList is created and returned).
>
>Does this make sense?

I am not 100% sure, but I think so. Could you give me an example, even pseudo-code?

>
>That could be, for big data sets, and prefix/wildcard queries that have lots
>of terms.
>
>Fortunately highlighting is only done for single documents at a time
>(usually?).

It is, but what does that have to do with a test? We can still run a test and see the difference in collecting terms; that was my suggestion. Maybe I didn't explain myself properly; English is my second language, by the way.

>Another way around the problem is to start from highlighted document, and
>build a (temporary) index, and actually execute query against just this
>single dummy (RAMDirectory based) index (that contains only terms from that
>one doc to be highlighter). It would be interesting to see if this might be
>more efficient way to find actual matched terms.
>

Do you mean indexing just one document and using the search itself to highlight it? It could work, especially with a pool of threads, but I believe it would involve too much I/O, file handles, etc. Or did you mean something else?

>I agree, generic (actual) term access/collecting method should be available
>from any Query (and actually same for base terms).
>

Take a look at the code and tell me what you think.

>Yes, I just happened to notice it in search package, didn't know such a thing
>existed as query parser has (currently?) no way to use it. :-)
>
>Of course, having PhrasePrefixQuery, one wonders if it'd make sense to
>have PhraseWildcardQuery as well.
>:-)
>(don't think implementing that would be any more difficult than prefix one,
>but both may be fairly inefficient in some cases)

Yes and no. The point of optimizing your solution is the case where we have a large amount of data. Running a query like that by itself would be slow and not useful, because it would potentially retrieve a lot of documents. But if there is another clause in the query, it could be very useful: the 2nd or 3rd clause would bring down the number of search results, and the wildcard phrase query would make the difference, a nice one!

>
>Thanks for your ideas and suggestions,

Thanks to you too!
Attached is a zip file that contains my prototype of the collector. I know it can be optimized and that it reflects my needs (see SlopeClause), but it is a good starting point. Also, the constructor with the boolean to skip the term collector is not there, because I always collect them; it could be added easily.
Take a look and tell me what you think.

Ciao

_____________________________________________________________
Get 25MB, POP3, Spam Filtering with LYCOS MAIL PLUS for $19.95/year.
http://login.mail.lycos.com/brandPage.shtml?pageId=plus&ref=lmtplus

--=_-=_-KDOJHEPBDIMBAEAA
Content-Type: application/zip; name="collector.zip"

[collector.zip: base64-encoded attachment data (3330 bytes) omitted]

--=_-=_-KDOJHEPBDIMBAEAA
Content-Type: text/plain; charset=us-ascii

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

--=_-=_-KDOJHEPBDIMBAEAA--
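
The quoted proposal in this message (one traversal method per Query type that reports each query with its BooleanClause-style flags, plus a base-term accessor that takes an optional Collection and allocates an ArrayList when null is passed) could be sketched roughly as below. This is only an illustration under assumptions: every class and method name here (SketchQuery, QueryCollector, traverse, getBaseTerms, and so on) is a hypothetical stand-in, not Lucene's API and not the contents of the attached collector.zip.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class TermCollectorSketch {

    /** Hypothetical stand-in for a Lucene Term: field name plus text. */
    static final class Term {
        final String field;
        final String text;
        Term(String field, String text) { this.field = field; this.text = text; }
        @Override public String toString() { return field + ":" + text; }
    }

    /** Hypothetical callback: receives each leaf query with its clause flags. */
    interface QueryCollector {
        void collect(SketchQuery query, boolean required, boolean prohibited);
    }

    /** Hypothetical base class; every query type supplies its own traversal. */
    abstract static class SketchQuery {
        abstract void traverse(QueryCollector collector, boolean required, boolean prohibited);
        abstract void addBaseTerms(Collection<Term> into);

        /** JDK-style optional argument: passing null means "allocate a list for me". */
        final Collection<Term> getBaseTerms(Collection<Term> into) {
            Collection<Term> target = (into != null) ? into : new ArrayList<Term>();
            addBaseTerms(target);
            return target;
        }
    }

    static final class SketchTermQuery extends SketchQuery {
        private final Term term;
        SketchTermQuery(Term term) { this.term = term; }
        @Override void traverse(QueryCollector c, boolean required, boolean prohibited) {
            c.collect(this, required, prohibited);
        }
        @Override void addBaseTerms(Collection<Term> into) { into.add(term); }
    }

    /** A range query has up to two base terms; either bound may be null. */
    static final class SketchRangeQuery extends SketchQuery {
        private final Term lower;
        private final Term upper;
        SketchRangeQuery(Term lower, Term upper) { this.lower = lower; this.upper = upper; }
        @Override void traverse(QueryCollector c, boolean required, boolean prohibited) {
            c.collect(this, required, prohibited);
        }
        @Override void addBaseTerms(Collection<Term> into) {
            if (lower != null) into.add(lower);
            if (upper != null) into.add(upper);
        }
    }

    static final class SketchBooleanQuery extends SketchQuery {
        private static final class Clause {
            final SketchQuery query;
            final boolean required;
            final boolean prohibited;
            Clause(SketchQuery q, boolean r, boolean p) { query = q; required = r; prohibited = p; }
        }
        private final List<Clause> clauses = new ArrayList<>();

        SketchBooleanQuery add(SketchQuery q, boolean required, boolean prohibited) {
            clauses.add(new Clause(q, required, prohibited));
            return this;
        }
        @Override void traverse(QueryCollector c, boolean required, boolean prohibited) {
            // Simplification: each clause's own flags replace the flags passed in.
            for (Clause cl : clauses) cl.query.traverse(c, cl.required, cl.prohibited);
        }
        @Override void addBaseTerms(Collection<Term> into) {
            for (Clause cl : clauses) cl.query.getBaseTerms(into);
        }
    }

    /** The caller walks the tree without knowing the concrete query types. */
    static List<String> describe(SketchQuery root) {
        List<String> out = new ArrayList<>();
        root.traverse((q, req, proh) ->
                out.add((proh ? "-" : req ? "+" : "") + q.getBaseTerms(null)), false, false);
        return out;
    }

    /** Example tree: a required term clause plus an optional range clause. */
    static SketchQuery sampleQuery() {
        return new SketchBooleanQuery()
                .add(new SketchTermQuery(new Term("body", "lucene")), true, false)
                .add(new SketchRangeQuery(new Term("date", "20030101"),
                                          new Term("date", "20030320")), false, false);
    }

    public static void main(String[] args) {
        System.out.println(describe(sampleQuery()));
    }
}
```

In this sketch a new query type only has to implement traverse and addBaseTerms; callers such as a highlighter keep using describe-style code unchanged, and can still inspect the concrete query type inside the callback when they need something specific such as slop.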