lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Richard Krenek <richard.kre...@gmail.com>
Subject Re: Weird time results doing wildcard queries
Date Thu, 08 Sep 2005 23:05:18 GMT
I did the change and here are the results:

Query (default field is COMP_PART_NUMBER): 2444*
Query: COMP_PART_NUMBER:2444*
Query Time: 328 ms - time for query to run.
383 total matching documents.
Cycle Time: 141 ms - time to run through hits.


Query (default field is COMP_PART_NUMBER): *91822*
Query: COMP_PART_NUMBER:*91822*
Query Time: 9375 ms
251 total matching documents.
Cycle Time: 20094 ms

On 9/8/05, Chris Hostetter <hossman_lucene@fucit.org> wrote:
> 
> : is if the query starts with a wildcard. In the case where it starts with 
> a
> : wildcard, lucene has no option but to linearly go over every term in the
> : index to see if it matches your pattern. It must visit every singe term 
> in
> 
> That would explain why the search itself takes a while, but not why
> accessing the hits after the call to search would take a while. note
> where the timing code is in his example.
> 
> There are two possible explanations i can think of...
> 
> : >>It seems when I have a wilcard query like *abcd* vs weqrew*, the 
> *abcd*
> : query will always take longer to retrieve the documents even if they are 
> of
> : simular result sizes. We are talking a big difference 1 second vs 16. It 
> is
> 
> 1) How similar, and how many? ... If i remember correctly, the Hits
> constructor does some work to pre-fetch the first 100 results. So if you
> are iterating over all of the results, the first 100 are free. On the
> 101st iteration the prefetching method is called again to fetch N more (i
> don't remember what N is off the top of my head.
> 
> what this means is that if you are only timing the method calls on Hits,
> then the first 100 documents are free -- if one wildcard search returns 99
> results, and the other returns 105 results, those numbers may not seemthat
> different, but in the first case the code you are timing is accessing
> nothing but memory, and in the second case it has to read from disk.
> 
> 2) The second idea also requires you to answer a question" the number of
> results returned for each query might be identicle, but are the
> results themselves identical?
> 
> I'm guessing that either the documents from the "slow" case are either
> much bigger (ie: larger stored fields) or the results from the fast case
> are all documents that are "near" eachother on disk, so fetching back all
> of hte stored fields would require less IO then if the results are stored
> farther apart. If i remember correctly, the stored fields of documents
> are kept in order that the documents are added, so hypothetically, the
> query you did was on a "name" field, and the documents were added to the
> index in alphabetical order by "name" then by definition the results for
> "weqrew*' will all be close together, while the results for "*abcd*" will
> be spread out throughout the index.
> 
> an easy way to disprove that 2nd theory would be to change your timing
> code to this and see what happens...
> 
> 
> Hits hits = searcher.search(query);
> long startTime = System.currentTimeMillis();
> for (int i = 0; i < hits.length(); i++) {
> int id = hits.id(i);
> }
> 
> 
> 
> -Hoss
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message