Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of erickerickson@gmail.com
 designates 74.125.44.29 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=message-id:date:from:to:subject:in-reply-to:mime-version
         :content-type:references;
        b=dr8kuzwxxu1XSKY58QZVFKiDKpBOFG5Jbhk5iKY6gdyScvCsRb1Buop45hYdy6Swk/
         D4QKi7XLGDMDYf3Oj0hlKOanMLypBPtXXnw7dIsTGPLFczUodS4w5iT6OCadCIMXXMk9
         ixtNkeTpk41sqBSHKtdjAlFt6udl5+nzc82vA=
Message-ID: <359a92830812220554m79e2e5s9636949c405375ba@mail.gmail.com>
Date: Mon, 22 Dec 2008 08:54:46 -0500
From: "Erick Erickson" <erickerickson@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: BooleanQuery Performance Help
In-Reply-To: <494F0A00.3090103@tachyontech.net>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_70043_25513824.1229954086889"
References: <494CFFEF.50206@tachyontech.net>
	 <359a92830812200730j576c2df8p4a434c71a1d173c2@mail.gmail.com>
	 <494F0A00.3090103@tachyontech.net>

------=_Part_70043_25513824.1229954086889
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Well, you haven't run afoul of the usual suspects, that's pretty
clean timing code. I'm afraid I'll have to defer to the people who
know the internals of Lucene.....

Best
Erick

On Sun, Dec 21, 2008 at 10:31 PM, Prafulla Kiran
<prafulla@tachyontech.net>wrote:

> Hi,
>
> Here's the code which I am using to time the query:
>
> long startTime = System.currentTimeMillis();
> TopDocCollector collector = new TopDocCollector(10);
> is.search(query,collector);
> ScoreDoc[] hits = collector.topDocs().scoreDocs;
> long endTime = System.currentTimeMillis();
>
> Most of the clauses which I removed, had very few unique terms: say 2 or 3.
> I have started taking the timings after I've fired the warmup queries.
> Also, I am not doing any kind of sorting or iterating through the hits
> object.
>
> Regards,
> Prafulla
>
> Erick Erickson wrote:
>
>> What specifically are you measuring when you time the queries? I've been
>> mislead by including in my measurement say, creating the response. I
>> realize
>> that throughput includes assembling the response, but the solution is
>> different
>> depending upon whether it's the actual search or what you do with the
>> results that takes the time.
>>
>> Are you doing any sorting?
>>
>> Are you using a Hits object and iterating on it? This gets very
>> inefficient.
>>
>> You might post your code where you time the query. Also what do the "few
>> specific clauses" you remove look like? Do they have anything to do with
>> time?
>> How many unique values do the fields have that you remove to see the
>> improvement?
>>
>> Do you start your timings *after* you've fired up a few warmup queries?
>>
>> Best
>> Erick
>>
>> On Sat, Dec 20, 2008 at 9:23 AM, Prafulla Kiran <prafulla@tachyontech.net
>> >wrote:
>>
>>
>>
>>> Hi Everyone,
>>>
>>> I have an index of relatively small size (400mb) , containing roughly 0.7
>>> million documents. The index is actually a copy of an existing database
>>> table. Hence, most of my queries are of the form
>>>
>>> " +field1:value1 +field2:value2 +field3:value3..... ~20 fields"
>>>
>>> I have been running performance tests using this query. Strangely, I
>>> noticed that if I remove some specific clauses... I get a performance
>>> improvement of atleast 5 times. Here are the numbers and examples, so
>>> that I
>>> could be more precise
>>>
>>> 1) Complete Query: 90 requests per second using 10 threads
>>> 2) If I remove few specific clauses : 500 requests per second using 10
>>> threads
>>> 3) If I form a new query using only 2 clauses from the set of removed
>>> clauses -> 100 requests per second using 10 threads
>>>
>>> Now, some of these specific clauses are such that they match around half
>>> of
>>> the entire document set.  Also, note that I need all the query terms to
>>> be
>>> present in the documents retrieved. My target is to obtain 300 requests
>>> per
>>> second with the given query (20 clauses). It includes 2 range queries.
>>> However, I am unable to get 300 rps unless I remove some of the clauses
>>> (which include these range queries) .
>>> I have tried using filters without any significant improvement in
>>> performance. Also, I have more than enough RAM, so I am using the
>>> RAMDirectory to read the index. I have optimized my index before
>>> searching.
>>> All the tests have been warmed for 5 seconds ( the test duration is 10
>>> seconds).
>>>
>>> My first question is, is this kind of decrease in performance expected as
>>> the number of clauses shoot up ? Using a single clause out of these 20 ,
>>> I
>>> was able to get 2000 requests per second!
>>> Could someone please guide me if there are any other ways in which I can
>>> obtain improvement in performance ?
>>> Particularly, I am interested to know more about what further caching
>>> could
>>> be done apart from the default caching which lucene does.
>>>
>>> Thanks In Advance,
>>> Prafulla
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>
>>  ------------------------------------------------------------------------
>>
>>
>> No virus found in this incoming message.
>> Checked by AVG - http://www.avg.com Version: 8.0.176 / Virus Database:
>> 270.9.19/1857 - Release Date: 12/19/2008 10:09 AM
>>
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

------=_Part_70043_25513824.1229954086889--