Date: Sun, 17 Mar 2013 10:35:59 +0800
Subject: Re: potential query performance issue
From: Lin Ma
To: lukai, java-user@lucene.apache.org

Thanks Lukai for the detailed reply,

- "If your query is too long, it might not be very efficient in the
  query evaluation process." -- how does Lucene query evaluation work?
  Is there any document to refer to?
- "you can read out the payload of the matched term you have stored" --
  what do you mean by the payload of the matched term? Could you show me
  an example?

regards,
Lin

On Sun, Mar 17, 2013 at 7:13 AM, lukai wrote:

> On Fri, Mar 15, 2013 at 10:02 PM, Lin Ma wrote:
>
>> Hi Lukai, thanks for the detailed reply.
>>
>> Some more comments,
>>
>> - "You can try score by payload" -- what do you mean by score by
>>   payload? I would appreciate it if you could provide a bit more
>>   detail;
>
> Write your own query/scorer; you can read out the payload of the
> matched term you have stored. You can implement your dot product
> functionality in the score function of your scorer.
>
>> - "Lucene focuses on search for the default implementation" -- what
>>   do you mean by default?
>
> I mean the default query parser and query types are designed for
> search applications. If your query is too long, it might not be very
> efficient in the query evaluation process.
>
>> - "For your requirement, you can do some query re-write process to
>>   reduce your query size" -- I think by query re-write you mean
>>   rewriting "iPhone 5", "iPhone 4S" to "iPhone" to reduce the number
>>   of queries? Or do you mean something else?
>
> Query re-write really depends on your application. You can
> reduce/expand your query or even change the query type according to
> your needs.
>
>> regards,
>> Lin
>>
>> On Sat, Mar 16, 2013 at 11:55 AM, lukai wrote:
>>
>>> Different applications have different requirements and solve
>>> different problems. Lucene focuses on search for the default
>>> implementation. For your requirement, you can do some query re-write
>>> process to reduce your query size if you still want to leverage the
>>> search functionality. If you just want to customize your feature
>>> value and do a simple dot product calculation, you can try score by
>>> payload. It might not be very efficient, because you still need to
>>> convert your query into some specified Lucene query type. But you
>>> can still leverage the existing index structure, NRT, and the
>>> distributed search support in Solr.
>>>
>>> When you refer to performance, it really depends on the document
>>> size and the term distribution of your corpus. If you have enough
>>> machines, you can just try reducing the number of documents per
>>> instance and distributing your search to achieve a better
>>> performance goal.
>>>
>>> On Fri, Mar 15, 2013 at 7:36 PM, Lin Ma wrote:
>>>
>>>> Hi lukai, thanks for the reply. Do you mean WAND is a way to
>>>> resolve this issue? For "native support", do you mean there is no
>>>> built-in (existing, ready to use, externally open source) module in
>>>> Lucene to implement WAND? If so, the performance will really be
>>>> bad.
>>>>
>>>> regards,
>>>> Lin
>>>>
>>>> On Sat, Mar 16, 2013 at 2:49 AM, lukai wrote:
>>>>
>>>>> I have implemented WAND with Solr/Lucene. So far there is no
>>>>> performance issue. There is no native support for this
>>>>> functionality; you need to implement it yourself.
>>>>>
>>>>> On Fri, Mar 15, 2013 at 10:09 AM, Lin Ma wrote:
>>>>>
>>>>> > Hello guys,
>>>>> >
>>>>> > Suppose I have one million documents, and each document has
>>>>> > hundreds of features. A given query also has hundreds of
>>>>> > features. I want to fetch the top 1000 most relevant documents
>>>>> > by the dot product of the query and document features
>>>>> > (query/document features are in the same feature space).
>>>>> >
>>>>> > I am not sure how Lucene implements this internally. If we have
>>>>> > to go through all one million documents to compute the dot
>>>>> > product with the query, then I am concerned about the
>>>>> > performance. I would appreciate it if anyone could confirm (1)
>>>>> > how Lucene works internally for this use case and (2) any smart
>>>>> > ideas to improve query efficiency when selecting the top 1000
>>>>> > documents.
>>>>> >
>>>>> > thanks in advance,
>>>>> > Lin
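
For readers following the thread, the original question (top-k documents by dot product without touching every document) can be illustrated with a small sketch. This is plain Python, not Lucene code, and it is not WAND: it only shows the general inverted-index idea that documents sharing no feature with the query are never scored. The names `build_index` and `top_k_dot_product` and the sample data are hypothetical and chosen for illustration only.

```python
import heapq
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: feature -> list of (doc_id, weight).

    `docs` maps doc_id -> {feature: weight}. Hypothetical helper, not a Lucene API.
    """
    index = defaultdict(list)
    for doc_id, features in docs.items():
        for feature, weight in features.items():
            index[feature].append((doc_id, weight))
    return index


def top_k_dot_product(index, query, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first.

    Scores are accumulated one query feature at a time, so only documents
    that share at least one feature with the query are ever visited.
    """
    scores = defaultdict(float)
    for feature, q_weight in query.items():
        for doc_id, d_weight in index.get(feature, ()):
            scores[doc_id] += q_weight * d_weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Tiny made-up corpus: d3 shares no feature with the query and is never scored.
docs = {
    "d1": {"a": 1.0, "b": 2.0},
    "d2": {"b": 1.0, "c": 3.0},
    "d3": {"d": 5.0},
}
query = {"a": 2.0, "b": 1.0, "c": 0.5}

index = build_index(docs)
print(top_k_dot_product(index, query, k=2))
```

The cost here is proportional to the total length of the posting lists for the query's features rather than to the corpus size. A WAND-style evaluator, as mentioned in the thread, would go further by using per-feature score upper bounds to skip documents that cannot enter the top k; that pruning step is not shown in this sketch.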