From: Marcel Bruch
Date: Fri, 11 Jun 2010 15:35:07 +0200
To: java-user@lucene.apache.org
Subject: Using Lucene with a rather simplistic scoring system?

Hi!

We are working on an experimental code-search engine that helps users find example code snippets based on what a developer has already typed in her editor. Our “homemade search engine” produces some cool results, but its performance is somewhat limited :-) We are therefore evaluating whether Lucene can solve our performance issues. However, we are not familiar with Lucene, and I wonder whether some of you could help me find out whether Lucene fits our problem well. Thanks in advance for your comments.

The situation is as follows. For each source code file we extract some code properties, such as which types are used in the code, which methods are overridden, and which methods are called inside a method body. For each source code file we get a JSON structure similar to this:
{
    "class" : my.ExampleClass
    "extends" : the.SuperClass
    "overrides" :
        - the.SuperClass.method1()
        - the.SuperClass.method2()
    "used types":
        - a.Type1
        - a.Type2
        - ...
    "used methods":
        - a.Type1.method32()
        - a.Type1.method23()
        - ...
    <few more things>
}
The scoring function we use is rather simplistic. Given a query (which looks much like the document above), we compute for each feature (e.g. “used methods”, “used types”, “overrides”) a simple match score: the percentage of overlap between the query document's feature and the db document's feature. We then multiply each feature score f_i by an individual feature weight w_i and sum the products into one overall score.
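
The scoring just described can be sketched in plain Java, independent of Lucene. This is only an illustration under stated assumptions, not the original implementation: "percentage of overlap" is read here as |query ∩ doc| / |query| (the message does not say which denominator is used), an empty query feature is assumed to score 0, and the class name, method names, and weights are hypothetical.

```java
import java.util.Map;
import java.util.Set;

// Sketch of the weighted feature-overlap score described above.
// ASSUMPTION: "percentage of overlap" means |query ∩ doc| / |query|;
// the original message does not pin down the denominator.
class OverlapScoring {

    // f_i: fraction of the query's values for one feature that the document also has.
    static double overlap(Set<String> query, Set<String> doc) {
        if (query.isEmpty()) {
            return 0.0; // assumption: an empty query feature contributes nothing
        }
        long shared = query.stream().filter(doc::contains).count();
        return (double) shared / query.size();
    }

    // Overall score: sum over all weighted features i of w_i * f_i.
    static double score(Map<String, Set<String>> query,
                        Map<String, Set<String>> doc,
                        Map<String, Double> weights) {
        double total = 0.0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            // A feature missing on either side is treated as an empty set.
            Set<String> q = query.getOrDefault(w.getKey(), Set.of());
            Set<String> d = doc.getOrDefault(w.getKey(), Set.of());
            total += w.getValue() * overlap(q, d);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc = Map.of(
            "used types", Set.of("a.Type1", "a.Type2"),
            "overrides", Set.of("the.SuperClass.method1()", "the.SuperClass.method2()"));
        Map<String, Set<String>> query = Map.of(
            "used types", Set.of("a.Type1", "a.Type3"),
            "overrides", Set.of("the.SuperClass.method1()"));
        // Hypothetical weights, chosen only for illustration.
        Map<String, Double> weights = Map.of("used types", 0.5, "overrides", 2.0);
        System.out.println(score(query, doc, weights));
    }
}
```

In this example the query shares one of its two used types with the document (f = 0.5) and its single overridden method (f = 1.0), so the weighted sum is 0.5 * 0.5 + 2.0 * 1.0 = 2.25. Switching to a different overlap definition, such as Jaccard similarity, would only change the `overlap` method.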

My questions are: Does it make sense to use Lucene in this setup or, put differently, can I implement that scoring scheme with Lucene easily? What would such a solution look like? Just by subclassing Scorer?

Many thanks in advance for any advice.

All the best,
Marcel
