lucene-java-user mailing list archives

From: Ian Lea
Subject: Re: Using Lucene with a rather simplistic scoring system?
Date: Fri, 11 Jun 2010 13:51:55 GMT

Others can comment on how to customize scoring, but I wonder if
Lucene's default scoring might do the job as is.

If you've got a document in the index (a simple translation from your JSON):

class: my.ExampleClass
extends: the.SuperClass
overrides: the.SuperClass.method1() the.SuperClass.method2()
used types: a.Type1 a.Type2
used methods: a.Type1.method32() a.Type1.method23()

then the sample queries

"extends: the.SuperClass"
"extends: the.SuperClass overrides: the.SuperClass.method1()"

would both match, but the second one should score higher because it
matches more terms.  The weighting could be done with boosts, e.g. if
you care more about overrides:

"extends: the.SuperClass overrides: the.SuperClass.method1()^2"
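
To make the intuition concrete, here is a toy model in plain Java. It is
not Lucene's actual scoring formula (which also involves term statistics
and normalization); it only assumes that every matched query clause adds
its boost to the score. The class name BoostedMatchToy and the
"field:value" string encoding are made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class BoostedMatchToy {

    // Sum the boosts of all query clauses that occur in the document.
    // Toy assumption: a clause either matches (adds its boost) or does not (adds 0).
    static double score(Map<String, Double> boostedClauses, Set<String> docTerms) {
        double total = 0.0;
        for (Map.Entry<String, Double> clause : boostedClauses.entrySet()) {
            if (docTerms.contains(clause.getKey())) {
                total += clause.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // The example document, flattened to hypothetical "field:value" terms.
        Set<String> doc = Set.of(
                "class:my.ExampleClass",
                "extends:the.SuperClass",
                "overrides:the.SuperClass.method1()",
                "overrides:the.SuperClass.method2()");

        Map<String, Double> q1 = new LinkedHashMap<>();
        q1.put("extends:the.SuperClass", 1.0);

        Map<String, Double> q2 = new LinkedHashMap<>(q1);
        q2.put("overrides:the.SuperClass.method1()", 1.0);

        Map<String, Double> q3 = new LinkedHashMap<>(q1);
        q3.put("overrides:the.SuperClass.method1()", 2.0); // the ^2 boost

        System.out.println(score(q1, doc)); // prints 1.0
        System.out.println(score(q2, doc)); // prints 2.0
        System.out.println(score(q3, doc)); // prints 3.0
    }
}
```

In this model the second query outranks the first simply because it
matches one more clause, and the boost widens the gap further.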

Whatever you do, you'll need to play with analyzers if you need to keep
the dots, brackets and case-sensitivity.  You will also need to make sure
you've got the right must/should/and/or logic in place.
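
A small plain-Java sketch of why the analyzer choice matters. Neither
method below is a Lucene API; they only mimic two tokenization styles.
Splitting on whitespace alone (roughly what Lucene's WhitespaceAnalyzer
does) keeps a qualified method name as one case-sensitive token, while a
hypothetical tokenizer that lowercases and splits on punctuation breaks
it apart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class TokenizeToy {

    // Split on whitespace only: dots, brackets and case are preserved.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Lowercase, then split on anything that is not a letter or digit.
    // This loses the dots, the brackets and the original case.
    static List<String> lossyTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String overrides = "the.SuperClass.method1() the.SuperClass.method2()";
        System.out.println(whitespaceTokens(overrides));
        System.out.println(lossyTokens("the.SuperClass.method1()"));
    }
}
```

With the second style, a query for the.SuperClass.method1() would also
match any document containing just "the" or "superclass", which is
probably not what a code search wants.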

Good luck.  Sounds like an interesting project.


On Fri, Jun 11, 2010 at 2:35 PM, Marcel Bruch wrote:
> Hi!
> We are working on an experimental code-search engine that helps users to
> find example code snippets based on what a developer already typed inside
> her editor. Our “homemade search engine” produces some cool results but its
> performance is somewhat limited :-) Thus, we are evaluating whether Lucene
> can solve our performance issues. However, we are not familiar with Lucene
> and I wonder if some of you could help me to learn whether Lucene fits our
> problem well. Thanks in advance for your comments.
> The situation is as follows. For each source code file we extract some code
> properties like which types are used inside the code, which methods are
> overridden or which methods are called inside a method body etc. For each
> source code file we get a JSON structure similar to this:
> {
>     “class” : my.ExampleClass
>     “extends” : the.SuperClass
>     “overrides” :
>         - the.SuperClass.method1()
>         - the.SuperClass.method2()
>     “used types”:
>         - a.Type1
>         - a.Type2
>         -   ...
>     “used methods”:
>         - a.Type1.method32()
>         - a.Type1.method23()
>         - ...
> <few more things>
> }
> The scoring function we use is rather simplistic. Given a query (which looks
> much like the document above), we apply a simple matching strategy to each
> feature (i.e. “used methods”, “used types”, “overrides” etc.): the percentage
> of overlap between the query-document feature and the db-document feature.
> Then we simply multiply each feature score f_i by an individual feature
> weight w_i and sum it all up into one overall score.
> My questions are: is it meaningful to use Lucene in this setup or, put
> differently, can I implement that scoring scheme with Lucene easily? What
> would such a solution look like? By just subclassing Scorer?
> Many thanks in advance for advice
> All the best,
> Marcel
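
For reference, the scoring scheme described in the quoted message can be
sketched in plain Java, independent of Lucene. Two things here are
assumptions rather than details from the message: "percentage of overlap"
is read as the fraction of the query's values for a feature that also
occur in the document (|Q ∩ D| / |Q|), and the class name OverlapScorer
and the weights are invented for the example.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class OverlapScorer {

    // f_i: fraction of the query's values for one feature found in the document.
    // Assumption: overlap = |Q ∩ D| / |Q|; the message does not define it exactly.
    static double overlap(Set<String> query, Set<String> doc) {
        if (query.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(query);
        shared.retainAll(doc);
        return (double) shared.size() / query.size();
    }

    // Overall score: sum over the query's features of w_i * f_i.
    // Features without a configured weight contribute nothing.
    static double score(Map<String, Set<String>> query,
                        Map<String, Set<String>> doc,
                        Map<String, Double> weights) {
        double total = 0.0;
        for (Map.Entry<String, Set<String>> feature : query.entrySet()) {
            double w = weights.getOrDefault(feature.getKey(), 0.0);
            Set<String> docValues = doc.getOrDefault(feature.getKey(), Set.of());
            total += w * overlap(feature.getValue(), docValues);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc = Map.of(
                "extends", Set.of("the.SuperClass"),
                "overrides", Set.of("the.SuperClass.method1()", "the.SuperClass.method2()"),
                "used types", Set.of("a.Type1", "a.Type2"));

        // The query overrides one method the document has and one it lacks.
        Map<String, Set<String>> query = Map.of(
                "extends", Set.of("the.SuperClass"),
                "overrides", Set.of("the.SuperClass.method1()", "the.SuperClass.method3()"));

        // Hypothetical weights w_i.
        Map<String, Double> weights = Map.of("extends", 1.0, "overrides", 2.0);

        // 1.0 * (1/1) + 2.0 * (1/2)
        System.out.println(score(query, doc, weights)); // prints 2.0
    }
}
```

A brute-force version like this has to visit every document for every
query; the attraction of an inverted index such as Lucene's is that only
documents sharing at least one term with the query need to be scored.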