lucene-dev mailing list archives

From <>
Subject few ideas
Date Wed, 07 Jan 2004 07:26:49 GMT

The upgrade to the latest Lucene went smoothly. Great job, guys!
I did notice a few things, though, that would be nice to have fixed.


MultiFieldQueryParser discards any changes made to its instance
and offers no way to propagate them to the QueryParser instance
it uses internally. For example, there is no way to set the
operator to DEFAULT_OPERATOR_AND on the QueryParser that is
instantiated inside the static method parse().

The second problem is quite old. Field.Text(String, Reader)
doesn't store the content, but Field.Text(String, String) does.
It is unusual and confusing for two methods with the same
name to behave so differently. I realized this only after a year,
and my index then shrank from 20+ MB to 5+ MB.

The third problem is in the QueryParser grammar. The plus and minus
characters are keywords that may appear at the start
of a search phrase. If a user enters a string that contains
them anywhere else, a ParseException is thrown. My users often
search for C++, DC++ etc. They must escape the characters with \,
but that is too complex for many of them.

Could the grammar be rewritten so that + and - are keywords
ONLY at the START of a phrase?
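
Until the grammar changes, one possible workaround is to escape the query string before handing it to QueryParser. The sketch below is only an illustration: the class OperatorEscaper and its method escapeInnerOperators are hypothetical names, not part of Lucene. It assumes that tokens are separated by whitespace and it does not treat quoted phrases specially. A + or - is left alone when it starts a token and is prefixed with \ everywhere else.

```java
// Hypothetical helper, not part of Lucene: escapes '+' and '-' unless they
// are the first character of a whitespace-separated token.
final class OperatorEscaper {
    private OperatorEscaper() {}

    static String escapeInnerOperators(String query) {
        StringBuilder out = new StringBuilder(query.length() + 8);
        boolean atTokenStart = true;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace ends the current token; the next char starts a new one.
                atTokenStart = true;
                out.append(c);
                continue;
            }
            if (c == '\\' && i + 1 < query.length()) {
                // Already escaped by the user: copy the pair through untouched.
                out.append(c).append(query.charAt(++i));
                atTokenStart = false;
                continue;
            }
            if ((c == '+' || c == '-') && !atTokenStart) {
                // A sign inside or at the end of a token is treated as literal text.
                out.append('\\');
            }
            out.append(c);
            atTokenStart = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: C\+\+ DC\+\+ +linux -windows
        System.out.println(escapeInnerOperators("C++ DC++ +linux -windows"));
    }
}
```

The escaped string could then be passed to QueryParser.parse() as usual. Whether the analyzer keeps the + characters in the resulting term is a separate question that this sketch does not address.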

Fourth (and main) issue:

I have two fields (among others): title and content.
The title contains a few words; the content ranges from nothing
to hundreds of kilobytes of text. I wanted documents that match
on the title field to be prioritized, i.e. boosted.

My first implementation added the title string to the content field,
and the search was performed over content only. For example:
title: Compaq iPaq
content: Compaq iPaq To run Linux on iPaq you must do this
search query: ipaq Linux

Because this query matches the title too, I wanted to give it a
higher score. Unfortunately, there is no way to boost selected
text within one field. So I decided to search both fields
using MultiFieldQueryParser. Because of the
DEFAULT_OPERATOR_AND issue, I wrote my own implementation:

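// "fields" holds the field names to search; it is assumed to be
// declared elsewhere in the class.
public static BooleanQuery parse(String query, Analyzer analyzer)
    throws ParseException {
  BooleanQuery bQuery = new BooleanQuery();
  for (int i = 0; i < fields.length; i++) {
    // Parse the same query string against each field in turn.
    QueryParser queryParser = new QueryParser(fields[i], analyzer);
    Query q = queryParser.parse(query);
    // Optional clause: neither required nor prohibited.
    bQuery.add(q, false, false);
  }
  return bQuery;
}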

which worked as I expected: matches within the title score higher
(I even boosted the title field at index time). The problem is
that the score is too high in this case and too low in the other.
E.g. if the title is matched, the score ranges between 70% and 100%;
if the title is not matched, the score is below 8%.

hardware  Compaq FS940                  100%
hardware  COMPAQ / Silitek SK2865 USB    85%
diskuse   compaq evo a mandrake 9.1      84%
hardware  Compaq Deskpro 2000            83%
hardware  Compaq SmartArray 2            81%
hardware  IBM PD-1 (Matsushita LF-1195    8%
diskuse   Trvajici problem                7%
diskuse   ovladac tlan.o                  7%
diskuse   ovladač modemu                  6%
diskuse   Grafika pod VMWare              6%

After thinking about it for a while, I realized there is no
way to fix this difference. I expect that the
score is computed as a balance between the two BooleanClauses
within the BooleanQuery. If one has a high score (because
the text was short) and the other has a low score
(because there was a lot of text), the resulting score
is the bigger one, which penalizes fields with large texts.

I could try to boost the second field at index time
or at search time, but this cannot work well when
the length of the field ranges wildly from a few words to
pages of text.
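
One factor that probably contributes to the gap is length normalisation. To the best of my knowledge, Lucene's default similarity scales a field's score by 1/sqrt(number of terms in the field), so a short field gets a much larger factor than a long one. The sketch below only reproduces that arithmetic in plain Java. The class LengthNormDemo and the term counts are made up for illustration, and the real stored norms are rounded, so actual scores will differ.

```java
// Illustration only: mimics the 1/sqrt(numTerms) length normalisation that
// Lucene's default similarity is understood to apply. Not Lucene code.
final class LengthNormDemo {
    private LengthNormDemo() {}

    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        int titleTerms = 2;        // e.g. "Compaq iPaq"
        int contentTerms = 10000;  // hypothetical long content field

        double titleNorm = lengthNorm(titleTerms);     // about 0.707
        double contentNorm = lengthNorm(contentTerms); // 0.01

        System.out.println("title norm:   " + titleNorm);
        System.out.println("content norm: " + contentNorm);
        System.out.println("ratio:        " + (titleNorm / contentNorm));
    }
}
```

With these assumed lengths the title factor is roughly 70 times the content factor, which is the same order of magnitude as the jump from below 8% to above 70% in the results above. A single fixed boost on the content field cannot compensate for this when content length varies from a few terms to many thousands.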

Am I wrong? Do you see a way to do this?
I can only imagine one big change in the API:

class TaggedText {
  void addString(String data);
  void addString(String data, float boost);
  List<TaggedText.Part> getParts();
}

class Field {
  static Field UnStored(String field, TaggedText data);
  ... etc.
}
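
To make the idea more concrete, here is a minimal, self-contained sketch of what the TaggedText container could look like. Nothing like this exists in Lucene. The nested Part class, its field names and the default boost of 1.0 are assumptions made for illustration, and the sketch says nothing about how the indexer would consume the per-part boosts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the proposed API; not part of Lucene.
final class TaggedText {

    /** One run of text together with the boost it should receive. */
    static final class Part {
        final String text;
        final float boost;

        Part(String text, float boost) {
            this.text = text;
            this.boost = boost;
        }
    }

    private final List<Part> parts = new ArrayList<>();

    /** Adds text with a neutral boost (assumed to be 1.0). */
    void addString(String data) {
        addString(data, 1.0f);
    }

    /** Adds text whose matches should be weighted by the given boost. */
    void addString(String data, float boost) {
        parts.add(new Part(data, boost));
    }

    /** Returns the parts in insertion order as a read-only view. */
    List<Part> getParts() {
        return Collections.unmodifiableList(parts);
    }

    public static void main(String[] args) {
        TaggedText text = new TaggedText();
        text.addString("Compaq iPaq", 3.0f); // title, boosted
        text.addString("To run Linux on iPaq you must do this");
        for (Part p : text.getParts()) {
            System.out.println(p.boost + "  " + p.text);
        }
    }
}
```

With something like this, the title and the content could live in one field while title matches still count for more.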

What do you think about this?

