From: "Wermus Fernando"
To: "Lucene Users List"
Subject: RE: MultiFieldQueryParser seems broken... Fix attached.
Date: Wed, 8 Sep 2004 09:59:01 -0300

Bill,

I didn't receive any .java file. Could you send it again?

Thanks.

-----Original Message-----
From: Bill Janssen [mailto:janssen@parc.com]
Sent: Tuesday, September 7, 2004 10:06 PM
To: Lucene Users List
Cc: Ali Rouhi
Subject: MultiFieldQueryParser seems broken... Fix attached.

Hi!

I'm using Lucene for an application which has lots of fields per document, in which the users can specify in their config files which fields they wish to be included by default in a search. I'd been happily using MultiFieldQueryParser to do the searches, but the darn users started demanding more Google-like searches; that is, they want the search terms to be implicitly AND-ed instead of implicitly OR-ed. No problem, thinks I, I'll just set the "operator". Only to find this has no effect on MultiFieldQueryParser.

Once I looked at the code, I found that MultiFieldQueryParser combines the clauses at the wrong level -- it combines them at the outermost level instead of the innermost level. This means that if you have two fields, "author" and "title", and the search string "cutting lucene", you'll get the final query

   (title:cutting title:lucene) (author:cutting author:lucene)

If the search operator is "OR", this isn't a problem. But if it is "AND", you have two problems. The first is that MultiFieldQueryParser seems to ignore the operator entirely. But even if it didn't, the second problem is that the query formed would be

   +(title:cutting title:lucene) +(author:cutting author:lucene)

That is, if the word "Lucene" was in both the author field and the title field, the document would match even though "cutting" appears in neither field. This clearly isn't what the searcher intended.

You can re-write MultiFieldQueryParser, as I've done in the example code which I append here. This little program allows you to run either my parser (-DSearchTest.QueryParser=new) or the old parser (-DSearchTest.QueryParser=old). It allows you to use either OR (-DSearchTest.QueryDefaultOperator=or) or AND (-DSearchTest.QueryDefaultOperator=and) as the operator.
And it allows you to pick your favorite set of default search fields (-DSearchTest.QueryDefaultFields=author:title:body, for example). It takes one argument, a query string, and outputs the re-written query after running it through the query parser. So to evaluate the above query:

% java -classpath /import/lucene/lucene-1.4.1.jar:. \
    -DSearchTest.QueryDefaultFields="title:author" \
    -DSearchTest.QueryDefaultOperator=AND \
    -DSearchTest.QueryParser=old \
    SearchTest "cutting lucene"
query is (title:cutting title:lucene) (author:cutting author:lucene)
%

The class NewMultiFieldQueryParser does the combination at the inner level, using an override of "addClause", instead of the outer level. Note that it can't cover all cases (notably PhrasePrefixQuery, because that class has no accessor methods which allow one to introspect over it, and SpanQueries, because I don't understand them well enough :-). I post it here in advance of filing a formal bug report for early feedback. But it will show up in a bug report in the near future.

Running the above query with the new parser gives:

% java -classpath /import/lucene/lucene-1.4.1.jar:. \
    -DSearchTest.QueryDefaultFields="title:author" \
    -DSearchTest.QueryDefaultOperator=AND \
    -DSearchTest.QueryParser=new \
    SearchTest "cutting lucene"
query is +(title:cutting author:cutting) +(title:lucene author:lucene)
%

which I claim is what the user is expecting.

In addition, the new class uses an API more similar to QueryParser, so that the user has less to learn when using it. The code in it could probably just be folded into QueryParser, in fact.

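To make the difference between the two AND-ed query shapes concrete, here is a minimal sketch that does not use Lucene at all. It models a document as a map from field name to a set of terms and evaluates both shapes by hand. The class QueryShapeDemo, its methods, and the sample documents are hypothetical names invented for this illustration; they are not part of Lucene or of the SearchTest program.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, Lucene-free model of the two query shapes discussed above.
class QueryShapeDemo {

    // Build a toy "document": field name -> lower-cased terms split on whitespace.
    static Map<String, Set<String>> doc(String title, String author) {
        Map<String, Set<String>> d = new HashMap<>();
        d.put("title", new HashSet<>(Arrays.asList(title.toLowerCase().split("\\s+"))));
        d.put("author", new HashSet<>(Arrays.asList(author.toLowerCase().split("\\s+"))));
        return d;
    }

    static boolean has(Map<String, Set<String>> d, String field, String term) {
        Set<String> terms = d.get(field);
        return terms != null && terms.contains(term);
    }

    // Outer-level shape: +(f1:t1 f1:t2) +(f2:t1 f2:t2)
    // Every field must contain at least one of the terms.
    static boolean matchesOuter(Map<String, Set<String>> d, String[] fields, String[] terms) {
        for (String f : fields) {
            boolean any = false;
            for (String t : terms) {
                any |= has(d, f, t);
            }
            if (!any) {
                return false;
            }
        }
        return true;
    }

    // Inner-level shape: +(f1:t1 f2:t1) +(f1:t2 f2:t2)
    // Every term must appear in at least one of the fields.
    static boolean matchesInner(Map<String, Set<String>> d, String[] fields, String[] terms) {
        for (String t : terms) {
            boolean any = false;
            for (String f : fields) {
                any |= has(d, f, t);
            }
            if (!any) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] fields = {"title", "author"};
        String[] terms = {"cutting", "lucene"};

        // Made-up documents: the first never mentions "cutting".
        Map<String, Set<String>> onlyLucene = doc("Lucene in Practice", "Lucene Smith");
        Map<String, Set<String>> both = doc("Lucene internals", "Doug Cutting");

        System.out.println("outer, onlyLucene: " + matchesOuter(onlyLucene, fields, terms));
        System.out.println("inner, onlyLucene: " + matchesInner(onlyLucene, fields, terms));
        System.out.println("outer, both:       " + matchesOuter(both, fields, terms));
        System.out.println("inner, both:       " + matchesInner(both, fields, terms));
    }
}
```

In this toy model the outer-level shape accepts the first document although "cutting" occurs nowhere in it, while the inner-level shape rejects it; both shapes accept the second document.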
Bill

the code for SearchTest:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.FastCharStream;
import org.apache.lucene.queryParser.TokenMgrError;
import org.apache.lucene.queryParser.ParseException;

import java.io.File;
import java.io.StringReader;
import java.util.Date;
import java.util.HashMap;
import java.util.Iterator;
import java.util.StringTokenizer;

class SearchTest {

    static class NewMultiFieldQueryParser extends QueryParser {

        // Placeholder field name; clauses on this field are expanded over all fields.
        static private final String DEFAULT_FIELD = "%%";

        private String[] fields = null;

        public NewMultiFieldQueryParser (String[] f, Analyzer a) {
            super(DEFAULT_FIELD, a);
            fields = f;
        }

        // Replace each clause on the placeholder field with an OR over the
        // configured fields, then let QueryParser apply the operator as usual.
        protected void addClause(java.util.Vector clauses, int conj, int mods, Query q) {
            /*
            System.err.println("addClause: new query is " + q.toString());
            if (clauses.size() > 0) {
                System.err.println(" existing clauses are:");
                for (int i = 0; i < clauses.size(); i++)
                    System.err.println(" " + clauses.get(i).toString());
            }
            */
            if ((q instanceof TermQuery) &&
                DEFAULT_FIELD.equals(((TermQuery)q).getTerm().field())) {
                String text = ((TermQuery)q).getTerm().text();
                BooleanQuery q2 = new BooleanQuery();
                for (int i = 0; i < fields.length; i++) {
                    q2.add(new TermQuery(new Term(fields[i], text)), false, false);
                }
                q = q2;
            } else if ((q instanceof WildcardQuery) &&
                       DEFAULT_FIELD.equals(((WildcardQuery)q).getTerm().field())) {
                String text = ((WildcardQuery)q).getTerm().text();
                BooleanQuery q2 = new BooleanQuery();
                for (int i = 0; i < fields.length; i++) {
                    q2.add(new WildcardQuery(new Term(fields[i], text)), false, false);
                }
                q = q2;
            } else if ((q instanceof FuzzyQuery) &&
                       DEFAULT_FIELD.equals(((FuzzyQuery)q).getTerm().field())) {
                String text = ((FuzzyQuery)q).getTerm().text();
                BooleanQuery q2 = new BooleanQuery();
                for (int i = 0; i < fields.length; i++) {
                    q2.add(new FuzzyQuery(new Term(fields[i], text)), false, false);
                }
                q = q2;
            } else if ((q instanceof PrefixQuery) &&
                       DEFAULT_FIELD.equals(((PrefixQuery)q).getPrefix().field())) {
                String text = ((PrefixQuery)q).getPrefix().text();
                BooleanQuery q2 = new BooleanQuery();
                for (int i = 0; i < fields.length; i++) {
                    q2.add(new PrefixQuery(new Term(fields[i], text)), false, false);
                }
                q = q2;
            } else if ((q instanceof RangeQuery) &&
                       DEFAULT_FIELD.equals(((RangeQuery)q).getField())) {
                BooleanQuery q2 = new BooleanQuery();
                for (int i = 0; i < fields.length; i++) {
                    RangeQuery q3 = new RangeQuery(
                        new Term(fields[i], ((RangeQuery)q).getLowerTerm().text()),
                        new Term(fields[i], ((RangeQuery)q).getUpperTerm().text()),
                        ((RangeQuery)q).isInclusive());
                    q2.add(q3, false, false);
                }
                q = q2;
            } else if ((q instanceof PhraseQuery) &&
                       DEFAULT_FIELD.equals(((PhraseQuery)q).getTerms()[0].field())) {
                BooleanQuery q2 = new BooleanQuery();
                Term[] terms = ((PhraseQuery)q).getTerms();
                for (int i = 0; i < fields.length; i++) {
                    PhraseQuery q3 = new PhraseQuery();
                    for (int j = 0; j < terms.length; j++) {
                        q3.add(new Term(fields[i], terms[j].text()));
                    }
                    q2.add(q3, false, false);
                }
                q = q2;
            }
            super.addClause(clauses, conj, mods, q);
        }
    }

    private static void search (String querystring) {

        StringBuffer query_buffer = new StringBuffer();
        String[] query_default_fields = new String[] { "title", "authors", "contents" };
        int search_operator = QueryParser.DEFAULT_OPERATOR_AND;
        String query_parser = "new";
        Query query = null;

        try {
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // read the configuration from system properties
            String z = System.getProperties().getProperty("SearchTest.QueryDefaultFields");
            if (z != null) {
                query_default_fields = z.split(":");
            }

            z = System.getProperties().getProperty("SearchTest.QueryDefaultOperator");
            if (z != null) {
                if (z.equalsIgnoreCase("or"))
                    search_operator = QueryParser.DEFAULT_OPERATOR_OR;
                else if (z.equalsIgnoreCase("and"))
                    search_operator = QueryParser.DEFAULT_OPERATOR_AND;
            }

            z = System.getProperties().getProperty("SearchTest.QueryParser");
            if (z != null) {
                if (z.equalsIgnoreCase("new"))
                    query_parser = "new";
                else if (z.equalsIgnoreCase("old"))
                    query_parser = "old";
            }

            // form the query
            query_buffer.append(querystring);

            // run the query
            if (query_parser.equals("new")) {
                NewMultiFieldQueryParser p = new NewMultiFieldQueryParser(query_default_fields, analyzer);
                p.setOperator(search_operator);
                query = p.parse(query_buffer.toString());
            } else {
                MultiFieldQueryParser p = new MultiFieldQueryParser(query_default_fields[0], analyzer);
                p.setOperator(search_operator);
                query = p.parse(query_buffer.toString(), query_default_fields, analyzer);
            }

            System.err.println("query is " + query.toString());

        } catch (ParseException e) {
            System.out.println("* Invalid search expression '" + query_buffer.toString() + "' specified");
            System.err.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
        } catch (Exception e) {
            System.out.println("* Lucene search engine raised " + e.getClass() + " with message " + e.getMessage());
            System.err.println(" 'search' caught a " + e.getClass() + "\n with message: " + e.getMessage());
            e.printStackTrace(System.err);
        }
    }

    private static void usage () {
        // print usage message to stderr
        System.err.println("Usage: SearchTest 'QUERY'");
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            usage();
            return;
        }
        search (args[0]);
    }
}

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
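
For readers without a Lucene 1.4.1 jar at hand, the per-term expansion that the addClause override performs can be imitated with plain strings. The following is only a sketch under that assumption: ClauseExpansionDemo, outerLevel and innerLevel are hypothetical names, and the output merely mimics the query strings quoted in the message rather than calling Lucene's own Query.toString().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical string-only imitation of the two ways of combining clauses.
class ClauseExpansionDemo {

    // Outer-level combination: one group per field, each holding all terms.
    static String outerLevel(String[] fields, String[] terms, boolean and) {
        List<String> groups = new ArrayList<>();
        for (String f : fields) {
            List<String> clauses = new ArrayList<>();
            for (String t : terms) {
                clauses.add(f + ":" + t);
            }
            groups.add((and ? "+" : "") + "(" + String.join(" ", clauses) + ")");
        }
        return String.join(" ", groups);
    }

    // Inner-level combination: one group per term, expanded over all fields.
    // This corresponds to replacing each placeholder-field clause with an OR
    // across the configured fields before the operator is applied.
    static String innerLevel(String[] fields, String[] terms, boolean and) {
        List<String> groups = new ArrayList<>();
        for (String t : terms) {
            List<String> clauses = new ArrayList<>();
            for (String f : fields) {
                clauses.add(f + ":" + t);
            }
            groups.add((and ? "+" : "") + "(" + String.join(" ", clauses) + ")");
        }
        return String.join(" ", groups);
    }

    public static void main(String[] args) {
        String[] fields = {"title", "author"};
        String[] terms = {"cutting", "lucene"};
        System.out.println(outerLevel(fields, terms, true));
        // prints: +(title:cutting title:lucene) +(author:cutting author:lucene)
        System.out.println(innerLevel(fields, terms, true));
        // prints: +(title:cutting author:cutting) +(title:lucene author:lucene)
    }
}
```

Only the nesting order of the two loops differs between the methods, which is the whole point of moving the combination from the outer level to the inner level.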