lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Barbarelli" <mbarbare...@gmail.com>
Subject Re: Customizing Stop Word List?
Date Fri, 13 Jul 2007 09:38:41 GMT
Here's the sample code. Incidentally, this is in C#. I am using Lucene.NET,
but I am assuming this problem could be universal to all versions and that
this is a question that is best exposed to the collective wisdom of the Java
user group.

default list of ISO country codes.
*

public string[] DEFAULT_STOP_WORDS = { "a", "and", "are", "as", "at", "be",
"but", "by", "for", "if", "in", "into", "is", "no", "not", "of", "on", "or",
"s", "such", "t", "that", "the", "their", "then", "there", "these", "they",
"this", "to", "was", "will", "with", "inc","incorporated","co.","ltd","ltd."
};
*

create array containing stop words, but where ISO country code equivalents
are omitted.
*

public string[] MY_STOP_WORDS = { "a", "and", "are", "as", "but", "by",
"for", "if", "in", "into", "is", "no", "not", "of", "on", "or", "s", "such",
"t", "that", "the", "their", "then", "there", "these", "they", "this", "to",
"was", "will", "with", "inc", "incorporated", "co.", "ltd", "ltd." };
*

Next, create query and submit to search. Provide MY_STOP_WORDS array as
parameter to the standard analyzer.
*

Query query = QueryParser.Parse(strQuery, "company_name", new
StandardAnalyzer(MY_STOP_WORDS));

Hits hits = searcher.Search(query);
*

Note that the default field for the query object is company name. However,
multi-field queries will be submitted to the query object in the variable
"strQuery".

For example,
*

+(company_name:widgets ^10~ international ^5~ incorporated~ )
+(country_iso:US)
*

There is a bit of logic elsewhere in my application that constructs this
syntax based on field names and values submitted via the UI. However, if one
of those country code values is "AT", "BE", "IT", "IN", etc; then the query
logic is erroneously constructed as the following. Note that the country
code is missing.


*

+(company_name:belgium ^10~ telecom ^5~ ) +(country_iso:)
*



Note that the country ISO field is null. If a query is sumbitted to the
search object in this way, then I receive the following exception at
runtime.
*

Lucene.Net.QueryParsers.ParseException was unhandled by user code

Message="Encountered \")\" at line 1, column 60.\r\nWas expecting one
of:\r\n \"(\" ...\r\n <QUOTED> ...\r\n <TERM> ...\r\n <PREFIXTERM> ...\r\n
<WILDTERM> ...\r\n \"[\" ...\r\n \"{\" ...\r\n <NUMBER> ...\r\n "

Source="Lucene.Net"

StackTrace:

at Lucene.Net.QueryParsers.QueryParser.jj_consume_token(Int32 kind)

at Lucene.Net.QueryParsers.QueryParser.Clause(String field)

at Lucene.Net.QueryParsers.QueryParser.Query(String field)
*



And finally, here is how I am creating my index:


*

doc.Add(Field.Keyword("rec_id", entity_id.Trim()));

doc.Add(Field.Text("aaa", ob10_account_id.Trim()));

doc.Add(Field.Text("company_name", entity_name.Trim()));

doc.Add(Field.Text("VAT_reg", VAT_reg.Trim()));

doc.Add(Field.Text("account_type_description",
account_type_description.Trim()));

doc.Add(Field.Text("account_type", account_type.Trim()));

doc.Add(Field.Text("add_line1", add_line1.Trim()));

doc.Add(Field.Text("add_line2", add_line2.Trim()));

doc.Add(Field.Text("add_line3", add_line3.Trim()));

doc.Add(Field.Text("add_line4", add_line4.Trim()));

doc.Add(Field.Text("add_line5", add_line5.Trim()));

doc.Add(Field.Text("add_line6", add_line6.Trim()));

doc.Add(Field.Keyword("country_iso", country_iso.Trim()));

doc.Add(Field.Text("country_name", country_name.Trim()));

doc.Add(Field.Text("entity_status_desc", entity_status_desc.Trim()));

doc.Add(Field.Text("acct_status_desc", acct_status_desc.Trim()));

doc.Add(Field.Text("firstname", firstname.Trim()));

doc.Add(Field.Text("lastname", lastname.Trim()));



writer.AddDocument(doc);
*



Have I submitted my custom stop words incorrectly? Should I somehow use a
per-field analyzer for the country_ISO field? If so, which?

Thanks so much in advance for your help.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message