lucene-java-user mailing list archives

From "Michael Barbarelli" <mbarbare...@gmail.com>
Subject Re: Customizing Stop Word List?
Date Fri, 13 Jul 2007 15:16:53 GMT
Please disregard my previous request for assistance.  I've fixed the bug I was
struggling with, and it actually had nothing to do with the analyzer in
question.

Thanks very much.


On 7/13/07, Michael Barbarelli <mbarbarelli@gmail.com> wrote:
>
> Here's the sample code. Incidentally, this is in C#. I am using Lucene.NET,
> but I am assuming this problem could be universal to all versions and that
> this is a question that is best exposed to the collective wisdom of the Java
> user group.
>
> Default stop word list, some entries of which double as ISO country codes.
>
> public string[] DEFAULT_STOP_WORDS = {
>     "a", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
>     "into", "is", "no", "not", "of", "on", "or", "s", "such", "t", "that",
>     "the", "their", "then", "there", "these", "they", "this", "to", "was",
>     "will", "with", "inc", "incorporated", "co.", "ltd", "ltd."
> };
>
> Create a second array of stop words from which the ISO country code
> equivalents are omitted.
>
> public string[] MY_STOP_WORDS = {
>     "a", "and", "are", "as", "but", "by", "for", "if", "in", "into", "is",
>     "no", "not", "of", "on", "or", "s", "such", "t", "that", "the", "their",
>     "then", "there", "these", "they", "this", "to", "was", "will", "with",
>     "inc", "incorporated", "co.", "ltd", "ltd."
> };
>
> Next, create the query and submit it to the searcher, passing the
> MY_STOP_WORDS array to the StandardAnalyzer constructor.
>
> Query query = QueryParser.Parse(strQuery, "company_name",
>     new StandardAnalyzer(MY_STOP_WORDS));
>
> Hits hits = searcher.Search(query);
>
> Note that the default field for the query object is company name. However,
> multi-field queries will be submitted to the query object in the variable
> "strQuery".
>
> For example,
>
> +(company_name:widgets ^10~ international ^5~ incorporated~ ) +(country_iso:US)
>
> There is a bit of logic elsewhere in my application that constructs this
> syntax from the field names and values submitted via the UI. However, if one
> of those country code values is "AT", "BE", "IT", "IN", etc., then the query
> is erroneously constructed as follows, with the country code missing:
>
> +(company_name:belgium ^10~ telecom ^5~ ) +(country_iso:)
>
> Note that the country_iso clause is empty. If a query is submitted to the
> search object in this form, I receive the following exception at runtime:
>
> Lucene.Net.QueryParsers.ParseException was unhandled by user code
>
> Message="Encountered \")\" at line 1, column 60.\r\nWas expecting one
> of:\r\n \"(\" ...\r\n <QUOTED> ...\r\n <TERM> ...\r\n <PREFIXTERM> ...\r\n
> <WILDTERM> ...\r\n \"[\" ...\r\n \"{\" ...\r\n <NUMBER> ...\r\n "
>
> Source="Lucene.Net"
>
> StackTrace:
>
> at Lucene.Net.QueryParsers.QueryParser.jj_consume_token(Int32 kind)
> at Lucene.Net.QueryParsers.QueryParser.Clause(String field)
> at Lucene.Net.QueryParsers.QueryParser.Query(String field)
>
> And finally, here is how I am creating my index:
>
> doc.Add(Field.Keyword("rec_id", entity_id.Trim()));
> doc.Add(Field.Text("aaa", ob10_account_id.Trim()));
> doc.Add(Field.Text("company_name", entity_name.Trim()));
> doc.Add(Field.Text("VAT_reg", VAT_reg.Trim()));
> doc.Add(Field.Text("account_type_description", account_type_description.Trim()));
> doc.Add(Field.Text("account_type", account_type.Trim()));
> doc.Add(Field.Text("add_line1", add_line1.Trim()));
> doc.Add(Field.Text("add_line2", add_line2.Trim()));
> doc.Add(Field.Text("add_line3", add_line3.Trim()));
> doc.Add(Field.Text("add_line4", add_line4.Trim()));
> doc.Add(Field.Text("add_line5", add_line5.Trim()));
> doc.Add(Field.Text("add_line6", add_line6.Trim()));
> doc.Add(Field.Keyword("country_iso", country_iso.Trim()));
> doc.Add(Field.Text("country_name", country_name.Trim()));
> doc.Add(Field.Text("entity_status_desc", entity_status_desc.Trim()));
> doc.Add(Field.Text("acct_status_desc", acct_status_desc.Trim()));
> doc.Add(Field.Text("firstname", firstname.Trim()));
> doc.Add(Field.Text("lastname", lastname.Trim()));
>
> writer.AddDocument(doc);
>
> Have I submitted my custom stop words incorrectly? Should I somehow use a
> per-field analyzer for the country_iso field? If so, which?
>
> Thanks so much in advance for your help.
>
>
>
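On the per-field analyzer question in the quoted message: the follow-up says the actual bug lay elsewhere, but a stop-word filter that lowercases its input would indeed drop a value such as "IN". The sketch below is plain Java (chosen to match this list; the quoted code is C#) and does not call Lucene at all. It only imitates the idea of treating country_iso as a single untouched token while other fields are lowercased and stop-filtered. The class name, the KEYWORD_FIELDS set, and the analyze helper are assumptions made up for illustration, and the stop list is abbreviated from MY_STOP_WORDS. In Lucene itself the analogous tools would be PerFieldAnalyzerWrapper together with a non-tokenizing analyzer such as KeywordAnalyzer, assuming the version in use ships them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PerFieldStopWords {
    // Abbreviated from MY_STOP_WORDS in the quoted message.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "and", "are", "as", "but", "by", "for", "if", "in", "into", "is",
        "no", "not", "of", "on", "or", "the", "to", "with", "inc", "ltd"));

    // Hypothetical: fields whose values are kept verbatim as one token.
    static final Set<String> KEYWORD_FIELDS =
        new HashSet<>(Arrays.asList("rec_id", "country_iso"));

    // Returns the tokens that survive analysis for the given field.
    static List<String> analyze(String field, String text) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim();
        if (KEYWORD_FIELDS.contains(field)) {
            // Keyword-style field: no lowercasing, no stop-word removal.
            if (!trimmed.isEmpty()) {
                tokens.add(trimmed);
            }
            return tokens;
        }
        // Text-style field: split on whitespace, lowercase, drop stop words.
        for (String raw : trimmed.split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("company_name", "IN")); // prints []
        System.out.println(analyze("country_iso", "IN"));  // prints [IN]
    }
}
```

The point of the design is that the decision is made per field, not by trimming the stop list: removing "at" and "be" from the list still leaves other two-letter words that are also country codes.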

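A separate point from the quoted message is the ParseException raised by the empty clause "+(country_iso:)". Whatever removes the value upstream, the query-building code can refuse to emit a clause that has no value. The following plain-Java sketch shows one way to do that; the QueryBuilderSketch class and buildQuery method are hypothetical names, not part of Lucene or of the original application, and the boosts and fuzzy markers from the quoted queries are left out for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class QueryBuilderSketch {
    // Builds "+(field:value) +(field:value)" and skips fields with no value.
    // Assumption: values are already safe for the query syntax; this sketch
    // does not escape special characters.
    static String buildQuery(Map<String, String> fieldValues) {
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> entry : fieldValues.entrySet()) {
            String value = entry.getValue() == null ? "" : entry.getValue().trim();
            if (value.isEmpty()) {
                continue; // never emit "+(field:)", which the parser rejects
            }
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append("+(").append(entry.getKey()).append(':')
                 .append(value).append(')');
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("company_name", "belgium telecom");
        fields.put("country_iso", ""); // value lost somewhere upstream
        System.out.println(buildQuery(fields)); // prints +(company_name:belgium telecom)
    }
}
```

Skipping the clause avoids the exception but also silently widens the search, so an application may prefer to report the missing country code to the user instead.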