lucene-java-user mailing list archives

From Vitaly Funstein <vfunst...@gmail.com>
Subject Re: Difference in behaviour between LowerCaseFilter and String.toLowerCase()
Date Mon, 03 Dec 2012 23:09:25 GMT
If you don't need to support case-sensitive search in your application,
then you may be able to get away with adding each string field to your
documents twice: a lowercased version that is indexed but not stored,
and the verbatim value that is stored but not indexed. For example (this
is Lucene 4 code, but the same idea applies to other versions):

    // indexed - not stored
    doc.add(new Field(fieldName, value.toLowerCase(),
            StringField.TYPE_NOT_STORED));

    // stored - not indexed
    doc.add(new Field(fieldName, value, StoredField.TYPE));

Of course, to preserve symmetry for search, you would also need to force
string terms in your queries to lower case.
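
One way to keep the two sides symmetric is to route both the indexed
value and the query term through a single helper. The sketch below is
plain JDK code with no Lucene dependency; the class name TermNormalizer
and the choice of Locale.ROOT are assumptions for illustration, not
something from the code above. Pinning the locale keeps the result
independent of the JVM default locale (for example Turkish):

```java
import java.util.Locale;

public class TermNormalizer {
    // Hypothetical helper: the single place where case folding happens.
    // Locale.ROOT is an assumption; any fixed locale would do, as long as
    // indexing and querying use the same one.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String indexedValue = normalize("Hello World"); // value for the indexed field
        String queryTerm = normalize("HELLO world");    // term built from user input
        // Both sides went through the same function, so they match.
        System.out.println(indexedValue.equals(queryTerm));
    }
}
```

In the Lucene snippet above, the result of normalize(value) would go
into the indexed field, and the same call would wrap whatever text is
turned into a query term.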

On Sat, Dec 1, 2012 at 1:02 AM, Dawid Weiss <dawid.weiss@gmail.com> wrote:

> Iterating character-by-character is different than considering the
> entire string at once so your observation is correct, that's how it's
> supposed to work. In particular, note this in String#toLowerCase
> documentation:
>
> "Since case mappings are not always 1:1 char mappings, the resulting
> String may be a different length than the original String."
>
> So it simply cannot be the same as iterating char-by-char.
>
> Dawid
>
> On Sat, Dec 1, 2012 at 6:32 AM, Trejkaz <trejkaz@trypticon.org> wrote:
> > On Fri, Nov 30, 2012 at 8:22 PM, Ian Lea <ian.lea@gmail.com> wrote:
> >> Sounds like a side effect of possibly different, locale-dependent,
> >> results of using String.toLowerCase() and/or Character.toLowerCase().
> >>
> >> http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#toLowerCase()
> >> specifically mentions Turkish.
> >>
> >> A Google search for "Character.toLowerCase() turkish" gets hits which
> >> sound relevant.
> >
> > Certainly Turkish has special rules because of that uppercase I with
> > dot. I was more wondering whether LowerCaseFilter was intentionally
> > doing it differently to String.toLowerCase() or whether it was some
> > kind of unintentional side-effect of using Character.toLowerCase()
> > iteratively.
> >
> > TX
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
>
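
The length difference Dawid describes in the quoted thread can be
reproduced with plain JDK calls. This is a minimal sketch, not Lucene's
actual LowerCaseFilter code: lowerPerCodePoint is a hypothetical
stand-in for any filter that lowercases one character at a time, and
U+0130 (capital I with dot above) is just one convenient input:

```java
import java.util.Locale;

public class LowerCaseDemo {
    // Lowercases one code point at a time, so the output always has
    // exactly as many code points as the input.
    static String lowerPerCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "\u0130"; // LATIN CAPITAL LETTER I WITH DOT ABOVE

        // Whole-string mapping may expand: here to 'i' + U+0307 (combining dot above).
        String whole = input.toLowerCase(Locale.ROOT);
        // Per-character mapping is always 1:1: here to a plain 'i'.
        String perChar = lowerPerCodePoint(input);

        System.out.println(whole.length());        // 2
        System.out.println(perChar.length());      // 1
        System.out.println(whole.equals(perChar)); // false
    }
}
```

A term lowercased one way at index time and the other way at query time
would therefore not match for inputs like this one.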
