lucene-java-user mailing list archives

From "Tomi NA" <hef...@gmail.com>
Subject Re: accented characters, wildcards and other problems
Date Fri, 14 Jul 2006 16:13:28 GMT
On 7/13/06, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> Bok Tomi,
>
> What do you mean by "terms are misrepresented"?  What should they be,
> and what are you seeing?

I mean that 3 of the 5 accented characters appear in the index with
their accents displayed correctly, but the remaining 2 appear as
characters I don't know how to pronounce or even name. Somewhere
along the line, some encoding/decoding step wrongly assumes the data
is encoded in a particular way.
Update: I've managed to solve the problem locally (when I index a test
directory with accented characters on my ext3 partition), but when I
try indexing a directory I access via a samba mount, I'm stuck with
the old problem again. It could be the iocharset setting, although
there are 2 other encoding-related settings which might also be the
cause.
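
To make the symptom concrete, here is a minimal sketch of how a wrong
charset guess can leave some accented letters intact and mangle others.
The pairing of ISO-8859-2 and windows-1250 is purely an assumption for
illustration (nothing in this thread confirms which encodings are
involved), and the class and method names are made up:

```java
import java.nio.charset.Charset;

class MojibakeDemo {
    // Encode text with one charset, then decode the same bytes with another.
    // This mimics a reader that guesses the wrong encoding for its input.
    static String misdecode(String text, String actual, String assumed) {
        byte[] bytes = text.getBytes(Charset.forName(actual));
        return new String(bytes, Charset.forName(assumed));
    }

    public static void main(String[] args) {
        // The five Croatian letters c-caron, c-acute, d-stroke, s-caron, z-caron,
        // written as Unicode escapes so the source file encoding does not matter.
        String letters = "\u010D\u0107\u0111\u0161\u017E";

        // ASSUMPTION: bytes are really ISO-8859-2 but get read as windows-1250.
        String seen = misdecode(letters, "ISO-8859-2", "windows-1250");

        // Report, per letter, whether it survived the wrong decoding.
        for (int i = 0; i < letters.length(); i++) {
            boolean ok = letters.charAt(i) == seen.charAt(i);
            System.out.printf("U+%04X -> U+%04X %s%n",
                    (int) letters.charAt(i), (int) seen.charAt(i),
                    ok ? "(intact)" : "(mangled)");
        }
    }
}
```

With that particular pairing, the first three letters share the same
byte values in both charsets and come through intact, while the last
two turn into different letters (U+0105 and U+013E). Whether that is
what happens on the samba mount is a guess; the same comparison can be
run with any other pair of charset names.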

> > What I'm not clear on is how can I see the problematic *terms* in
> > the list of terms, but not the documents they're stored in?
>
> Are you saying that the content got indexed, but the file names did not?

I'm saying that I expect to see a list of indexed documents in the
"documents" list, and I don't see the documents containing the
problematic accented characters. However, I do see the terms with the
problematic accented characters, although they are misrepresented.

> Out of curiosity (note my last name), what analyzer/tokenizer are you
> using?  Is there an equivalent of Porter stemmer for Croatian?  I
> could use that. :)

I'm very new to the technology, so I'm using whatever nutch is using
by default. As far as the stemmer's concerned, I'd say that wildcards
go a long way in providing the necessary functionality, probably even
better than automatic stemming. However, I appreciate the fact that
most users' minds don't come with an inbuilt regexp constructor. :)
As far as Croatian is concerned, a stemming database was completed
just recently (by "completed" I mean "a usable language coverage") at
the department of Croatian language studies (for want of a better
word)...the problem is, however, it's not publicly available. You see,
when I pay my taxes out of which their salaries are paid, it doesn't
seem to obligate them to produce value for me as their indirect
investor. But that's something I'd like to say to their faces, with
just a tad more feeling. ;)
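
On the wildcards-versus-stemming point above, a rough sketch of what I
mean, in plain Java rather than through any Lucene API. The word list
(case forms of "knjiga", "book") and the helper names are only
illustrative assumptions, and the matching below is a stand-in for a
wildcard query, not how nutch or Lucene implement it:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class WildcardDemo {
    // Translate a simple wildcard pattern into a regex:
    // '*' matches any run of characters, '?' matches exactly one.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Return the terms that match the whole wildcard pattern.
    static List<String> matching(String wildcard, List<String> terms) {
        Pattern p = toRegex(wildcard);
        List<String> hits = new ArrayList<String>();
        for (String t : terms) {
            if (p.matcher(t).matches()) {
                hits.add(t);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical index terms: inflected forms of one Croatian noun.
        List<String> terms = Arrays.asList(
                "knjiga", "knjige", "knjizi", "knjigu", "knjigom");

        // prints [knjiga, knjige, knjigu, knjigom]
        System.out.println(matching("knjig*", terms));
        // prints [knjiga, knjige, knjizi, knjigu, knjigom]
        System.out.println(matching("knji*", terms));
    }
}
```

The first pattern misses "knjizi" because the stem consonant changes
in that form, which is exactly the kind of thing the user has to know
to pick the shorter pattern, and the kind of thing a stemmer would be
expected to handle for them.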

t.n.a.


