lucene-java-user mailing list archives

From "Hannah c" <hanna...@hotmail.com>
Subject RE: AW: Problem indexing Spanish Characters
Date Wed, 19 May 2004 16:34:43 GMT
Hi,

I had a quick look at the sandbox, but a Spanish stemmer is not what I
need. However, there must be a replacement tokenizer that supports
foreign characters to go along with the foreign-language Snowball stemmers.
Does anyone know where I could find one?

In answer to Peter's question: yes, I'm also using UTF-8 encoded XML
documents as the source.
Below is an example of what happens when I tokenize the text using the
StandardTokenizer. The tokens marked with * are the ones that were split.

Thanks Hannah



------------------text I'm trying to index

century palace known as la “Fundación Hospital de Na. Señora del Pilar”

-----------------tokens output by StandardTokenizer

century
palace
known
as
la
â
FundaciĂ    *
n               *
Hospital
de
Na
SeĂ          *
ora           *
del
Pilar
â
-----------------------
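
The split tokens above are consistent with UTF-8 bytes being decoded with a single-byte charset before they reach the tokenizer, although the thread does not confirm that this is the cause. The following is a minimal sketch of that effect using only the JDK. The class and method names (MojibakeDemo, decode, letterTokens) are hypothetical, and letterTokens is a simplified stand-in that splits on non-letters; it is not Lucene's StandardTokenizer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Lucene.
class MojibakeDemo {

    // Decode raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Simplified stand-in for a tokenizer: keeps runs of letters and
    // starts a new token at every non-letter character.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Fundacion" with o-acute, written as an escape so the source
        // file's own encoding does not matter.
        byte[] utf8 = "Fundaci\u00f3n".getBytes(StandardCharsets.UTF_8);

        // Wrong: each UTF-8 byte becomes one character. The second byte of
        // o-acute (0xB3) decodes to U+00B3, which is not a letter, so the
        // word is cut in two.
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(letterTokens(wrong).size()); // 2

        // Right: the two bytes decode to a single letter and the word
        // stays in one piece.
        String right = decode(utf8, StandardCharsets.UTF_8);
        System.out.println(letterTokens(right).size()); // 1
    }
}
```

If this is what is happening, the likely remedy is to open the source files with an explicit charset, for example by wrapping the input stream in new InputStreamReader(stream, "UTF-8") before handing the Reader to the tokenizer, rather than relying on the platform default encoding.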



>From: "Peter M Cipollone" <lu1@bihvhar.com>
>To: <hannahc7@hotmail.com>
>Subject: Re: Problem indexing Spanish Characters
>Date: Wed, 19 May 2004 11:41:28 -0400
>
>Could you send some sample text that causes this to happen?
>
>----- Original Message -----
>From: "Hannah c" <hannahc7@hotmail.com>
>To: <lucene-user@jakarta.apache.org>
>Sent: Wednesday, May 19, 2004 11:30 AM
>Subject: Problem indexing Spanish Characters
>
>
> >
> > Hi,
> >
> > I am indexing a number of English articles on Spanish resorts. As such
> > there are a number of Spanish characters throughout the text; most of
> > these are in the place names, which are the type of words I would like to
> > use as queries. My problem is with the StandardTokenizer class, which cuts
> > the word in two when it comes across any of the Spanish characters. I had
> > a look at the source, but the code was generated by JavaCC and so is not
> > very readable. I was wondering if there is a way around this problem, or
> > which area of the code I would need to change to avoid this.
> >
> > Thanks
> > Hannah Cumming
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> >
> >
>




>From: PEP AD Server Administrator 
><PEPADServer.Administrator@erl9.siemens.de>
>Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
>To: "'Lucene Users List'" <lucene-user@jakarta.apache.org>
>Subject: AW: Problem indexing Spanish Characters
>Date: Wed, 19 May 2004 18:08:56 +0200
>
>Hi Hannah, Otis
>I cannot help, but I have exactly the same problem with special German
>characters. I used the Snowball analyzer, but this does not help because the
>problem (tokenizing) appears before the analyzer comes into action.
>I just posted the question "Problem tokenizing UTF-8 with geman umlauts"
>a few minutes ago, which describes my problem; Hannah's seems to be similar.
>Do you also have UTF-8 encoded pages?
>
>Peter MH
>
>-----Original Message-----
>From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
>Sent: Wednesday, 19 May 2004 17:42
>To: Lucene Users List
>Subject: Re: Problem indexing Spanish Characters
>
>
>It looks like the Snowball project supports Spanish:
>http://www.google.com/search?q=snowball spanish
>
>If it does, take a look at Lucene Sandbox.  There is a project that
>allows you to use Snowball analyzers with Lucene.
>
>Otis
>
>
>--- Hannah c <hannahc7@hotmail.com> wrote:
> >
> > Hi,
> >
> > I am indexing a number of English articles on Spanish resorts. As such
> > there are a number of Spanish characters throughout the text; most of
> > these are in the place names, which are the type of words I would like to
> > use as queries. My problem is with the StandardTokenizer class, which cuts
> > the word in two when it comes across any of the Spanish characters. I had
> > a look at the source, but the code was generated by JavaCC and so is not
> > very readable. I was wondering if there is a way around this problem, or
> > which area of the code I would need to change to avoid this.
> >
> > Thanks
> > Hannah Cumming
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>


---------------------------------------------------------------------
Hannah Cumming
hannahc7@hotmail.com



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

