lucene-java-user mailing list archives

From Kudrettin Güleryüz <kudret...@gmail.com>
Subject Re: utf-8 issues depending on host
Date Tue, 23 May 2017 21:41:14 GMT
Thank you for the explanation and the tool.

On Tue, May 23, 2017 at 4:07 PM Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi,
>
> FileReader is a broken class; this is well known. For that reason it is
> part of the forbidden-apis list, which Lucene also uses to prevent
> issues like yours in our source code. To specify the character set
> correctly when reading a file, you have to use a FileInputStream and wrap
> it in an InputStreamReader. You can pass the charset to the InputStreamReader.
>
> See https://github.com/policeman-tools/forbidden-apis
>
> Uwe
>
> -----
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Kudrettin Güleryüz [mailto:kudrettin@gmail.com]
> > Sent: Tuesday, May 23, 2017 9:13 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: utf-8 issues depending on host
> >
> > I create the object as new FileReader(file),
> > where file comes from File.listFiles() as below:
> >   File[] files = cwd.listFiles(getSourceCodeFilter());
> >   for (File file : files)
> >
> > FileReader doesn't seem to have a constructor that lets me specify an
> > encoding, and in fact I feel like I should not be setting it to UTF-8 by
> > default, anyways.
> >
> > Let me revise my question: how can I make sure all hosts running this
> > indexer code behave as expected? It certainly runs as expected on one
> > machine but not on others. The one that runs as expected is Debian 8.3;
> > the others are Debian 7.4.
> >
> > Thank you
> >
> > On Tue, May 23, 2017 at 10:45 AM Adrien Grand <jpountz@gmail.com>
> > wrote:
> >
> > > The issue is likely due to how you create the FileReader that you pass to
> > > TextField. Maybe you don't give it the right encoding?
> > >
> > > On Tue, May 23, 2017 at 4:38 PM Kudrettin Güleryüz <kudrettin@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Depending on the host running the indexer, UTF-8 characters are not
> > > > stored (not correctly, anyways) in the Lucene index.
> > > >
> > > > Interestingly, locale output is identical on all hosts but the
> > > > indexed output is different.
> > > >
> > > > Apparently using FileReader could be the culprit. I am currently using
> > > > TextField(String name, Reader reader).
> > > >
> > > > How can I improve this? What is the suggested way of handling this in
> > > > 5.2.1? TextField(String name, String value, Store store)?
> > > >
> > > > Thank you
> > > >
> > >
>
>
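
A minimal sketch of the approach described in the reply above: open the file with a FileInputStream and wrap it in an InputStreamReader that is given an explicit charset, so decoding no longer depends on each host's default encoding. The class name ExplicitCharsetReader and the helpers openUtf8 and readAll are hypothetical names used only for illustration. Choosing UTF-8 is an assumption that the indexed files really are UTF-8 encoded; the thread does not confirm this.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper class, not part of Lucene or the original indexer.
public class ExplicitCharsetReader {

    // Opens a file for reading with an explicit charset instead of the
    // platform default. UTF-8 is an assumption about the input files.
    static Reader openUtf8(File file) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
    }

    // Drains a Reader into a String; only used here to check the round trip.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The default charset can differ between hosts and JVM launches.
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // Write a small UTF-8 file containing non-ASCII characters.
        // Unicode escapes keep this source file independent of javac -encoding.
        String expected = "G\u00fclery\u00fcz";
        File tmp = File.createTempFile("charset-demo", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), expected.getBytes(StandardCharsets.UTF_8));

        // Read it back with the explicit charset and compare.
        try (Reader reader = openUtf8(tmp)) {
            String actual = readAll(reader);
            System.out.println("Round trip ok: " + expected.equals(actual));
        }
    }
}
```

The Reader returned by openUtf8 could be passed wherever new FileReader(file) is used today, for example to TextField(String name, Reader reader). Printing Charset.defaultCharset() on each host is one way to check whether the JVM defaults differ even when the locale output looks identical.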
