lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick Diviacco <patrick.divia...@gmail.com>
Subject file formats: MacRoman and UTF-8...
Date Mon, 28 Mar 2011 07:03:50 GMT
When I run my Lucene app and a parse a xml file I get the following error
due to some fonts such as "é" written in the text file.

If I save the text file as UTF-8 with my text editor I don't have this
issue, but when I create it with a java app, it is saved as MacRoman.

How can I specify a different format with Java instead ?

thanks

[CODE]Exception in thread "main"
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:
Invalid byte 1 of 1-byte UTF-8 sequence.
at
com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:684)
at
com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:554)
at
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1742)
at
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1416)
at
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2792)
at
com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
at
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
at
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
at
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at
com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
at
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
at
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
at org.apache.commons.digester.Digester.parse(Digester.java:1871)
at CollectionIndexer.main(CollectionIndexer.java:111)[/CODE]

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message