lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Denis Haskin <De...@HaskinFerguson.net>
Subject Re: Searching Microsoft Word , Excel and PPT files for Japanese
Date Thu, 20 May 2004 17:36:17 GMT
I really don't know much about this, but I would disagree with Will's 
assumption that because non-ascii show up as entities when you do "Save 
as HTML" that implies that they are stored *internally* within 
Word/Excel/Ppt files a entities.  That just seems unlikely to me.

Maybe somebody over at the Poi project [1] knows something about this?

My 2 cents...

dwh

[1] http://jakarta.apache.org/poi/index.html



>-----Original Message-----
>From: wallen@Cyveillance.com [mailto:wallen@Cyveillance.com] 
>Sent: Thursday, May 20, 2004 10:43 PM
>To: lucene-user@jakarta.apache.org
>Subject: RE: Searching Microsoft Word , Excel and PPT files for Japanese
>
>I believe MS apps store non-ascii characters as entities internally instead
>of using unicode.  You can see evidence of this if you save your file as an
>HTML file and look at the source.  You will have to adjust your parser to
>convert the Windows-1252 characters/entities to unicode (UTF-8 or UTF-16).
>
>-Will
>  
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message