lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Becker <>
Subject Re: Spanish analyzer and Indexing StarOffice docs
Date Mon, 21 Jul 2003 22:31:49 GMT
Hi Oscar,

we have been looking into the StarOffice/OpenOffice problem, although we 
haven't done it and probably won't anytime soon as we have to move on to 
other things. I see two approaches, both with variants:

(1) use the fact that it is just zipped XML: use a ZipInputStream to 
open the files and parse the XML contents. You can use the standard 
approaches for parsing XML or you can tweak it. It might be worthwhile 
to look only at the contents part for the body indexing and try to be 
smart about the metadata information while ignoring the layout bits. 
Should be relatively easy since all these parts are in separate files.

(2) use the UDK ( Drawbacks: even though it 
seems documented its sheer size will make it a bit hard to get into. You 
will also have to deploy a large library which is not pure Java. 
Advantages: you will not only get the SO/OOo documents as good as the 
programs parse them themself, but also everything they can import. And 
that will be way better than anything we could get so far for Word 
documents. A UDK-based document parser would most likely be the killer 
for enterprise document indexing -- you wouldn't need much more if 
anything at all.

We might actually still go for (1) since that is really easy, but we 
don't have the time for (2). Although we'd love to have it, so if you go 
for it tell us :-)


Oscar Herrera wrote:

>Hi. I'd like to know if some of you could help me finding a spanish analyzer (free if
possible). I'd also like to know how can I index a file made on StarOffice 5.x (.sdw and .sdx
files), I've been looking on google for them but I have not found anything about this, 
>Thank you in advance for your collaboration,
>Oscar Herrera
>Bogotá, Colombia, SA.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message