lucene-java-user mailing list archives

From "Oscar Herrera" <>
Subject Re: Spanish analyzer and Indexing StarOffice docs
Date Tue, 22 Jul 2003 13:54:01 GMT

First of all, thank you all for your help. I've been looking at StarOffice
5.x documents and trying to unzip them, but I have not been successful.

If I'm not mistaken, documents are saved as XML files only in versions later
than 5.2 of StarOffice (OpenOffice and StarOffice 6; please correct me if
this isn't true). So I'd like to know whether there is already a way to index
files from version 5.2 and below, or any clue on how to do this.

Thank you in advance,

Oscar Herrera
Bogotá, Colombia, SA.

----- Original Message -----
From: "Peter Becker" <>
To: "Lucene Users List" <>
Sent: Monday, July 21, 2003 5:31 PM
Subject: Re: Spanish analyzer and Indexing StarOffice docs

> Hi Oscar,
> we have been looking into the StarOffice/OpenOffice problem, although we
> haven't done it and probably won't anytime soon as we have to move on to
> other things. I see two approaches, both with variants:
> (1) use the fact that it is just zipped XML: use a ZipInputStream to
> open the files and parse the XML contents. You can use the standard
> approaches for parsing XML or you can tweak it. It might be worthwhile
> to look only at the contents part for the body indexing and try to be
> smart about the metadata information while ignoring the layout bits.
> Should be relatively easy since all these parts are in separate files.
> (2) use the UDK. Drawbacks: even though it
> seems documented its sheer size will make it a bit hard to get into. You
> will also have to deploy a large library which is not pure Java.
> Advantages: you will not only get the SO/OOo documents parsed as well as
> the programs parse them themselves, but also everything they can import. And
> that will be way better than anything we could get so far for Word
> documents. A UDK-based document parser would most likely be the killer
> for enterprise document indexing -- you wouldn't need much more if
> anything at all.
> We might actually still go for (1) since that is really easy, but we
> don't have the time for (2). Although we'd love to have it, so if you go
> for it tell us :-)
> HTH,
>    Peter
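
Approach (1) above can be sketched roughly as follows. This is only a minimal illustration, not code from either poster: the class name ZippedXmlText is made up, and it assumes that the body text lives in a zip entry named content.xml, as in the zipped XML formats (not the 5.x binary files). It returns plain text that could then be handed to whatever indexing code you already have.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: pulls the character data out of content.xml
// inside a zipped XML document (assumed entry name).
final class ZippedXmlText {

    /** Returns the text of content.xml, or "" if the archive has no such entry. */
    static String extract(InputStream in)
            throws IOException, SAXException, ParserConfigurationException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Skip the layout and metadata entries; only the body is wanted here.
                if (!"content.xml".equals(entry.getName())) {
                    continue;
                }
                final StringBuilder text = new StringBuilder();
                DefaultHandler handler = new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        // Keep words from adjacent elements apart.
                        text.append(' ');
                    }

                    @Override
                    public InputSource resolveEntity(String publicId, String systemId) {
                        // Assumption: a file may declare an external DTD that is not
                        // available locally, so resolve it to an empty document.
                        return new InputSource(new StringReader(""));
                    }
                };
                // The zip stream is positioned at the entry, so the parser
                // reads just this entry's bytes.
                SAXParserFactory.newInstance().newSAXParser().parse(zip, handler);
                return text.toString().trim();
            }
        }
        return "";
    }
}
```

Calling ZippedXmlText.extract(new FileInputStream(file)) would give one string per document. Metadata would need the same treatment applied to its own entry, and none of this helps with files that are not zip archives.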
> Oscar Herrera wrote:
> >Hi. I'd like to know if any of you could help me find a Spanish
> >analyzer (free if possible). I'd also like to know how I can index a file
> >made with StarOffice 5.x (.sdw and .sdx files). I've been searching Google
> >for this but have not found anything about it.
> >
> >Thank you in advance for your collaboration,
> >
> >Oscar Herrera
> >Bogotá, Colombia, SA.
> >
> >
> >
