lucene-java-user mailing list archives

From Peter Becker <pbec...@dstc.edu.au>
Subject Re: Spanish analyzer and Indexing StarOffice docs
Date Tue, 22 Jul 2003 22:11:47 GMT
Hello Oscar,

yes, you are right -- XML was introduced for 6.0. Sorry -- I confused 
the numbers.

The only reasonable way to go about this that I can see right now is 
option (2) from my earlier mail -- things might have advanced with the 
UDK. It most likely won't be easy, but if running OpenOffice as an 
application during indexing is an option, you should at least be able 
to use its import filters.

I assume converting everything up front via SO6/OOo is not an option. 
If it is, you might want to look into File->Auto Pilot->Document 
conversion... in these applications.

HTH,
  Peter


Oscar Herrera wrote:

>Hi,
>
>First of all, thank you all for your help. I've been looking at
>StarOffice 5.x documents and trying to unzip them, but I've not
>been successful.
>
>If I'm not wrong, data is only saved as XML files in versions later
>than 5.2 of StarOffice (OpenOffice and StarOffice 6, please correct me if
>this isn't true), so I'd like to know if there is already a way to index
>files of version 5.2 and below, or any clue on how to do this.
>
>Thank you in advance,
>
>Oscar Herrera
>Bogotá, Colombia, SA.
>
>
>----- Original Message -----
>From: "Peter Becker" <pbecker@dstc.edu.au>
>To: "Lucene Users List" <lucene-user@jakarta.apache.org>
>Sent: Monday, July 21, 2003 5:31 PM
>Subject: Re: Spanish analyzer and Indexing StarOffice docs
>
>
>  
>
>>Hi Oscar,
>>
>>we have been looking into the StarOffice/OpenOffice problem, although we
>>haven't done it and probably won't anytime soon as we have to move on to
>>other things. I see two approaches, both with variants:
>>
>>(1) use the fact that it is just zipped XML: use a ZipInputStream to
>>open the files and parse the XML contents. You can use the standard
>>approaches for parsing XML or you can tweak it. It might be worthwhile
>>to look only at the contents part for the body indexing and try to be
>>smart about the metadata information while ignoring the layout bits.
>>Should be relatively easy since all these parts are in separate files.
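
A minimal sketch of approach (1), under a few assumptions: the class and method names (`ZippedXmlText`, `extractBodyText`) are made up for illustration, and the body text is assumed to live in a package entry named `content.xml`, as in the StarOffice 6 / OpenOffice.org zipped XML format. It does not apply to the binary StarOffice 5.x files discussed above. The sketch skips every other entry (metadata, styles, layout) and collects only the character data of the content part.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ZippedXmlText {

    /** Returns the text of the "content.xml" entry, or "" if there is none. */
    public static String extractBodyText(InputStream packageStream) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(packageStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: the body is stored under this entry name.
                if ("content.xml".equals(entry.getName())) {
                    return parseText(readFully(zip));
                }
            }
        }
        return "";
    }

    /** Copies the current zip entry into memory so the parser cannot close the zip stream. */
    private static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Collects all character data, inserting a space at each element end. */
    private static String parseText(byte[] xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                text.append(' '); // keep adjacent paragraphs from running together
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // The file may reference a DTD that is not available locally;
                // substitute an empty one instead of trying to fetch it.
                return new InputSource(new StringReader(""));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), handler);
        return text.toString().trim();
    }
}
```

The returned string could then be handed to whatever indexing code is already in place. Handling the metadata entry separately, as suggested above, would follow the same pattern with a different entry name.
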
>>
>>(2) use the UDK (http://udk.openoffice.org/). Drawbacks: even though it
>>seems to be documented, its sheer size will make it a bit hard to get
>>into. You will also have to deploy a large library which is not pure Java.
>>Advantages: you will get not only the SO/OOo documents parsed as well as
>>the programs parse them themselves, but also everything they can import.
>>And that will be way better than anything we could get so far for Word
>>documents. A UDK-based document parser would most likely be the killer
>>application for enterprise document indexing -- you wouldn't need much
>>more, if anything at all.
>>
>>We might actually still go for (1) since that is really easy, but we
>>don't have the time for (2). Although we'd love to have it, so if you go
>>for it tell us :-)
>>
>>HTH,
>>   Peter
>>
>>
>>
>>
>>Oscar Herrera wrote:
>>
>>>Hi. I'd like to know if some of you could help me find a Spanish
>>>analyzer (free if possible). I'd also like to know how I can index a file
>>>made with StarOffice 5.x (.sdw and .sdx files). I've been looking on Google
>>>for them but I have not found anything about this.
>>>
>>>Thank you in advance for your collaboration,
>>>
>>>Oscar Herrera
>>>Bogotá, Colombia, SA.
>>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>

