lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bruce Altner <balt...@hq.nasa.gov>
Subject RE: MS Word Search ??
Date Thu, 30 May 2002 19:44:19 GMT
This is a good lead but it prompts me to ask this: if tools like openoffice 
and others (like Acrobat Distiller) know how to reformat Excel, PowerPoint, 
and Word it means that the data formats of these files, as streams, must be 
public knowledge. If so, where do you get this information? I would use it 
to try to build my own parser with JavaCC if I knew what bytes in the the 
input stream I needed to tokenize.

Obviously it's not that easy, otherwise we wouldn't need projects like POI. 
But is this correct, at least conceptually?

Disclaimer: I'm new to JavaCC but the topic interests me so I'm willing to 
put in the time to learn it.

Bruce

At 10:50 AM 5/30/2002 -0500, you wrote:
>This might be worth looking into for those who need to parse word, excel,
>powerpoint, or other MS file types of microsofts.
>
>openoffice - www.openoffice.org knows how to parse all of the microsoft
>formats (at least all that I've tried so far) - and then, you can a do a
>save as, and write out the open office format, which is a couple of xml
>files zipped together.  So, this makes me think of two possible ways that
>you could get at the content of the MS files in a text form you can index
>(neither of which I have tried or even looked to see if they are possible)
>
>#1 - get the code for openoffice - it is open source - and use it for
>parsing the MS documents into xml which could then be indexed
>
>#2 - if open office is programmatically drivable (which I don't know if it
>is), fire up a copy of open office and use it to convert the files as
>necessary.
>
>Just some suggestions.  Does anyone know much more about openoffice?  I
>would be interested in knowing if either of these would be feasible.
>
>Dan
>
>
>
>
>-----Original Message-----
>From: Ewout Prangsma [mailto:e.prangsma@daisysoftware.com]
>Sent: Wednesday, May 29, 2002 1:00 PM
>To: Lucene Users List
>Subject: Re: MS Word Search ??
>
>
>Op Wednesday 29 May 2002 11:56, Karl Øie schreef:
> > b: convert the documents to something that is accessable through java like
> > xml, etc...
>
>We're using wvWare (wvware.com) to convert word to html (or text) and index
>that and xpdf for converting PDF to text and index that. Any links on
>indexing using POI converters (or other java converters) are very welcome!
>
>Ewout
>
> >
> > the best way is to convert as the java api's for MSOffice documents still
> > are under development
> >
> > mvh karl øie
> >
> > On Wednesday 29 May 2002 11:48, Rama Krishna wrote:
> > > Hi,
> > >
> > > I am trying to build a search engine which search in MS Word, excel, ppt
> > > and adobe pdf. I am not sure whether i can use Lucene for this or not.
> > > pl. help me out in this regard.
> > >
> > >
> > > Regards,
> > > Ramakrishna
> > >
> > >
> > > _________________________________________________________________
> > > Chat with friends online, try MSN Messenger: http://messenger.msn.com
>
>--
>Ewout Prangsma, Directeur
>Daisy Software
>Telefoon/fax: +31-77-3270305/3270306
>Email: e.prangsma@daisysoftware.com
>Website: www.daisysoftware.com
>KvK Venlo nr. 12046144
>
>
>
>
>--
>To unsubscribe, e-mail:
><mailto:lucene-user-unsubscribe@jakarta.apache.org>
>For additional commands, e-mail:
><mailto:lucene-user-help@jakarta.apache.org>
>
>--
>To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
>For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>



--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message