lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrew C. Oliver" <acoli...@apache.org>
Subject Re: MS Word Search ??
Date Thu, 30 May 2002 20:46:42 GMT
Bruce Altner wrote:

> This is a good lead but it prompts me to ask this: if tools like 
> openoffice and others (like Acrobat Distiller) know how to reformat 
> Excel, PowerPoint, and Word it means that the data formats of these 
> files, as streams, must be public knowledge. If so, where do you get 
> this information? I would use it to try to build my own parser with 
> JavaCC if I knew what bytes in the the input stream I needed to tokenize.

I'm not sure what JavaCC has to do with word files.

>
> Obviously it's not that easy, otherwise we wouldn't need projects like 
> POI. But is this correct, at least conceptually?

POI is a bit more ambitious than that.  We look to create full APIs for 
read/write and modify, not just convert to ASCII.

>
> Disclaimer: I'm new to JavaCC but the topic interests me so I'm 
> willing to put in the time to learn it.

It would be easier to take Ryan's early code for POI::HDF and create a 
parser from it.  Once HDF is in Beta or so I plan to do so.

-Andy

>
> Bruce
>
> At 10:50 AM 5/30/2002 -0500, you wrote:
>
>> This might be worth looking into for those who need to parse word, 
>> excel,
>> powerpoint, or other MS file types of microsofts.
>>
>> openoffice - www.openoffice.org knows how to parse all of the microsoft
>> formats (at least all that I've tried so far) - and then, you can a do a
>> save as, and write out the open office format, which is a couple of xml
>> files zipped together.  So, this makes me think of two possible ways 
>> that
>> you could get at the content of the MS files in a text form you can 
>> index
>> (neither of which I have tried or even looked to see if they are 
>> possible)
>>
>> #1 - get the code for openoffice - it is open source - and use it for
>> parsing the MS documents into xml which could then be indexed
>>
>> #2 - if open office is programmatically drivable (which I don't know 
>> if it
>> is), fire up a copy of open office and use it to convert the files as
>> necessary.
>>
>> Just some suggestions.  Does anyone know much more about openoffice?  I
>> would be interested in knowing if either of these would be feasible.
>>
>> Dan
>>
>>
>>
>>
>> -----Original Message-----
>> From: Ewout Prangsma [mailto:e.prangsma@daisysoftware.com]
>> Sent: Wednesday, May 29, 2002 1:00 PM
>> To: Lucene Users List
>> Subject: Re: MS Word Search ??
>>
>>
>> Op Wednesday 29 May 2002 11:56, Karl Øie schreef:
>> > b: convert the documents to something that is accessable through 
>> java like
>> > xml, etc...
>>
>> We're using wvWare (wvware.com) to convert word to html (or text) and 
>> index
>> that and xpdf for converting PDF to text and index that. Any links on
>> indexing using POI converters (or other java converters) are very 
>> welcome!
>>
>> Ewout
>>
>> >
>> > the best way is to convert as the java api's for MSOffice documents 
>> still
>> > are under development
>> >
>> > mvh karl øie
>> >
>> > On Wednesday 29 May 2002 11:48, Rama Krishna wrote:
>> > > Hi,
>> > >
>> > > I am trying to build a search engine which search in MS Word, 
>> excel, ppt
>> > > and adobe pdf. I am not sure whether i can use Lucene for this or 
>> not.
>> > > pl. help me out in this regard.
>> > >
>> > >
>> > > Regards,
>> > > Ramakrishna
>> > >
>> > >
>> > > _________________________________________________________________
>> > > Chat with friends online, try MSN Messenger: 
>> http://messenger.msn.com
>>
>> -- 
>> Ewout Prangsma, Directeur
>> Daisy Software
>> Telefoon/fax: +31-77-3270305/3270306
>> Email: e.prangsma@daisysoftware.com
>> Website: www.daisysoftware.com
>> KvK Venlo nr. 12046144
>>
>>
>>
>>
>> -- 
>> To unsubscribe, e-mail:
>> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
>> For additional commands, e-mail:
>> <mailto:lucene-user-help@jakarta.apache.org>
>>
>> -- 
>> To unsubscribe, e-mail:   
>> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
>> For additional commands, e-mail: 
>> <mailto:lucene-user-help@jakarta.apache.org>
>
>
>
>
> -- 
> To unsubscribe, e-mail:   
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: 
> <mailto:lucene-user-help@jakarta.apache.org>
>
>




--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message