lucene-java-user mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: lucene-user Digest 24 May 2002 14:31:22 -0000 Issue 105
Date Fri, 24 May 2002 18:44:24 GMT
>
> ------------------------------------------------------------------------
>
> Subject: powerpoint: sometimes it works...sometimes it doesn't
> From: Bruce Altner <baltner@hq.nasa.gov>
> Date: Wed, 22 May 2002 20:13:46 -0400
> To: lucene-user@jakarta.apache.org
>
>
> Greetings:
>
> I am brand new to Lucene so please forgive me if the following is too 
> naive for polite replies...
>
> I have built a web-based application to schedule and archive brown bag 
> talks. The system uses a database
> for the scheduling and searching by title, author, topic, abstract, 
> etc. but I want to add full text searching of the powerpoint
> files actually presented during the seminars.
>
> So I ran a quick index (using the demo API) on the ppt file of  a past 
> talk I'd given and Lucene handled it very well, finding hits 95% of 
> the time. I was quite impressed and excited about the possibilities 
> but then I indexed a more recent talk and Lucene failed completely, 
> never once finding a term.
>
> Any idea why it would work on one ppt file but not on another? The 
> first was created using powerpoint from Office 97 and the latter 
> (failed) example from Office 2000 so that's a strong possibility but I 
> wanted to run this by folks on the list for opinions.
>
> Thanks!
>
> Bruce
>
> PS My brown bag app is my second go-round with the Jakarta Turbine 
> framework. Ain't open source great!
>

If I understand correctly what you did, it is pure chance that it worked 
in the first place. As far as I know, Lucene has no built-in parsers for 
any document type, PPT included. It treats every stream or String you 
give it as plain English text (you can change the language by selecting 
a different stemmer, but it will still assume plain text). Various 
efforts are underway to build a framework for plugging in different 
document parsers. There is also a project (on Apache?) with an OLE 
parser written in Java that might be suitable for working with PPT 
files. My guess is that the first file you tried stored its text as 
plain strings, whereas the second stored it in some other encoding (or 
compressed / encrypted). That's all I can say without knowing more about 
the PPT file format. If you can find a program or library that extracts 
text from a PPT file, you should then be able to index that text with 
Lucene easily. This may not be as elegant as the eventual solution from 
the project I mentioned above, but it is how everyone uses Lucene with 
various document types today.
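
To illustrate the guess about encodings, here is a minimal sketch in plain 
Java (no Lucene involved). It assumes, purely for illustration, that one 
file stores its slide text as 8-bit characters and the other as 16-bit 
little-endian characters; I do not know that this is what the PPT format 
actually does. The class PlainTextScan and its methods are made up for 
this example and only imitate what a plain-text tokenizer would see when 
handed raw bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PlainTextScan {

    // Reads raw bytes as if they were 8-bit plain text and splits on every
    // byte that is not an ASCII letter or digit. This is a hypothetical
    // stand-in for a plain-text tokenizer, not Lucene's own analyzer.
    public static List<String> naiveTokens(byte[] raw) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : raw) {
            char c = (char) (b & 0xFF);
            boolean asciiLetterOrDigit = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (asciiLetterOrDigit) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // True if the lower-cased term shows up as a whole token in the raw bytes.
    public static boolean containsTerm(byte[] raw, String term) {
        return naiveTokens(raw).contains(term.toLowerCase());
    }

    public static void main(String[] args) {
        String slideText = "Lucene indexing overview";

        // Assumption: file one keeps its text as 8-bit characters.
        byte[] eightBit = slideText.getBytes(StandardCharsets.ISO_8859_1);
        // Assumption: file two keeps the same text as 16-bit characters.
        byte[] sixteenBit = slideText.getBytes(StandardCharsets.UTF_16LE);

        System.out.println("8-bit storage:  " + containsTerm(eightBit, "lucene"));
        System.out.println("16-bit storage: " + containsTerm(sixteenBit, "lucene"));
    }
}
```

In the 16-bit case every letter is followed by a zero byte, so the scan 
only ever sees one-letter tokens and the term is never found. That would 
match the "works on one file, fails completely on the other" behaviour, 
and it is why extracting the text first and indexing the extracted String 
is the safer route.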

Good luck.
Dmitry.




