lucene-java-user mailing list archives

From "Tomi NA" <hef...@gmail.com>
Subject Re: Indexing MS Powerpoint files with Lucene
Date Fri, 08 Sep 2006 07:56:37 GMT
On 9/7/06, Andrzej Bialecki <ab@getopt.org> wrote:
> Tomi NA wrote:
> > On 9/7/06, Nick Burch <nick@torchbox.com> wrote:
> >> On Thu, 7 Sep 2006, Tomi NA wrote:
> >> > On 9/7/06, Venkateshprasanna <prasannahmv@yahoo.co.in> wrote:
> >> >> Is there any filter available for extracting text from MS
> >> >> Powerpoint files and indexing them?
> >> >> The lucene website suggests the POI project, which, it seems does not
> >> >> support PPT files as of now.
> >> >
> >> > http://jakarta.apache.org/poi/hslf/index.html
> >> >
> >> > It doesn't say poi doesn't support ppt. It just says support is
> >> > limited. Don't know exactly how limited, but certainly not useless
> >> > for indexing purposes.
> >>
> >> Support for editing and adding things to PowerPoint files is limited, as
> >> is getting out the finer points of fonts and positioning.
> >
> > Which brings me to another (off)topic: can lucene/nutch assign
> > different weights to tokens in the same document field? An obvious
> > example would be: "this text seems to be in large, bold, blinking
> > letters: I'll assume it's more important than the surrounding 8px
> > text."
>
> No, it can't (at least not yet). As a workaround you can extract these
> portions of text to another field (or multiple fields), and then add
> them with a higher boost. Then, expand your queries so that they also
> include this field. This way, if a query matches these special tokens,
> results will get a higher rank because of the match on this boosted field.

I thought a workaround like that would be needed. Still, it could give
useful results, though as a Nutch user the possibility is mostly
theoretical for me: probably none of the existing parsers take
formatting information into account. I could be completely wrong here,
so please feel free to correct me.
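
To make the quoted workaround concrete, here is a minimal sketch in plain
Java. It is not Lucene's or Nutch's API: the class BoostedFieldSketch, the
Doc type, the boost value of 3.0 and the raw term-count scoring are all
made up for illustration, and it assumes some parser has already separated
the emphasized text from the rest. It only shows why a match in a
separately stored, boosted field pushes a document up the ranking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model of the "extra boosted field" workaround. Hypothetical names
// throughout; real Lucene scoring is considerably more involved.
final class BoostedFieldSketch {

    // Assumed boost for the field that holds emphasized text.
    static final double EMPHASIS_BOOST = 3.0;

    // A document with two fields: the full text, and a copy of whatever
    // the (hypothetical) parser flagged as large/bold.
    static final class Doc {
        final String name;
        final String body;
        final String emphasized;

        Doc(String name, String body, String emphasized) {
            this.name = name;
            this.body = body;
            this.emphasized = emphasized;
        }
    }

    // Counts case-insensitive occurrences of a single term in a text.
    static int termFrequency(String text, String term) {
        int count = 0;
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.equals(term.toLowerCase())) {
                count++;
            }
        }
        return count;
    }

    // The "expanded query": the term is looked up in both fields, and
    // matches in the emphasized field are weighted by the boost.
    static double score(Doc doc, String term) {
        return termFrequency(doc.body, term)
                + EMPHASIS_BOOST * termFrequency(doc.emphasized, term);
    }

    // Returns a new list with the highest-scoring documents first.
    static List<Doc> rank(List<Doc> docs, String term) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((Doc d) -> score(d, term)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("small-print", "lucene appears twice in small print: lucene", ""),
                new Doc("headline", "a slide whose title mentions lucene", "lucene"));
        for (Doc d : rank(docs, "lucene")) {
            System.out.println(d.name + " " + score(d, "lucene"));
        }
    }
}
```

In this toy model the body field holds the full text and the emphasized
field holds a copy of the flagged portion, so an emphasized term counts
once in each. Whether to copy or move the text is a design choice the
quoted advice leaves open.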

t.n.a.
