lucene-solr-dev mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: Running Unit Tests from inside Eclipse
Date Thu, 28 Jun 2007 15:29:51 GMT
On 6/28/07, Eric Pugh <epugh@opensourceconnections.com> wrote:
> > >  I have a PDF handler modeled on the CSVHandler that allows
> > > you to stream a PDF document to Solr and extract the text and store
> > > it.
> >
> > Cool!
> >
> > Any thoughts of a general framework for going from unstructured
> > document -> lucene document with fields?  It feels like utilizing
> > Apache Tika here would be the way to go (although it's in the really
> > early stages).
> >
> > -Yonik
> >
> Hmm...  So I have a PDF, Word, Excel, and PowerPoint, all as separate
> handlers.  And there is a lot of duplication between them...  I may
> try and pull out the common stuff into some sort of
> AbstractRichDocumentHandler, and then just add the special sauce for
> each one.   I am close to having the basic unit tests, modeled on
> CSVHandler, and will post a JIRA issue with it.
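
For what it's worth, here is a rough sketch of how that split could look. It is only an illustration of the idea, not code from Solr or from Eric's handlers: the extractText/toFields method names, the "id" and "text" field names, and the PlainTextHandler subclass are all made up for the example, and a real handler would plug into Solr's request handler API rather than return a Map.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical base class: the steps shared by every format live here.
abstract class AbstractRichDocumentHandler {

    // The "special sauce": each format supplies its own text extraction.
    protected abstract String extractText(InputStream in) throws IOException;

    // Common step: turn the extracted text into named fields.
    // The field names "id" and "text" are assumptions for this sketch.
    public final Map<String, String> toFields(String id, InputStream in) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", id);
        fields.put("text", extractText(in).trim());
        return fields;
    }
}

// Stand-in subclass so the sketch runs without a PDF or Office parser.
// A PDF or Word handler would override extractText with a real parser call.
class PlainTextHandler extends AbstractRichDocumentHandler {
    @Override
    protected String extractText(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}
```

With this shape, adding a new format means writing one extractText method, and the field mapping stays in a single place.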

Another thing to consider is document type/charset/language detection.
People may not want to have to hit a different URL for each different
type of document.
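
As a minimal sketch of the type detection part (charset and language detection are harder and not shown), a single entry point could look at the first bytes of the stream and dispatch from there. The DocumentTypeSniffer class and its Kind enum are hypothetical names for this example. The signatures themselves are the standard ones: PDF files begin with "%PDF", and the legacy binary Word, Excel, and PowerPoint formats all share the OLE2 compound file header, so this check alone cannot tell those three apart.

```java
// Hypothetical helper: guesses a document family from its leading bytes.
final class DocumentTypeSniffer {

    enum Kind { PDF, OLE2, UNKNOWN }

    // "%PDF" at the start of a PDF file.
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F'};

    // OLE2 compound file header used by legacy .doc, .xls and .ppt files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    private DocumentTypeSniffer() {}

    // head: the first few bytes of the uploaded stream.
    static Kind sniff(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) return Kind.PDF;
        if (startsWith(head, OLE2_MAGIC)) return Kind.OLE2;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A handler behind one URL could call sniff on the first eight bytes and pick the matching extractor, falling back to an error or a client-supplied content type for UNKNOWN.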

> I looked for Tika, but didn't see it, what is the URL?

It's *really* early (entered the incubator in March)
http://incubator.apache.org/tika/
http://www.nabble.com/Apache-Tika---Development-f20913.html


-Yonik
