lucene-java-user mailing list archives

From "Steven D. Majewski" <sd...@virginia.edu>
Subject Re: Pdf in Lucene?
Date Mon, 01 Dec 2008 22:37:50 GMT

On Dec 1, 2008, at 8:22 AM, Grant Ingersoll wrote:

>
> On Dec 1, 2008, at 8:01 AM, tiziano bernardi wrote:
>
>>
>> I tried to use PDFBox, but it gives me an error saying that the
>> versions of Lucene and PDFBox are incompatible.
>
> Lucene knows nothing about PDFBox, so I don't see how they could be  
> incompatible, unless you are referring to PDFBox's Lucene Document  
> creator, in which case, you should ask on the PDFBox mailing list.   
> I think, however, that it's pretty straightforward to create a  
> Lucene document from PDFBox, so you shouldn't need to rely on their  
> version.
>
> Personally, I'd have a look at Tika (http://lucene.apache.org/tika),  
> which wraps PDFBox (and other extraction libraries) and gives you  
> back SAX-like events via a ContentHandler, which you can then use to  
> create Lucene documents.  Else, I've been working on SOLR-284, which  
> integrates Tika into Solr, see https://issues.apache.org/jira/browse/SOLR-284
>
> -Grant
>


And for something out-of-the-box, you might also look at XTF:

	http://www.cdlib.org/inside/projects/xtf/

which will index and display text, HTML, PDF (using PDFBox), and
several XML text formats (TEI, EAD, ...),
or you can look at the sources to see how they use PDFBox.
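
To make Grant's ContentHandler suggestion a bit more concrete, here is a minimal sketch of the handler side only. It is an illustration under assumptions, not Tika's or XTF's actual code: the class name TextCollector and the extract() helper are made up for this example, and the JDK's own SAX parser reading a small XHTML string stands in for Tika, which would be the thing firing the events in a real setup. The collected string is what you would then put into a Lucene field.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: gathers the character data of SAX events into one string.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text arrives in chunks; append each chunk as it comes.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Append a space at every element end so adjacent paragraphs do not run together.
        text.append(' ');
    }

    public String getText() {
        // Collapse runs of whitespace and trim the ends.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    // Assumption: the JDK SAX parser stands in for the extraction library here.
    static String extract(String xhtml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello PDF</p><p>second page</p></body></html>";
        // The resulting string is what would go into a Lucene text field.
        System.out.println(extract(xhtml)); // prints "Hello PDF second page"
    }
}
```

With a real extraction library you would pass the same handler to its parse call instead of to the JDK parser; check that library's documentation for the exact signature.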

-- Steve Majewski



>>
>> I use pdf box 0.7.3 and lucene 2.1.0
>>
>> > Date: Mon, 1 Dec 2008 11:43:00 +0000
>> > From: ian.lea@gmail.com
>> > To: java-user@lucene.apache.org
>> > Subject: Re: Pdf in Lucene?
>> >
>> > Hi
>> >
>> > Lucene only indexes text so you'll have to get the text out of the PDF
>> > and feed it to lucene.
>> >
>> > Google for lucene pdf, or go straight to http://www.pdfbox.org/
>> >
>> > --
>> > Ian.
>> >
>> > 2008/12/1 tiziano bernardi <dk1982@hotmail.it>:
>> > >
>> > > Hi,
>> > > I want to index PDF files with lucene is possible?
>> > > What like?
>> > > Thanks Tiziano Bernardi
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

