lucene-java-user mailing list archives

From "Kalani Ruwanpathirana" <kala...@gmail.com>
Subject Re: Pdf in Lucene?
Date Thu, 04 Dec 2008 09:49:26 GMT
Hi,

In my case I used PDFBox only to extract the text from the PDF document,
and then I created the Lucene document myself from the extracted text. (I
didn't use PDFBox's built-in Lucene search engine classes.) So I didn't run
into any incompatibility problems.
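
The incompatibility in the quoted thread appears as a NoSuchMethodError for Document.add(Field): the Lucene jar on the classpath has no method with the signature that PDFBox's LucenePDFDocument was compiled against. A small reflection check along the following lines can confirm that before running the indexer. This is only a sketch: SignatureCheck and hasMethod are made-up names, and the check compares parameter types only, not the return type that the JVM also matches.

```java
import java.lang.reflect.Method;

// Hypothetical helper (not part of Lucene or PDFBox): reports whether a class
// on the classpath exposes a public method with exactly these parameter types.
final class SignatureCheck {
    private SignatureCheck() {}

    // Parameter types are given as fully qualified class names; primitive
    // types are not handled in this sketch.
    public static boolean hasMethod(String className, String methodName,
                                    String... paramTypeNames) {
        try {
            Class<?> owner = Class.forName(className);
            Class<?>[] paramTypes = new Class<?>[paramTypeNames.length];
            for (int i = 0; i < paramTypeNames.length; i++) {
                paramTypes[i] = Class.forName(paramTypeNames[i]);
            }
            // getMethod matches the declared parameter types exactly.
            Method found = owner.getMethod(methodName, paramTypes);
            return found != null;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            // Class missing from the classpath, or no such public method.
            return false;
        }
    }

    public static void main(String[] args) {
        // The method named in the quoted stack trace.
        boolean present = hasMethod("org.apache.lucene.document.Document", "add",
                "org.apache.lucene.document.Field");
        System.out.println(present
                ? "Document.add(Field) is present"
                : "Document.add(Field) is missing, or Lucene is not on the classpath");
    }
}
```

If the check reports the method as missing while Lucene is on the classpath, the PDFBox Lucene classes and the Lucene jar were built for different versions, and building the Document yourself avoids the problem.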

This blog post shows how I did it:
http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.html

It worked perfectly for me.

Thanks.

On Tue, Dec 2, 2008 at 2:33 PM, tiziano bernardi <dk1982@hotmail.it> wrote:

>
>
> This is the exception:
> Exception in thread "main" java.lang.NoSuchMethodError: org.apache.lucene.document.Document.add(Lorg/apache/lucene/document/Field;)V
>     at org.pdfbox.searchengine.lucene.LucenePDFDocument.addUnindexedField(LucenePDFDocument.java:224)
>     at org.pdfbox.searchengine.lucene.LucenePDFDocument.convertDocument(LucenePDFDocument.java:265)
>     at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:377)
>     at SimplePdfSearch.main(SimplePdfSearch.java:30)
>
> I thank you for the time you spent
>
> > From: gsingers@apache.org
> > To: java-user@lucene.apache.org
> > Subject: Re: Pdf in Lucene?
> > Date: Mon, 1 Dec 2008 17:40:12 -0500
> >
> > I certainly don't either, since you haven't said what the actual
> > exception is. If I had to guess, though, I would say it is the line
> >
> >     Document document = LucenePDFDocument.getDocument
> >
> > And that the Lucene library expected by PDFBox is not the same version
> > of Lucene you are using. I would suggest not relying on PDFBox to
> > create your document, and instead look at the PDFBox calls that you
> > need to make to then create your Document.
> >
> > On Dec 1, 2008, at 9:18 AM, tiziano bernardi wrote:
> >
> > > this is my class, I use eclipse and I haven't any errors. Do not
> > > understand where the problem ....
> > >
> > > import java.io.File;
> > > import java.io.IOException;
> > >
> > > import org.apache.lucene.analysis.Analyzer;
> > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > import org.apache.lucene.document.Document;
> > > import org.apache.lucene.index.IndexWriter;
> > > import org.apache.lucene.index.Term;
> > > import org.apache.lucene.search.Hits;
> > > import org.apache.lucene.search.IndexSearcher;
> > > import org.apache.lucene.search.Query;
> > > import org.apache.lucene.search.TermQuery;
> > > import org.apache.lucene.store.Directory;
> > > import org.apache.lucene.store.RAMDirectory;
> > > import org.pdfbox.searchengine.lucene.LucenePDFDocument;
> > >
> > > public final class SimplePdfSearch
> > > {
> > >     private static final String PDF_FILE_PATH = "C:\\Users\\Tiziano\\Desktop\\doc_di_prova\\prova.pdf";
> > >     private static final String SEARCH_TERM = "prova";
> > >
> > >     public static final void main(String[] args) throws IOException
> > >     {
> > >         Directory directory = null;
> > >
> > >         try
> > >         {
> > >             File pdfFile = new File(PDF_FILE_PATH);
> > >             Document document = LucenePDFDocument.getDocument(pdfFile);
> > >
> > >             directory = new RAMDirectory();
> > >
> > >             IndexWriter indexWriter = null;
> > >
> > >             try
> > >             {
> > >                 Analyzer analyzer = new StandardAnalyzer();
> > >                 indexWriter = new IndexWriter(directory, analyzer, true);
> > >
> > >                 indexWriter.addDocument(document);
> > >             }
> > >             finally
> > >             {
> > >                 if (indexWriter != null)
> > >                 {
> > >                     try
> > >                     {
> > >                         indexWriter.close();
> > >                     }
> > >                     catch (IOException ignore)
> > >                     {
> > >                         // Ignore
> > >                     }
> > >
> > >                     indexWriter = null;
> > >                 }
> > >             }
> > >
> > >             IndexSearcher indexSearcher = null;
> > >
> > >             try
> > >             {
> > >                 indexSearcher = new IndexSearcher(directory);
> > >
> > >                 Term term = new Term("contents", SEARCH_TERM);
> > >                 Query query = new TermQuery(term);
> > >
> > >                 Hits hits = indexSearcher.search(query);
> > >
> > >                 System.out.println((hits.length() != 0) ? "Found" : "Not Found");
> > >             }
> > >             finally
> > >             {
> > >                 if (indexSearcher != null)
> > >                 {
> > >                     try
> > >                     {
> > >                         indexSearcher.close();
> > >                     }
> > >                     catch (IOException ignore)
> > >                     {
> > >                         // Ignore
> > >                     }
> > >
> > >                     indexSearcher = null;
> > >                 }
> > >             }
> > >         }
> > >         finally
> > >         {
> > >             if (directory != null)
> > >             {
> > >                 try
> > >                 {
> > >                     directory.close();
> > >                 }
> > >                 catch (IOException ignore)
> > >                 {
> > >                     // Ignore
> > >                 }
> > >
> > >                 directory = null;
> > >             }
> > >         }
> > >     }
> > > }
> > >
> > > > From: gsingers@apache.org
> > > > To: java-user@lucene.apache.org
> > > > Subject: Re: Pdf in Lucene?
> > > > Date: Mon, 1 Dec 2008 08:22:58 -0500
> > > >
> > > > On Dec 1, 2008, at 8:01 AM, tiziano bernardi wrote:
> > > >
> > > > > I tried to use pdfbox but gives me an error.
> > > > > That the version of lucene and the pdfbox are incompatible.
> > > >
> > > > Lucene knows nothing about PDFBox, so I don't see how they could be
> > > > incompatible, unless your are referring to PDFBox's Lucene Document
> > > > creator, in which case, you should ask on the PDFBox mailing list. I
> > > > think, however, that it's pretty straightforward to create a Lucene
> > > > document from PDFBox, so you shouldn't need to rely on their version.
> > > >
> > > > Personally, I'd have a look at Tika (http://lucene.apache.org/tika),
> > > > which wraps PDFBox (and other extraction libraries) and gives you
> > > > back SAX-like events via a ContentHandler, which you can then use to
> > > > create Lucene documents. Else, I've been working on SOLR-284, which
> > > > integrates Tika into Solr, see
> > > > https://issues.apache.org/jira/browse/SOLR-284
> > > >
> > > > -Grant
> > > >
> > > > > I use pdf box 0.7.3 and lucene 2.1.0
> > > > >
> > > > > > Date: Mon, 1 Dec 2008 11:43:00 +0000
> > > > > > From: ian.lea@gmail.com
> > > > > > To: java-user@lucene.apache.org
> > > > > > Subject: Re: Pdf in Lucene?
> > > > > >
> > > > > > Hi
> > > > > >
> > > > > > Lucene only indexes text so you'll have to get the text out of
> > > > > > the PDF and feed it to lucene.
> > > > > >
> > > > > > Google for lucene pdf, or go straight to http://www.pdfbox.org/
> > > > > >
> > > > > > --
> > > > > > Ian.
> > > > > >
> > > > > > 2008/12/1 tiziano bernardi <dk1982@hotmail.it>:
> > > > > > > Hi,
> > > > > > > I want to index PDF files with lucene is possible?
> > > > > > > What like?
> > > > > > > Thanks Tiziano Bernardi
> >
> > --------------------------
> > Grant Ingersoll
> >
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
>
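
The Tika suggestion quoted above relies on SAX-style events delivered to a ContentHandler. As a rough sketch of that pattern only, the handler below gathers character events into a single string, which could then be stored in a Lucene field. The names TextCollector and collect are made up for illustration, and the JDK's own SAX parser stands in for Tika here, so no Tika API is shown.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: DefaultHandler implements org.xml.sax.ContentHandler,
// so the same class could receive events from any SAX-style source.
final class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks; append each one.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Keep words from adjacent elements from running together.
        text.append(' ');
    }

    // Collected text with whitespace runs collapsed to single spaces.
    public String getText() {
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    // Demo driver: parses an XML string with the JDK parser (a stand-in for Tika).
    public static String collect(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        System.out.println(collect("<doc><p>Indexing PDF</p><p>with Lucene</p></doc>"));
    }
}
```

The string returned by getText() plays the same role as the text extracted directly with PDFBox: it is what gets added to the Lucene document.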



-- 
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa
