lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "LuceneFAQ" by ChasEmerick
Date Wed, 02 Dec 2009 10:21:30 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The "LuceneFAQ" page has been changed by ChasEmerick.
The comment on this change is: added link to PDFTextStream library for indexing PDF documents,
reordered such libraries so native-Java libs are listed first.
http://wiki.apache.org/lucene-java/LuceneFAQ?action=diff&rev1=140&rev2=141

--------------------------------------------------

  
  [[http://pdfbox.org/|PDFBox]] is a Java API from Ben Litchfield that will let you access
the contents of a PDF document. It comes with integration classes for Lucene to translate
a PDF into a Lucene document.
  
+ [[http://www.jpedal.org/|JPedal]] is a Java API for extracting text and images from PDF
documents.
+ 
+ [[http://snowtide.com/|PDFTextStream]] is a Java API for extracting text, metadata, and form data from PDF documents. It also comes with an [[http://snowtide.com/easy_lucene_integration|integration module]] making it easier to convert a PDF document into a Lucene document.
+ 
  [[http://www.foolabs.com/xpdf/|XPDF]] is an open source tool that is licensed under the
GPL. It's not a Java tool, but there is a utility called pdftotext that can translate PDF
files into text files on most platforms from the command line.
  
  Based on xpdf, there is a utility called [[http://pdftohtml.sourceforge.net/|pdftohtml]]
that can translate PDF files into HTML files. This is also not a Java application.
- 
- [[http://www.jpedal.org/|JPedal]] is a Java API for extracting text and images from PDF
documents.
  
  
  ==== How can I index JSP files? ====
