lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: New xdoc for parsers
Date Tue, 08 Apr 2003 14:55:20 GMT
Thanks Jeff, I put all this in the Lucene FAQ at jGuru.

Otis

--- Jeff Linwood <jeff@greenninja.com> wrote:
> Since this question comes up all the time on the users list, and the FAQ
> entry is umm...unhelpful :) I created an xdoc listing all the parsers I
> knew about.
> 
> Yes, I know this sort of duplicates the resources.xml xdoc, but this is
> more descriptive, and clearer from the menu as to what it is. I hope.
> 
> I also included a patch to the project.xml stylesheet if this 
> contribution gets accepted.
> 
> Thanks,
> Jeff Linwood
> <?xml version="1.0"?>
> <document>
>     <properties>
>     <author email="jeff@greenninja.com">Jeff Linwood</author>
>     <title>Parsers - Jakarta Lucene</title>
>     </properties>
>     <body>
> 
>     <section name="Introduction">
>         <p>
>         Lucene can index documents in any file format, but the
>         application that uses the Lucene search engine is responsible
>         for translating each document type into a format that Lucene
>         can understand. Several of these formats can be parsed with
>         open source or free tools, and links to the appropriate sites
>         are given below. Many Lucene users use more than one of these
>         in their applications.
>         </p>
>     </section>
>     
>     <section name="HTML">
>         <subsection name="JavaCC and IndexHTML">
>         	An example that uses JavaCC to parse HTML into Lucene
> Document objects is provided in the <a href="demo3.html">Lucene web
>         	application demo</a> that comes with the Lucene
> distribution.
>         </subsection>
>         <subsection name="NekoHTML">
> 		The <a href="http://www.apache.org/~andyc/neko/doc/html/">CyberNeko
> HTML Parser</a> lets you parse HTML documents. It's 
> 		relatively easy to remove most of the tags from an HTML document
> (or all if you want), and then use the ones you left in
> 		to help create metadata for your Lucene document. NekoHTML also
> provides a DOM model for navigating through the HTML.
>         </subsection>        
>         <subsection name="JTidy">
>         	<a href="http://sourceforge.net/projects/jtidy/">JTidy</a>
> cleans up HTML, and can provide a DOM interface to the HTML
>         	files through a Java API.
>         </subsection>        
>     </section>    
>     
>     <section name="PDF">
>     	<subsection name="PDFBox">
>     		<a href="http://pdfbox.org/">PDFBox</a> is a Java API from Ben
> Litchfield that will let you access the contents of a 
> 	    	PDF document. It comes with integration	classes for Lucene to
> translate a PDF into a Lucene document.
>     	</subsection>
>     	<subsection name="XPDF">
>     		<a href="http://www.foolabs.com/xpdf/">XPDF</a> is an open
> source tool that is licensed under the GPL. It's not a Java
>     		tool, but there is a utility called pdftotext that can
> translate PDF files into text files on most platforms from the
>     		command line.
>     	</subsection>
>     	<subsection name="PDF to HTML">
>     		Based on xpdf, there is a utility called <a
> href="http://pdftohtml.sourceforge.net/">pdftohtml</a> that can
> translate
>     		PDF files into HTML files. This is also not a Java application.
>     	</subsection>
>     	<subsection name="JPedal">
>     		<a href="http://www.jpedal.org/">JPedal</a> is a Java API for
> extracting text and images from PDF documents.
>     	</subsection>    	
>     	<subsection name="TextMining.org">
>     		<a href="http://www.textmining.org/">Simple Text Extractor
> Library</a> for use with PDF documents. Relies on PDFBox.
>     	</subsection>
>     		
>     </section>
> 
>     <section name="XML">
>     	<subsection name="Lucene SAX/DOM indexing Demo">
>     		<a
> href="http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/XML-Indexing-Demo/">XML
> Demo</a>
>     		This contribution is some sample code that demonstrates adding
> simple XML documents into the index. 
>     		It creates a new Document object for each file, and then
> populates the Document with a Field 
>     		for each XML element, recursively. There are examples included
> for both SAX and DOM. 
>     	</subsection>
>     </section>
>     
>     <section name="Word">
>     	<subsection name="POI">
>     		<a href="http://jakarta.apache.org/poi/">Jakarta Apache POI</a>
> has an early-development-level Microsoft Word parser
>     		for versions of Word from Office 97, 2000, and XP.
>     	</subsection>
>     	<subsection name="TextMining.org">
>     		<a href="http://www.textmining.org/">Simple Text Extractor
> Library</a> for use with Word documents. Relies on POI.
>     	</subsection>
>     </section>
>     
>     <section name="Excel">
>     	<subsection name="POI">
>     		<a href="http://jakarta.apache.org/poi/">Jakarta Apache POI</a>
> has an excellent Microsoft Excel parser
>     		for versions of Excel from Office 97, 2000, and XP.  You can
> also modify Excel files with this tool.
>     	</subsection>
>     </section>
> 
>     <section name="RTF - Rich Text Format">
>     	<subsection name="TetraSix MajiX">
>     		<a href="http://www.tetrasix.com/">MajiX</a> is a translation
> utility that will turn RTF (Rich Text Format) files
>     		into XML files. These XML files could be indexed like any other
> XML file, or you could write some custom code. See the 
>     		XML section of this page.
>     	</subsection>
>     </section>
> 
> 
>     </body>
> </document>
> 21a22
> >         <item name="Parsers"           href="/parsers.html"/>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
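
For anyone reading this in the archive: the XML entry in the quoted xdoc describes creating one Document per file and adding a Field for each XML element, recursively. Below is a rough sketch of that recursive walk using only the JDK's DOM parser. It is not the sandbox demo's code. The class name XmlFieldExtractor and its methods are made up for illustration, and the Lucene calls are left out on purpose because the Field API differs between Lucene versions. The sketch just collects name/text pairs that an application could turn into Fields.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Lucene or the sandbox demo.
class XmlFieldExtractor {

    // Parses an XML string and returns {elementName, text} pairs, one for
    // each element that directly contains non-blank text, in document order.
    static List<String[]> extract(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<String[]> fields = new ArrayList<>();
            collect(root, fields);
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Records the element's own text (if any), then recurses into child elements.
    private static void collect(Element element, List<String[]> fields) {
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(child.getNodeValue());
            }
        }
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            fields.add(new String[] {element.getTagName(), value});
        }
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect((Element) child, fields);
            }
        }
    }
}
```

Each returned pair would map to one Field on the Document built for that file. Whether a given element should be stored, tokenized, or skipped is an application decision that this sketch does not try to make.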


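The XPDF entry in the quoted xdoc notes that pdftotext is a command-line utility and not a Java tool, so a Java application would have to run it as an external process before indexing the resulting text file. A minimal sketch of that hand-off follows, assuming pdftotext is installed and on the PATH. The class name PdfToTextRunner and its methods are hypothetical, and no pdftotext options are passed.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; assumes the pdftotext binary is on the PATH.
class PdfToTextRunner {

    // Builds the command line: pdftotext <PDF-file> <text-file>
    static List<String> command(File pdf, File txt) {
        return Arrays.asList("pdftotext", pdf.getPath(), txt.getPath());
    }

    // Runs pdftotext and returns its exit code (0 means success).
    static int run(File pdf, File txt) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(pdf, txt))
                .inheritIO()   // pass the tool's output and errors through to the console
                .start();
        return process.waitFor();
    }
}
```

After a successful run, the text file can be read and indexed like any other plain-text document.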