lucene-dev mailing list archives

From "Jeff Linwood" <j...@greenninja.com>
Subject Re: New xdoc for parsers
Date Tue, 08 Apr 2003 15:36:21 GMT
Cool.

Any reason not to include it as an xdoc as well, though?  The Lucene site is
a little confusing to a new user who might just want to see whether Lucene
can match Inktomi, Index Server, or whatever by supporting Microsoft or PDF
formats.

It also needs a section on indexing JSP files, since that gets asked a lot.

jeff
----- Original Message -----
From: "Otis Gospodnetic" <otis_gospodnetic@yahoo.com>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Tuesday, April 08, 2003 9:55 AM
Subject: Re: New xdoc for parsers


> Thanks Jeff, I put all this in the Lucene FAQ at jGuru.
>
> Otis
>
> --- Jeff Linwood <jeff@greenninja.com> wrote:
> > Since this question comes up all the time on the users list, and the
> > FAQ entry is umm...unhelpful :) I created an xdoc listing all the
> > parsers I knew about.
> >
> > Yes, I know this sort of duplicates the resources.xml xdoc, but this
> > is more descriptive, and clearer from the menu as to what it is. I
> > hope.
> >
> > I also included a patch to the project.xml stylesheet if this
> > contribution gets accepted.
> >
> > Thanks,
> > Jeff Linwood
> > <?xml version="1.0"?>
> > <document>
> >     <properties>
> >     <author email="jeff@greenninja.com">Jeff Linwood</author>
> >     <title>Parsers - Jakarta Lucene</title>
> >     </properties>
> >     <body>
> >
> >     <section name="Introduction">
> >         <p>
> >         Lucene is capable of indexing documents in any file format, but
> >         the application that uses the Lucene search engine is responsible
> >         for translating these document types into a format that Lucene can
> >         understand.  Several of these formats can be indexed with open
> >         source or free solutions, and links are given to the appropriate
> >         sites. Many Lucene users use more than one of these in their
> >         applications.
> >         </p>
> >     </section>
> >
> >     <section name="HTML">
> >         <subsection name="JavaCC and IndexHTML">
> >         An example that uses JavaCC to parse HTML into Lucene Document
> >         objects is provided in the <a href="demo3.html">Lucene web
> >         application demo</a> that comes with the Lucene distribution.
> >         </subsection>
> >         <subsection name="NekoHTML">
> >         The <a href="http://www.apache.org/~andyc/neko/doc/html/">CyberNeko
> >         HTML Parser</a> lets you parse HTML documents. It's relatively easy
> >         to remove most of the tags from an HTML document (or all of them, if
> >         you want), and then use the ones you left in to help create metadata
> >         for your Lucene document. NekoHTML also provides a DOM interface for
> >         navigating through the HTML.
> >         </subsection>
> >         <subsection name="JTidy">
> >         <a href="http://sourceforge.net/projects/jtidy/">JTidy</a> cleans
> >         up HTML, and can provide a DOM interface to the HTML files through
> >         a Java API.
> >         </subsection>
> >     </section>
> >
> >     <section name="PDF">
> >     <subsection name="PDFBox">
> >     <a href="http://pdfbox.org/">PDFBox</a> is a Java API from Ben
> >     Litchfield that lets you access the contents of a PDF document. It
> >     comes with integration classes for Lucene that translate a PDF into a
> >     Lucene document.
> >     </subsection>
> >     <subsection name="XPDF">
> >     <a href="http://www.foolabs.com/xpdf/">XPDF</a> is an open source
> >     tool that is licensed under the GPL. It's not a Java tool, but it
> >     includes a command-line utility called pdftotext that translates PDF
> >     files into text files on most platforms.
> >     </subsection>
> >     <subsection name="PDF to HTML">
> >     Based on xpdf, there is a utility called
> >     <a href="http://pdftohtml.sourceforge.net/">pdftohtml</a> that can
> >     translate PDF files into HTML files. This is also not a Java
> >     application.
> >     </subsection>
> >     <subsection name="JPedal">
> >     <a href="http://www.jpedal.org/">JPedal</a> is a Java API for
> >     extracting text and images from PDF documents.
> >     </subsection>
> >     <subsection name="TextMining.org">
> >     The <a href="http://www.textmining.org/">Simple Text Extractor
> >     Library</a> is for use with PDF documents. It relies on PDFBox.
> >     </subsection>
> >
> >     </section>
> >
> >     <section name="XML">
> >     <subsection name="Lucene SAX/DOM indexing Demo">
> >     <a href="http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/XML-Indexing-Demo/">XML Demo</a>
> >     This contribution is sample code that demonstrates adding simple XML
> >     documents to the index. It creates a new Document object for each
> >     file, and then populates the Document with a Field for each XML
> >     element, recursively. Examples are included for both SAX and DOM.
> >     </subsection>
> >     </section>
> >
> >     <section name="Word">
> >     <subsection name="POI">
> >     <a href="http://jakarta.apache.org/poi/">Jakarta Apache POI</a> has
> >     an early-development-level Microsoft Word parser for versions of Word
> >     from Office 97, 2000, and XP.
> >     </subsection>
> >     <subsection name="TextMining.org">
> >     The <a href="http://www.textmining.org/">Simple Text Extractor
> >     Library</a> is for use with Word documents. It relies on POI.
> >     </subsection>
> >     </section>
> >
> >     <section name="Excel">
> >     <subsection name="POI">
> >     <a href="http://jakarta.apache.org/poi/">Jakarta Apache POI</a> has
> >     an excellent Microsoft Excel parser for versions of Excel from
> >     Office 97, 2000, and XP.  You can also modify Excel files with this
> >     tool.
> >     </subsection>
> >     </section>
> >
> >     <section name="RTF - Rich Text Format">
> >     <subsection name="TetraSix MajiX">
> >     <a href="http://www.tetrasix.com/">MajiX</a> is a translation
> >     utility that turns RTF (Rich Text Format) files into XML files. These
> >     XML files can be indexed like any other XML file, or you can write
> >     custom code to handle them. See the XML section of this page.
> >     </subsection>
> >     </section>
> >
> >
> >     </body>
> > </document>
> > 21a22
> > >         <item name="Parsers"           href="/parsers.html"/>
> > >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
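
The XML demo entry in the quoted xdoc describes one Document per file and one Field per XML element. A rough sketch of the element-walking half of that idea, using only the JDK's SAX parser, might look like the following. It is an illustration under stated assumptions, not the sandbox demo's actual code: the class name XmlFieldExtractor and the method extract are hypothetical, and the Lucene step is deliberately left out. The sketch only collects (element name, text) pairs, which an application would then add to a Lucene Document as fields.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Lucene or the sandbox demo.
final class XmlFieldExtractor {

    private XmlFieldExtractor() {}

    /**
     * Returns {elementName, text} pairs, in closing-tag order, for every
     * element that directly contains non-blank character data.
     */
    static List<String[]> extract(String xml) {
        final List<String[]> fields = new ArrayList<>();
        // One text buffer per currently open element, innermost on top.
        final Deque<StringBuilder> textStack = new ArrayDeque<>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                textStack.push(new StringBuilder());
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (!textStack.isEmpty()) {
                    textStack.peek().append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String text = textStack.pop().toString().trim();
                if (!text.isEmpty()) {
                    // An application would create a Lucene field here instead.
                    fields.add(new String[] { qName, text });
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return fields;
    }
}
```

Calling XmlFieldExtractor.extract on `<doc><title>Hello</title><body>World</body></doc>` should give two pairs, title/Hello and body/World; the enclosing doc element has no text of its own and is skipped.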



