lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: New xdoc for parsers
Date Tue, 08 Apr 2003 15:40:44 GMT

--- Jeff Linwood <jeff@greenninja.com> wrote:
> Cool.
> 
> Any reason not to include it as an xdoc though as well?  The Lucene site is
> a little confusing to the newbie user who might just want to see if Lucene
> can match Inktomi, Index Server, whatever by supporting Microsoft or
> PDF formats.

Parsers are not really a part of Lucene, so I thought FAQ entries would
be better.  If that proves insufficient, I'll add them directly to the
site.

> It also needs a section on Indexing JSP files, since that gets asked
> a lot.

Yes, correct, I will add that soon.

Otis


> jeff
> ----- Original Message -----
> From: "Otis Gospodnetic" <otis_gospodnetic@yahoo.com>
> To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
> Sent: Tuesday, April 08, 2003 9:55 AM
> Subject: Re: New xdoc for parsers
> 
> 
> > Thanks Jeff, I put all this in the Lucene FAQ at jGuru.
> >
> > Otis
> >
> > --- Jeff Linwood <jeff@greenninja.com> wrote:
> > > > Since this question comes up all the time on the users list, and the
> > > > FAQ entry is umm...unhelpful :) I created an xdoc listing all the
> > > > parsers I knew about.
> > >
> > > > Yes, I know this sort of duplicates the resources.xml xdoc, but this
> > > > is more descriptive, and clearer from the menu as to what it is. I
> > > > hope.
> > >
> > > I also included a patch to the project.xml stylesheet if this
> > > contribution gets accepted.
> > >
> > > Thanks,
> > > Jeff Linwood
> > > > <?xml version="1.0"?>
> > > <document>
> > >     <properties>
> > >     <author email="jeff@greenninja.com">Jeff Linwood</author>
> > >     <title>Parsers - Jakarta Lucene</title>
> > >     </properties>
> > >     <body>
> > >
> > >     <section name="Introduction">
> > >         <p>
> > >         Lucene is capable of indexing any file format for
> documents,
> > > but the application that uses the Lucene search engine is
> > >         responsible for translating these document types into a
> > > format that Lucene can understand.  Several of these formats
> > >         can be indexed with open source or free solutions, and
> links
> > > are given to the appropriate sites. Many Lucene users use
> > >         more than one of these in their applications.
> > >         </p>
> > >     </section>
> > >
> > >     <section name="HTML">
> > >         <subsection name="JavaCC and IndexHTML">
> > >         An example that uses JavaCC to parse HTML into Lucene
> > > Document objects is provided in the <a href="demo3.html">Lucene
> web
> > >         application demo</a> that comes with the Lucene
> > > distribution.
> > >         </subsection>
> > >         <subsection name="NekoHTML">
> > > >         The <a href="http://www.apache.org/~andyc/neko/doc/html/">CyberNeko
> > > >         HTML Parser</a> lets you parse HTML documents. It's
> > > >         relatively easy to remove most of the tags from an HTML document
> > > >         (or all if you want), and then use the ones you left in
> > > >         to help create metadata for your Lucene document. NekoHTML also
> > > >         provides a DOM model for navigating through the HTML.
> > >         </subsection>
> > >         <subsection name="JTidy">
> > > >         <a href="http://sourceforge.net/projects/jtidy/">JTidy</a>
> > > >         cleans up HTML, and can provide a DOM interface to the HTML
> > > >         files through a Java API.
> > >         </subsection>
> > >     </section>
> > >
> > >     <section name="PDF">
> > >     <subsection name="PDFBox">
> > >     <a href="http://pdfbox.org/">PDFBox</a> is a Java API from
> Ben
> > > Litchfield that will let you access the contents of a
> > >     PDF document. It comes with integration classes for Lucene to
> > > translate a PDF into a Lucene document.
> > >     </subsection>
> > >     <subsection name="XPDF">
> > >     <a href="http://www.foolabs.com/xpdf/">XPDF</a> is an open
> > > source tool that is licensed under the GPL. It's not a Java
> > >     tool, but there is a utility called pdftotext that can
> > > translate PDF files into text files on most platforms from the
> > >     command line.
> > >     </subsection>
> > >     <subsection name="PDF to HTML">
> > >     Based on xpdf, there is a utility called <a
> > > href="http://pdftohtml.sourceforge.net/">pdftohtml</a> that can
> > > translate
> > >     PDF files into HTML files. This is also not a Java
> application.
> > >     </subsection>
> > >     <subsection name="JPedal">
> > >     <a href="http://www.jpedal.org/">JPedal</a> is a Java API for
> > > extracting text and images from PDF documents.
> > >     </subsection>
> > >     <subsection name="TextMining.org">
> > >     <a href="http://www.textmining.org/">Simple Text Extractor
> > > Library</a> for use with PDF documents. Relies on PDFBox.
> > >     </subsection>
> > >
> > >     </section>
> > >
> > >     <section name="XML">
> > >     <subsection name="Lucene SAX/DOM indexing Demo">
> > > >     <a href="http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/XML-Indexing-Demo/">XML
> > > >     Demo</a>
> > >     This contribution is some sample code that demonstrates
> adding
> > > simple XML documents into the index.
> > >     It creates a new Document object for each file, and then
> > > populates the Document with a Field
> > >     for each XML element, recursively. There are examples
> included
> > > for both SAX and DOM.
> > >     </subsection>
> > >     </section>
> > >
> > >     <section name="Word">
> > >     <subsection name="POI">
> > > >     <a href="http://jakarta.apache.org/poi/">Jakarta Apache POI</a>
> > > >     has an early-stage Microsoft Word parser
> > > >     for versions of Word from Office 97, 2000, and XP.
> > >     </subsection>
> > >     <subsection name="TextMining.org">
> > > >     <a href="http://www.textmining.org/">Simple Text Extractor
> > > > Library</a> for use with Word documents. Relies on POI.
> > >     </subsection>
> > >     </section>
> > >
> > >     <section name="Excel">
> > >     <subsection name="POI">
> > >     <a href="http://jakarta.apache.org/poi/">Jakarta Apache
> POI</a>
> > > has an excellent Microsoft Excel parser
> > >     for versions of Excel from Office 97, 2000, and XP.  You can
> > > also modify Excel files with this tool.
> > >     </subsection>
> > >     </section>
> > >
> > >     <section name="RTF - Rich Text Format">
> > >     <subsection name="TetraSix MajiX">
> > >     <a href="http://www.tetrasix.com/">MajiX</a> is a translation
> > > utility that will turn RTF (Rich Text Format) files
> > >     into XML files. These XML files could be indexed like any
> other
> > > XML file, or you could write some custom code. See the
> > >     XML section of this page.
> > >     </subsection>
> > >     </section>
> > >
> > >
> > >     </body>
> > > </document>
> > > > 21a22
> > > >         <item name="Parsers"           href="/parsers.html"/>
> > > >
> >
> ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail:
> lucene-dev-help@jakarta.apache.org
> >
> >
> > __________________________________________________
> > Do you Yahoo!?
> > Yahoo! Tax Center - File online, calculators, forms, and more
> > http://tax.yahoo.com
> 
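
For readers who want to see the idea behind the XML section of the quoted
xdoc (one Document per file, one Field per XML element, filled in
recursively), here is a minimal DOM-based sketch. It is not the sandbox
demo's code. To keep it runnable with only the JDK, a plain Map stands in
for the Lucene Document, and the class and method names (XmlFieldSketch,
extractFields) are made up for illustration. In a real indexer each map
entry would presumably become a Field added to a Document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: walks an XML document and gathers element text.
public class XmlFieldSketch {

    // Returns element name -> text values found directly inside such elements.
    // The Map is a stand-in for a Lucene Document (assumption, see lead-in).
    public static Map<String, List<String>> extractFields(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        org.w3c.dom.Document dom = builder.parse(new InputSource(new StringReader(xml)));
        Map<String, List<String>> fields = new LinkedHashMap<>();
        collect(dom.getDocumentElement(), fields);
        return fields;
    }

    // Recursively visits child elements, then records this element's own text.
    private static void collect(Element element, Map<String, List<String>> fields) {
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(child.getNodeValue());
            } else if (type == Node.ELEMENT_NODE) {
                collect((Element) child, fields);
            }
        }
        String value = text.toString().trim();
        // Elements that only wrap other elements carry no text and are skipped.
        if (!value.isEmpty()) {
            fields.computeIfAbsent(element.getTagName(), k -> new ArrayList<>()).add(value);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<note><to>users</to><body>hello</body></note>";
        System.out.println(extractFields(xml));
    }
}
```

A SAX version would gather the same name/text pairs from startElement,
characters and endElement callbacks instead of walking a tree, which avoids
holding the whole document in memory.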

