lucene-dev mailing list archives

From "Daniel Calvo" <dca...@task.com.br>
Subject RE: HTMLParser
Date Sat, 16 Feb 2002 04:32:17 GMT
Maybe... I'll have to give it a try first.

Anyway, I was playing with Lucene's HTMLParser to get a better
understanding of how JavaCC works. My real interest is in PDF and RTF
parsers. I tried the Websearch PDF parser, but it only worked well with
the examples provided; I wasn't able to correctly parse even PDF files
distributed by Adobe. I've also had a lot of trouble with files
converted to PDF (probably via dvi2pdf or something like that). Recently
I read on this list (or maybe it was the users list) that someone else
was having trouble with both the Websearch and PJ library parsers.

I've just downloaded Adobe's PDF Specification, and later I'll see
whether there's any room for improvement in the Websearch code. I know
PDF has various features (compression, encryption, etc.) that complicate
parsing, and I'm not willing to spend much time on this, but I'll
probably try something.
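
To illustrate why those features matter, here is a rough sketch (not
Websearch or PJ code) that checks whether a PDF mentions the /Encrypt
and /FlateDecode names, which PDF uses for encryption dictionaries and
zlib-compressed streams. The class and method names are made up for
this example, and a plain substring search is only a crude heuristic:
it can give false positives, and it misses names written with escapes
or abbreviations.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of any existing parser.
class PdfFeatureSniffer {

    // Returns true if the raw bytes contain the given marker text.
    static boolean mentions(byte[] pdfBytes, String marker) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data
        // survives the conversion and can be searched as a string.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains(marker);
    }

    // An /Encrypt entry normally appears in the trailer of an encrypted file.
    static boolean looksEncrypted(byte[] pdfBytes) {
        return mentions(pdfBytes, "/Encrypt");
    }

    // /FlateDecode is the filter name for zlib-compressed streams.
    static boolean usesFlate(byte[] pdfBytes) {
        return mentions(pdfBytes, "/FlateDecode");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfFeatureSniffer <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("encrypted?   " + looksEncrypted(data));
        System.out.println("FlateDecode? " + usesFlate(data));
    }
}
```

A file that reports either feature needs decompression or decryption
before its text can be extracted, which may be part of why simple
parsers do well only on simple examples.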

--Daniel

> -----Original Message-----
> From: Paulo Gaspar [mailto:paulo.gaspar@krankikom.de]
> Sent: Friday, February 15, 2002 23:14
> To: Lucene Developers List
> Subject: RE: HTMLParser
>
>
> Could the following Xerces-based HTML parser be of interest for
> your work?
>
> This is just the initial ANNOUNCE message, but there have been
> further developments.
>
>
> Have fun,
> Paulo Gaspar
>
> > -----Original Message-----
> > From: Andy Clark [mailto:andyc@apache.org]
> > Sent: Saturday, February 09, 2002 4:16 AM
> > To: general@xml.apache.org
> > Cc: xerces-j-dev@xml.apache.org
> > Subject: [ANNOUNCE] Xerces HTML Parser
> >
> >
> > For a long time users have asked if Xerces can parse HTML files.
> > But since most HTML documents are not well-formed XML documents,
> > it is generally not possible to use a conforming XML parser to
> > read HTML documents.
> >
> > However, the Xerces Native Interface (XNI) that is the foundation
> > of the Xerces2 implementation defines a framework that allows
> > different kinds of parsers to be constructed by connecting a
> > pipeline of parser components. Therefore, as long as a component
> > can be written that generates the appropriate XNI "events", then
> > it can be used to emit SAX events, build DOM trees, or anything
> > else that you can think of.
> >
> > So, as a fun little exercise, I have written a basic HTML parser
> > using XNI. It consists of an HTML scanner component that can scan
> > HTML files and generate XNI events and a tag balancing component.
> > The tag balancer cleans up the events produced by the scanner,
> > balancing mismatched tags and adding tags where necessary. And
> > it does all of this in a streaming manner to minimize the amount
> > of memory required.
> >
> > Since I wrote the HTML parser as an example of using XNI and
> > because the code is considered alpha quality (but it seems to
> > work quite well, actually!), I am posting the code with a very
> > limited license. Even though it contains the complete source
> > code for the HTML parser, the license only allows the user to
> > experiment but gives no right to actually use the code in a
> > product.
> >
> > If the source isn't "free" or "open", why release it at all?
> > I want to get an idea of what people think of the code first.
> > Then, if there's enough interest, I would like to either donate
> > the code to the Xerces-J project or make it available elsewhere
> > under a true open source license.
> >
> > So, if you've been looking for a way to parse HTML documents
> > please try out the HTML parser and let me know what you think.
> > There should be enough information in the documentation to get
> > you started. Check out the "NekoHTML" project listed on my
> > Apache web site: http://www.apache.org/~andyc/
> >
> > Have fun!
> >
> > --
> > Andy Clark * andyc@apache.org
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> > For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
> >
>
> --
> To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
>
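
The tag balancing described in the quoted announcement can be sketched
with a simple stack. This is only a toy illustration of the idea, not
NekoHTML or XNI code: the class name and the token format (plain
strings such as "<b>" and "</b>") are assumptions made for the example,
and it builds a string instead of emitting parser events.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; the real component works on XNI events.
class TagBalancerSketch {

    // Tokens are "<name>" (open), "</name>" (close), or plain text.
    // Returns markup in which every opened element is closed in order.
    static String balance(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();  // names of currently open elements
        StringBuilder out = new StringBuilder();
        for (String t : tokens) {
            if (t.startsWith("</")) {
                String name = t.substring(2, t.length() - 1);
                if (!open.contains(name)) {
                    continue;                      // stray close tag: drop it
                }
                // Close elements still open inside the one being closed.
                while (!open.peek().equals(name)) {
                    out.append("</").append(open.pop()).append('>');
                }
                open.pop();
                out.append(t);
            } else if (t.startsWith("<")) {
                open.push(t.substring(1, t.length() - 1));
                out.append(t);
            } else {
                out.append(t);                     // text passes through unchanged
            }
        }
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balance(
            Arrays.asList("<p>", "<b>", "bold", "</p>", "</i>", "tail")));
    }
}
```

Each token is handled as it arrives and only the stack of open element
names is kept, which is the property that makes a streaming
implementation possible.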



