lucene-java-user mailing list archives

From "Ronnie Kolehmainen" <ron...@sunstone.se>
Subject SV: SV: Indexing HTML
Date Mon, 09 Dec 2002 08:16:00 GMT
Hi,

These are the classes I use. I only use them to extract the text, so they
don't have methods for getting the document title and such. However, text
extraction has worked fine for me.

The HtmlParser main method takes a file path as its argument and writes the
extracted contents to a file named html.txt, which is useful when testing.
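
For readers without the attached classes, a minimal sketch of the same idea
(drop the markup, keep the text, write it to html.txt) could look like the
following. This is not the attached HtmlParser: the class name HtmlTextSketch
and the method extractText are made up for illustration, the code relies on
the HTML parser that ships with the JDK (javax.swing.text.html), and it
assumes the input file is UTF-8.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical illustration only; not the HtmlParser class attached to this mail.
class HtmlTextSketch {

    /** Collects the text chunks reported by the JDK's HTML parser, joined by spaces. */
    static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text found between tags.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        // 'true' tells the parser to ignore charset declarations in the document,
        // since the Reader has already decoded the bytes.
        new ParserDelegator().parse(html, callback, true);
        return text.toString();
    }

    /** Usage: java HtmlTextSketch <file>; writes the extracted text to html.txt. */
    public static void main(String[] args) throws IOException {
        // Assumption: the input file is UTF-8 encoded.
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            Files.write(Paths.get("html.txt"),
                        extractText(in).getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

The JDK parser is lenient about malformed markup but is based on an HTML 3.2
DTD, so results on unusual pages may differ from what a dedicated parser
produces.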

/Ronnie


> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: 7 December 2002 17:12
> To: Lucene Users List
> Subject: Re: SV: Indexing HTML
>
>
> I have had good experiences with the NekoHTML parser.
>
> Otis
>
> --- Leo Galambos <galambos@com-os2.ms.mff.cuni.cz> wrote:
> > > I'm not sure this is a solution to your problem. However, it seems that
> > > the HTMLParser used by the IndexHTML class has problems parsing the
> > > document (there is a test class included in the jar):
> > >
> > >
> > > >java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
> > > org.apache.lucene.demo.html.Test f01529.txt
> > > Title: Webcz.cz - Power of search
> > > Parse Aborted: Encountered "\'" at line 106, column 27.
> > > Was expecting one of:
> > >     <ArgName> ...
> > >     <TagEnd> ...
> > > /Ronnie
> >
> > Hi Ronnie!
> >
> > I know about it, and the exception is handled well (see the log file
> > below). I have found a better example than 1529; try this:
> > http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot get through
> > the Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file is
> > unusual, i.e. it has two titles, two base tags, etc.
> >
> > I have no debugger here, so I cannot find the line where the bug is. If
> > you try your magic, please let me know about the patch. :) Thanks
> >
> > -g-
> >
> >
> >
> > adding save/d00320/f01516.html
> > Parse Aborted: Lexical error at line 68, column 11.  Encountered: "\u0178" (376), after : ""
> > :
> > adding save/d00320/f01527.html
> > Parse Aborted: Encountered "=" at line 83, column 48.
> > Was expecting one of:
> >     <ArgName> ...
> >     <TagEnd> ...
> >
> > adding save/d00320/f01528.html
> >
> >
> >
> > --
> > To unsubscribe, e-mail:
> > <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> > <mailto:lucene-user-help@jakarta.apache.org>
> >
