lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: SV: Indexing HTML
Date Sat, 07 Dec 2002 16:11:45 GMT
I have had good experiences with nekoHTML parser.

Otis

--- Leo Galambos <galambos@com-os2.ms.mff.cuni.cz> wrote:
> > I'm not sure this is a solution to your problem. However, it seems
> that the
> > HTMLParser used by the IndexHTML class has problems parsing the
> document
> > (there is a test class included in the jar):
> > 
> > 
> > >java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
> > org.apache.lucene.demo.html.Test f01529.txt
> > Title: Webcz.cz - Power of search
> > Parse Aborted: Encountered "\'" at line 106, column 27.
> > Was expecting one of:
> >     <ArgName> ...
> >     <TagEnd> ...
> > /Ronnie
> 
> Hi Ronnie!
> 
> I know about it and the exception is handled well (see log file
> below). I
> have found a better example than 1529, try this:
> http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot go
> throught
> Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file
> is
> specific, i.e. it has two titles, two base tags etc.
> 
> I have not debugger here, so I cannot find the line where is the bug.
> If
> you try your magic, please, let me know about the patch. :) THX
> 
> -g-
> 
> 
> 
> adding save/d00320/f01516.html
> Parse Aborted: Lexical error at line 68, column 11.  Encountered:
> "\u0178" 
> (376), after : ""
> :
> adding save/d00320/f01527.html
> Parse Aborted: Encountered "=" at line 83, column 48.
> Was expecting one of:
>     <ArgName> ...
>     <TagEnd> ...
> 
> adding save/d00320/f01528.html
> 
> 
> 
> --
> To unsubscribe, e-mail:  
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
> 


__________________________________________________
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com

--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message