lucene-java-user mailing list archives

From Leo Galambos <galam...@com-os2.ms.mff.cuni.cz>
Subject Re: SV: Indexing HTML
Date Sat, 07 Dec 2002 15:41:03 GMT
> I'm not sure this is a solution to your problem. However, it seems that the
> HTMLParser used by the IndexHTML class has problems parsing the document
> (there is a test class included in the jar):
> 
> 
> >java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
> org.apache.lucene.demo.html.Test f01529.txt
> Title: Webcz.cz - Power of search
> Parse Aborted: Encountered "\'" at line 106, column 27.
> Was expecting one of:
>     <ArgName> ...
>     <TagEnd> ...
> /Ronnie

Hi Ronnie!

I know about it, and the exception is handled well (see the log file below). I
have found a better example than 1529; try this:
http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot get through the
Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file is
unusual, i.e. it has two titles, two base tags, etc.
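
To pick out files with this kind of structure before they reach the parser, a rough pre-check along the following lines might help. This is only a sketch: the class name DuplicateTagCheck and the method countOpeningTags are made up for illustration and are not part of Lucene, it assumes a JDK that ships java.util.regex (1.4 or later), and it has not been established that the duplicated tags are what actually trips the parser. It is a plain regex scan, so it will also count tags that appear inside comments or scripts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: counts how often an opening
// tag such as <title> or <base> occurs in a piece of HTML text.
public class DuplicateTagCheck {

    // Counts opening tags with the given name, ignoring case.
    // Closing tags (</title>) and longer names (<titles>) are not counted.
    public static int countOpeningTags(String html, String tagName) {
        Pattern p = Pattern.compile(
                "<" + Pattern.quote(tagName) + "(\\s[^>]*)?/?>",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample input with two titles and two base tags.
        String html = "<html><head><title>A</title>"
                + "<base href=\"http://example.com/\">"
                + "<TITLE>B</TITLE>"
                + "<base href=\"http://example.org/\">"
                + "</head><body></body></html>";
        System.out.println("title tags: " + countOpeningTags(html, "title"));
        System.out.println("base tags: " + countOpeningTags(html, "base"));
    }
}
```

Files for which either count is greater than one could then be logged separately, to see whether they line up with the ones the parser rejects.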

I do not have a debugger here, so I cannot find the line where the bug is. If
you work your magic, please let me know about the patch. :) THX

-g-



adding save/d00320/f01516.html
Parse Aborted: Lexical error at line 68, column 11.  Encountered: "\u0178" (376), after : ""
:
adding save/d00320/f01527.html
Parse Aborted: Encountered "=" at line 83, column 48.
Was expecting one of:
    <ArgName> ...
    <TagEnd> ...

adding save/d00320/f01528.html




