lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Leo Galambos <>
Subject Re: SV: Indexing HTML
Date Sat, 07 Dec 2002 15:41:03 GMT
> I'm not sure this is a solution to your problem. However, it seems that the
> HTMLParser used by the IndexHTML class has problems parsing the document
> (there is a test class included in the jar):
> >java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
> org.apache.lucene.demo.html.Test f01529.txt
> Title: - Power of search
> Parse Aborted: Encountered "\'" at line 106, column 27.
> Was expecting one of:
>     <ArgName> ...
>     <TagEnd> ...
> /Ronnie

Hi Ronnie!

I know about it and the exception is handled well (see log file below). I
have found a better example than 1529, try this: This file cannot go throught
Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file is
specific, i.e. it has two titles, two base tags etc.

I have not debugger here, so I cannot find the line where is the bug. If
you try your magic, please, let me know about the patch. :) THX


adding save/d00320/f01516.html
Parse Aborted: Lexical error at line 68, column 11.  Encountered: "\u0178" 
(376), after : ""
adding save/d00320/f01527.html
Parse Aborted: Encountered "=" at line 83, column 48.
Was expecting one of:
    <ArgName> ...
    <TagEnd> ...

adding save/d00320/f01528.html

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message