lucene-java-user mailing list archives

From "Ronnie Kolehmainen" <ron...@sunstone.se>
Subject Re: Indexing HTML
Date Wed, 04 Dec 2002 08:36:43 GMT
Dear Leo,

I'm not sure this is a solution to your problem. However, it seems that the
HTMLParser used by the IndexHTML class has problems parsing the document
(there is a test class included in the jar):


>java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar org.apache.lucene.demo.html.Test f01529.txt
Title: Webcz.cz - Power of search
Parse Aborted: Encountered "\'" at line 106, column 27.
Was expecting one of:
    <ArgName> ...
    <TagEnd> ...


If you look at the source of that document you can see there is a JavaScript
block containing this problematic line:


	document.write('<s' + 'cript
src="http://ad.webcz.cz/adwebcz/adscript.asp?a=10&t=0&b=0&x=468&y=60&nocache
=' + nIndex + '">');
                        ^


It looks to me as if the HTMLParser does _not_ handle <script> tags
correctly, i.e. it does not ignore everything until </script>. If you check
stdout there should be error messages from the ParserThread class like the
one above.

I tried parsing the same document with another HTML parser class without any
problems. Maybe try replacing the HTMLParser class used by HTMLDocument with
your own? Or edit the HTMLParser.jj file if you have JavaCC knowledge.
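
A third option, if you would rather not touch the parser at all, is to strip
the script blocks from the text before handing it to whatever parser you use.
Below is a minimal sketch of that idea. The class name ScriptStripper and the
method stripScripts are hypothetical (they are not part of the Lucene demo),
and the code assumes a JDK that ships java.util.regex (1.4 or later). It only
removes blocks that have a closing </script>, so an unterminated script is
left untouched.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: removes <script>...</script>
// blocks so that JavaScript string literals never reach the HTML parser.
public class ScriptStripper {

    // Opening tag with optional attributes, then the shortest possible body
    // up to the first closing tag. DOTALL lets '.' match newlines.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static String stripScripts(String html) {
        return SCRIPT_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // A cut-down stand-in for the kind of line that trips the parser.
        String html = "<html><head><title>t</title>\n"
                + "<script language=\"JavaScript\">\n"
                + "document.write('<s' + 'cript src=\"x.js\">');\n"
                + "</script></head><body>Hello</body></html>";
        System.out.println(stripScripts(html));
    }
}
```

This is a regex shortcut rather than real HTML parsing, so treat it as a
workaround for files like yours, not as a general fix.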


/Ronnie



> -----Original message-----
> From: Leo Galambos [mailto:galambos@com-os2.ms.mff.cuni.cz]
> Sent: 3 December 2002 20:32
> To: lucene-user@jakarta.apache.org
> Subject: Indexing HTML
>
>
> I tried to use IndexHTML (demo) and Lucene 1.2 for indexing *.CZ, but
> Lucene often falls into a never-ending loop. I've analyzed my data, so I
> know which file(s) bring Lucene down. I don't see anything special in the
> file(s), so I think they get through the parser to the main Lucene
> routines (and then the problem could be in Merger).
>
> Could you help me, please?
>
> One of the problematic files:
> http://com-os2.ms.mff.cuni.cz/bugs/f01529.txt
> My program (based on Lucene demo):
> http://com-os2.ms.mff.cuni.cz/bugs/IndexHTML.java
>
> Thank you very much.
>
> -g-
>
>
> --
> To unsubscribe, e-mail:
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
>
>



