lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Indexing HTML files give following message
Date Sun, 12 Dec 2004 19:41:04 GMT
Hello,

This is probably due to some bad HTML.  The application you are using
is just a demo, and uses a JavaCC-based HTML parser, which may not be
resilient to invalid HTML.  For Lucene in Action we developed a little
extensible indexing framework, and for HTML indexing we used 2 tools to
handle HTML parsing: JTidy and NekoHTML.  Since the code for the book
is freely available....... http://www.manning.com.  NekoHTML knows how
to deal with some bad HTML, that's why I'm suggesting this.
The indexing framework could come handy for those working on various
'desktop search' applications (Roosster, LDesktop (if that's really
happening), Lucidity, etc.)

Otis


--- Hetan Shah <Hetan.Shah@Sun.COM> wrote:

> java org.apache.lucene.demo.IndexHTML -create -index
> /source/workarea/hs152827/newIndex ..
> adding ../0/10037.html
> adding ../0/10050.html
> adding ../0/1006132.html
> adding ../0/1013223.html
> Parse Aborted: Encountered "\"" at line 5, column 1.
> Was expecting one of:
>     <ArgName> ...
>     "=" ...
>     <TagEnd> ...
> 
> And then the indexing hangs on this line. Earlier it used to go on
> and
> index remaining pages in the directory. Any idea why would the
> indexer
> stop at this error.
> 
> Pointers are much needed and appreciated.
> -H
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message