lucene-java-user mailing list archives

From Stephane James Vaucher <vauch...@cirano.qc.ca>
Subject RE: Time to index documents
Date Thu, 26 Aug 2004 03:52:47 GMT
Hetan,

If your corpus was written by multiple editors, I suggest that you
run it through a cleaner like tidy first, as there might be weird stuff
appearing in the HTML.

sv
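
For anyone who wants to try the tidy suggestion above, here is a minimal sketch of a pre-cleaning pass, run before the indexer sees the files. It assumes the HTML Tidy command-line tool is installed and on the PATH; the class name TidyBeforeIndexing, the two directory arguments, and the choice of flags are illustrative assumptions, not something from this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: copies a tree of HTML files into a second
// directory, passing each file through the external "tidy" program.
public class TidyBeforeIndexing {

    // Builds the tidy command line for one file.
    // -q: quiet, -asxhtml: emit well-formed XHTML,
    // --force-output yes: write output even if tidy found errors,
    // -o: output file.
    static List<String> tidyCommand(Path input, Path output) {
        return Arrays.asList("tidy", "-q", "-asxhtml",
                "--force-output", "yes",
                "-o", output.toString(), input.toString());
    }

    // Runs tidy on one file and returns its exit status
    // (tidy uses 0 for clean, 1 for warnings, 2 for errors).
    static int clean(Path input, Path output)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(tidyCommand(input, output))
                .inheritIO() // let tidy's diagnostics go to our console
                .start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        Path srcDir = Paths.get(args[0]); // original HTML tree
        Path outDir = Paths.get(args[1]); // cleaned copy to index

        List<Path> pages;
        try (Stream<Path> walk = Files.walk(srcDir)) {
            pages = walk.filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".html"))
                    .collect(Collectors.toList());
        }

        for (Path page : pages) {
            // Mirror the source layout under the output directory.
            Path target = outDir.resolve(srcDir.relativize(page));
            Files.createDirectories(target.getParent());
            int status = clean(page, target);
            if (status >= 2) {
                System.err.println("tidy reported errors in " + page);
            }
        }
    }
}
```

The indexer would then be pointed at the cleaned directory instead of the original one. Files that tidy flags with errors are the likeliest candidates for the "Parse Aborted" messages, so the list printed to stderr may be worth checking by hand.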

On Thu, 26 Aug 2004, Karthik N S wrote:

> Hi Hetan
> 
> 
>    That's the major problem with non-standardized tags in the HTML documents
>    you are indexing; they cause the lag in the indexing process.
> 
> 
>    If you can, tweak the HTMLParser.jj file under '/demo/html' in lucene.zip
>    [you need some knowledge of JavaCC for this].
> 
> 
> 
> Karthik
> 
> -----Original Message-----
> From: Hetan Shah [mailto:Hetan.Shah@Sun.COM]
> Sent: Thursday, August 26, 2004 3:01 AM
> To: Lucene Users List
> Subject: Time to index documents
> 
> 
> Hello all,
> 
> Is there a way to reduce the indexing time when the indexer is
> processing about 30,000+ files? It currently takes roughly 6-7 hours to
> do this. I am using the IndexHTML class to create the index from HTML files.
> 
> Another issue is that every once in a while I get the following
> output on the screen:
> 
> adding ../31/1104852.html
> Parse Aborted: Encountered "\"" at line 7, column 1.
> Was expecting one of:
>      <ArgName> ...
>      "=" ...
>      <TagEnd> ...
> 
> Any suggestions on preventing this from happening?
> 
> Thanks in advance.
> -H
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 

