lucene-java-user mailing list archives

From Stephane James Vaucher <vauch...@cirano.qc.ca>
Subject Re: Time to index documents
Date Wed, 25 Aug 2004 22:24:41 GMT
JGuru explanation: 
http://www.jguru.com/faq/view.jsp?EID=1074228

I have no sample code for NekoHTML, though I think Nutch uses it. For Tidy, 
you can look at the ant contribution in the sandbox:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/ant/src/main/org/apache/lucene/ant/HtmlDocument.java?rev=1.3&view=markup
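
As a rough illustration of the general idea only: the sketch below is not the 
sandbox code and uses neither Tidy nor NekoHTML. It swaps in the lenient HTML 
parser that ships with the JDK (javax.swing.text.html.parser) to pull the text 
out of sloppy markup without aborting. The class name LenientHtmlText and the 
method extract are made up for this example, and wiring the resulting string 
into a Lucene Document is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Lucene, Tidy or NekoHTML.
public class LenientHtmlText {

    /** Collects the text chunks reported by the JDK's lenient HTML parser. */
    public static String extract(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Each text run is appended, separated by a single space.
                text.append(data).append(' ');
            }
        };
        // Third argument: ignore any charset declared inside the document,
        // since the caller already supplies decoded characters.
        new ParserDelegator().parse(html, callback, true);
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Deliberately sloppy markup: <b> is never closed, nor are body/html.
        String sloppy = "<html><body><p>Hello <b>world</p>";
        System.out.println(extract(new StringReader(sloppy)));
    }
}
```

The extracted string could then be added to a document as its contents field 
in place of what the demo parser produces. Whether this is any faster than the 
demo parser is untested here.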

HTH,
sv

On Wed, 25 Aug 2004, Hetan Shah wrote:

> Do you have any pointers for sample code for them?
> Would highly appreciate it.
> Thanks.
> -H
> 
> Stephane James Vaucher wrote:
> 
> > I don't think that the demo parser is meant as a production 
> > system component. You can look at Tidy or NekoHTML. They clean up your HTML 
> > and are probably optimised.
> > 
> > sv
> > 
> > On Wed, 25 Aug 2004, Hetan Shah wrote:
> > 
> > 
> >>Hello all,
> >>
> >>Is there a way to reduce the indexing time when the indexer is 
> >>processing about 30,000+ files? It currently takes roughly 6-7 hours. 
> >>I am using the IndexHTML class to create the index from HTML files.
> >>
> >>Another issue I see is that every once in a while I get the following 
> >>output on the screen:
> >>
> >>adding ../31/1104852.html
> >>Parse Aborted: Encountered "\"" at line 7, column 1.
> >>Was expecting one of:
> >>     <ArgName> ...
> >>     "=" ...
> >>     <TagEnd> ...
> >>
> >>Any suggestions on preventing this from happening?
> >>
> >>Thanks in advance.
> >>-H
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >>
> > 
> > 
> > 
> > 
> 
> 
> 



