lucene-java-user mailing list archives

From: Sebastian Ho <sebasti...@bii.a-star.edu.sg>
Subject: Re: what web crawler work best with Lucene?
Date: Thu, 22 Apr 2004 05:21:50 GMT

You can also use the HTML parser API available in Java 1.4.2. I tried it
last week with a simple program that retrieves HTML files and displays
all the HREF links in them. I got it done within a day.
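
For anyone wanting to try the same thing, here is a minimal sketch of such
a link lister. It assumes the parser in question is the one under
javax.swing.text.html that ships with the JDK. The class name LinkLister
and the sample page are made up for illustration, and fetching the page
over HTTP is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Lucene or of the original program.
class LinkLister {

    /** Collects the HREF attribute of every <a> start tag read from the reader. */
    static List<String> extractLinks(Reader html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // The third argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample page; a real program would read from a URL or a file.
        String page = "<html><body><a href=\"/news/a.jsp\">A</a>"
                + " <a href=\"b.html\">B</a></body></html>";
        for (String link : extractLinks(new StringReader(page))) {
            System.out.println(link);
        }
    }
}
```

The links come back exactly as written in the page, so relative ones
still have to be resolved against the page URL before they can be
fetched.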

sebastian


On Thu, 2004-04-22 at 12:18, Stephane James Vaucher wrote:
> How big is the site?
> 
> I mostly use an in-house solution, but I've used HttpUnit for web scraping
> small sites (because of its high-level API).
> 
> Here is a hello world example:
> http://wiki.apache.org/jakarta-lucene/HttpUnitExample
> 
> For a small/simple site, small modifications to this class could suffice.
> IT WILL NOT function on large sites because of memory problems.
> 
> For larger sites, there are questions like:
> 
> - memory:
> For example, spidering all links on every page can lead to visiting too
> many links. Keeping all visited links in memory can be problematic.
> 
> - noise:
> If you get every page on your web site, you might be adding noise to the
> search engine. Spider navigation rules can help, for example by saying
> that you should only follow links or index documents of a specific form,
> like www.mysite.com/news/article.jsp?articleid=xxx
> 
> - speed:
> Too much speed can be bad: doing 100 hits/sec on a site could hurt it
> (especially if you are not the webmaster).
> Too little speed can be bad if you want to make sure you pick up new
> pages quickly.
> 
> - categorisation:
> You might want to separate information in your index. For example, you
> might want a user to do a search in the documentation section or in the
> press release section. This categorisation can be done by specifying
> sections of the site, or by a subsequent analysis of the available docs.
> 
> - up-to-date information:
> You'll want to think about your update schedule, so that if you add a new
> page, it gets indexed quickly. The same problem occurs when you modify an
> existing page: you might want the modification to be detected rapidly.
> 
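
The memory and noise points above can be sketched roughly as follows.
This is only one way to read them, not the in-house solution mentioned:
a frontier that queues a URL only if it matches a navigation rule, and
that remembers a 64-bit fingerprint of each accepted URL instead of the
full string. CrawlFrontier, the FNV-1a-style hash, and the regex (which
assumes an http:// prefix and a numeric articleid) are all assumptions
made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical class: queue of URLs still to fetch.
class CrawlFrontier {
    private final Pattern followRule;                     // navigation rule
    private final Set<Long> seen = new HashSet<>();       // fingerprints, not full URLs
    private final Deque<String> pending = new ArrayDeque<>();

    CrawlFrontier(String followRegex) {
        this.followRule = Pattern.compile(followRegex);
    }

    /** Queues the URL and returns true, or returns false if it is filtered out or already seen. */
    boolean offer(String url) {
        if (!followRule.matcher(url).matches()) {
            return false;                                 // does not match the rule: treated as noise
        }
        if (!seen.add(fingerprint(url))) {
            return false;                                 // fingerprint already recorded
        }
        pending.addLast(url);
        return true;
    }

    /** Next URL to fetch, or null when nothing is pending. */
    String next() {
        return pending.pollFirst();
    }

    /** FNV-1a-style 64-bit hash over the string's chars; collisions are unlikely but possible. */
    static long fingerprint(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        CrawlFrontier frontier = new CrawlFrontier(
                "http://www\\.mysite\\.com/news/article\\.jsp\\?articleid=\\d+");
        System.out.println(frontier.offer("http://www.mysite.com/news/article.jsp?articleid=7"));
        System.out.println(frontier.offer("http://www.mysite.com/news/article.jsp?articleid=7"));
        System.out.println(frontier.offer("http://www.mysite.com/about.html"));
    }
}
```

The trade-off is that a fingerprint collision would silently skip a
page, and the pending queue still holds full URLs until they are
fetched. For a really large site the visited set would probably have to
live on disk anyway.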
> HTH,
> sv
> 
> On Thu, 22 Apr 2004, Tuan Jean Tee wrote:
> 
> > Has anyone implemented an open source web crawler with Lucene? I have
> > a dynamic website and am looking at putting in a search tool. Your
> > advice is very much appreciated.
> >
> > Thank you.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> 
> 
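
On the speed point in the quoted message, a crawler can keep itself
polite by enforcing a minimum gap between requests to the same site. A
minimal sketch, assuming a single target site and a fixed interval; the
class name PoliteThrottle and the 500 ms figure are made up, and a real
crawler would probably keep one throttle per host:

```java
// Hypothetical helper: spaces requests at least minIntervalMillis apart.
class PoliteThrottle {
    private final long minIntervalMillis;
    private long nextAllowedAt = 0L;   // earliest time the next request may start

    PoliteThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Reserves the next request slot for a caller arriving at nowMillis and
     * returns how many milliseconds that caller has to wait before sending.
     */
    synchronized long reserve(long nowMillis) {
        long start = Math.max(nowMillis, nextAllowedAt);
        nextAllowedAt = start + minIntervalMillis;
        return start - nowMillis;
    }

    /** Blocks the calling thread until its reserved slot comes up. */
    void acquire() throws InterruptedException {
        long wait = reserve(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PoliteThrottle throttle = new PoliteThrottle(500);   // at most 2 requests/sec
        for (int i = 0; i < 3; i++) {
            throttle.acquire();
            System.out.println("fetch " + i + " at " + System.currentTimeMillis());
        }
    }
}
```

Calling acquire() before each fetch caps the request rate. Lowering the
interval trades politeness for fresher pages, which is the tension
described above.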



