lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephane James Vaucher <vauch...@cirano.qc.ca>
Subject Re: what web crawler work best with Lucene?
Date Thu, 22 Apr 2004 04:18:31 GMT
How big is the site?

I mostly use an inhouse solution, but I've used HttpUnit for web scrapping
small sites (because of its high-level api).

Here is a hello world example:
http://wiki.apache.org/jakarta-lucene/HttpUnitExample

For a small/simple site, small modifications to this class could suffice.
IT WILL NOT function on large sites because of memory problems.

For larger sites, there are questions like:

- memory:
For example, spidering all links on every page can lead to visiting too
many links. Keeping all visited links in memory can be problematic

- noise
If you get every page on your web site, you might be adding noise to the
search engine. Spider navigation rules can help out, like saying that you
should only follow links/index documents of a specific form like
www.mysite.com/news/article.jsp?articleid=xxx

- speed:
Too much speed can be bad if you doing 100 hits/sec on a site could hurt
it (especially if it's not you who are the webmaster)
Too little speed can be bad if you want to make sure you quickly get new
pages.

- categorisation:
You might want to separate information in your index. For example, you
might want a user to do a search in the documentation section or in the
press release section. This categorisation can be done by specifying
sections to the site, or a subsequent analysis of available docs.

-up-to-date information
You'll want to think of your update schedule, so that if you add a new
page, it gets indexed quickly. This problem also occurs when you modify an
existing page, you might want the modification to be detected rapidly.

HTH,
sv

On Thu, 22 Apr 2004, Tuan Jean Tee wrote:

> Have anyone implemented any open source web crawler with Lucene? I have
> a dynamic website and are looking at putting in a search tools. Your
> advice is very much appreciated.
>
> Thank you.
>
>
> IMPORTANT -
>
> This email and any attachments are confidential and may be privileged in
> which case neither is intended to be waived. If you have received this
> message in error, please notify us and remove it from your system. It is
> your responsibility to check any attachments for viruses and defects
> before opening or sending them on. Where applicable, liability is
> limited by the Solicitors Scheme approved under the Professional
> Standards Act 1994 (NSW). Minter Ellison collects personal information
> to provide and market our services. For more information about use,
> disclosure and access, see our privacy policy at www.minterellison.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message