From: Leo Galambos
To: Lucene Users List
Subject: Re: High Capacity (Distributed) Crawler
Date: Mon, 09 Jun 2003 23:56:20 +0200
Message-ID: <3EE50284.3080807@seznam.cz>
References: <20030609194447.57737.qmail@web12705.mail.yahoo.com>

Hi Otis.

The first beta is done (without NIO). It still needs further testing,
however. Unfortunately, I could not find enough servers that I may hit.
I wanted to commit the robot as a part of egothor (egothor will use it
in PULL mode), but we have nice weather here, so I lost any motivation
to play with the PC ;-).

What interface do you need for Lucene? Will you use PUSH mode (the robot
modifies Lucene's index) or PULL mode (the engine gets deltas from the
robot)? Tell me what you need and I will do my best.

-g-

Otis Gospodnetic wrote:

>Leo,
>
>Have you started this project? Where is it hosted?
>It would be nice to see a few alternative implementations of a robust
>and scalable java web crawler with the ability to index whatever it
>fetches.
>
>Thanks,
>Otis
>
>--- Leo Galambos wrote:
>
>>Hi.
>>
>>I would like to write $SUBJ (HCDC), because LARM does not offer many
>>options which are required by web/http crawling IMHO. Here is my
>>list:
>>
>>1. I would like to manage the decision what will be gathered first -
>>this would be based on pageRank, number of errors, connection speed
>>etc. etc.
>>2. pure JAVA solution without any DBMS/JDBC
>>3. better configuration in case of an error
>>4. NIO style as it is suggested by LARM specification
>>5. egothor's filters for automatic processing of various data formats
>>6. management of "Expires" HTTP-meta headers, heuristic rules which
>>will describe how fast a page can expire (.php often expires faster
>>than .html)
>>7. reindexing without any data exports from a full-text index
>>8. open protocol between the crawler and a full-text engine
>>
>>If anyone wants to join (or just extend the wish list), let me know,
>>please.
>>
>>-g-
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
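
The PUSH/PULL question in the reply above can be made concrete with a
minimal Java sketch. Every name in it (Delta, IndexSink, DeltaSource,
ToyRobot, MapIndex) is hypothetical and is not part of Lucene, egothor,
or LARM. The thread does not specify the real protocol, so this only
illustrates who drives the update in each mode, under the assumption
that a delta is either "page fetched" or "page gone".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// All types below are hypothetical illustrations, not a real crawler or Lucene API.

/** One change seen by the robot: a fetched page, or a deletion when content is null. */
final class Delta {
    final String url;
    final String content; // null means "remove this URL from the index"

    Delta(String url, String content) {
        this.url = url;
        this.content = content;
    }

    boolean isDeletion() {
        return content == null;
    }
}

/** PUSH mode: the robot calls into the engine and modifies its index directly. */
interface IndexSink {
    void addOrUpdate(String url, String content);
    void delete(String url);
}

/** PULL mode: the engine asks the robot for changes after a sequence number. */
interface DeltaSource {
    List<Delta> deltasSince(int sequence);
    int currentSequence();
}

/** Toy robot that keeps an in-memory change log and supports both modes. */
final class ToyRobot implements DeltaSource {
    private final List<Delta> log = new ArrayList<>();

    void record(Delta delta) {
        log.add(delta);
    }

    @Override
    public List<Delta> deltasSince(int sequence) {
        // Clamp the start so an out-of-range sequence yields an empty or full list.
        int from = Math.max(0, Math.min(sequence, log.size()));
        return new ArrayList<>(log.subList(from, log.size()));
    }

    @Override
    public int currentSequence() {
        return log.size();
    }

    /** PUSH: replay the deltas from the given sequence number into the engine's sink. */
    void pushTo(IndexSink sink, int sequence) {
        for (Delta delta : deltasSince(sequence)) {
            if (delta.isDeletion()) {
                sink.delete(delta.url);
            } else {
                sink.addOrUpdate(delta.url, delta.content);
            }
        }
    }
}

/** Stand-in for a full-text index: a map from URL to page content. */
final class MapIndex implements IndexSink {
    private final Map<String, String> docs = new HashMap<>();

    @Override
    public void addOrUpdate(String url, String content) {
        docs.put(url, content);
    }

    @Override
    public void delete(String url) {
        docs.remove(url);
    }

    boolean contains(String url) {
        return docs.containsKey(url);
    }

    int size() {
        return docs.size();
    }
}
```

In this sketch an engine using PULL mode would remember the last value
of currentSequence() it processed and call deltasSince() with it, while
in PUSH mode the robot owns that bookkeeping and calls pushTo() itself.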