From: Ted Dunning
Date: Fri, 6 Nov 2009 11:57:27 -0800
Subject: Re: TU Berlin Winter of Code Project
To: mahout-user@lucene.apache.org

The question that I don't see addressed is whether you will use a fully
streaming approach, as is done in Bixo, or a document repository approach,
as is more common in search engines.

HBase is reputedly ready enough to serve as a document repository. Using
such an approach would be very helpful for the incremental nature of web
crawls.

What is the plan in this regard?

On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll wrote:

>
> This is obviously only a first draft of what we think would be a suited
> overall architecture

-- 
Ted Dunning, CTO
DeepDyve