Message-ID: <504A1223.7070901@eolya.fr>
Date: Fri, 07 Sep 2012 17:26:27 +0200
From: Dominique Bejean
Reply-To: dominique.bejean@eolya.fr
To: solr-user@lucene.apache.org
Subject: Re: Website (crawler for) indexing
In-Reply-To: <852DD8A9FDDF734C809AC3CFCFDE8695F22C567D@WSMV115.corp.vishayint.com>

Maybe you can take a look at Crawl-Anywhere, which has an administration web interface, a Solr indexer, and a search web application.

www.crawl-anywhere.com

Regards,
Dominique

On 05/09/12 17:05, Lochschmied, Alexander wrote:
> This may be a bit off topic: How do you index an existing website and control the data going into the index?
>
> We already have Java code to process the HTML (or XHTML) and turn it into a SolrJ Document (removing tags and other things we do not want in the index). We use SolrJ for indexing.
> So I guess the question is essentially which Java crawler could be useful.
>
> We used to use wget on the command line in our publishing process, but we no longer want to do that.
>
> Thanks,
> Alexander
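
For readers who want to see what the crawling step in the quoted question involves, here is a minimal sketch in plain Java. It is not how Crawl-Anywhere works and it is not a replacement for a real crawler: it only illustrates a breadth-first walk over same-host links, with tags stripped from each page. The class name `CrawlSketch`, its methods, and the `fetcher` parameter are all hypothetical names invented for this example. Fetching is passed in as a function so the sketch runs without network access, and the hand-off to SolrJ is left out on purpose.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; all names here are assumptions.
class CrawlSketch {
    // Simplification: only double-quoted href values; fragments (#...) are cut off.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"#]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Resolves links against the page URL and keeps only those on the same host. */
    static List<String> extractLinks(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI target = base.resolve(m.group(1).trim());
                if (base.getHost() != null
                        && base.getHost().equalsIgnoreCase(target.getHost())) {
                    links.add(target.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed href: skip it rather than abort the crawl.
            }
        }
        return links;
    }

    /** Crude tag removal; real pages need a proper HTML parser. */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    /** Breadth-first crawl; returns page text keyed by URL, in visit order. */
    static Map<String, String> crawl(String startUrl,
                                     Function<String, String> fetcher,
                                     int maxPages) {
        Map<String, String> textByUrl = new LinkedHashMap<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(startUrl);
        seen.add(startUrl);
        while (!frontier.isEmpty() && textByUrl.size() < maxPages) {
            String url = frontier.poll();
            String html = fetcher.apply(url); // null means "could not fetch"
            if (html == null) {
                continue;
            }
            textByUrl.put(url, stripTags(html));
            for (String link : extractLinks(url, html)) {
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return textByUrl;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a website; a real fetcher would issue HTTP requests.
        Map<String, String> pages = new HashMap<>();
        pages.put("http://example.com/",
                "<h1>Home</h1><a href=\"/a.html\">A</a>");
        pages.put("http://example.com/a.html",
                "<p>Page A</p><a href=\"/\">home</a>");
        crawl("http://example.com/", pages::get, 10)
                .forEach((url, text) -> System.out.println(url + " -> " + text));
    }
}
```

Each entry returned by `crawl` is the point where existing HTML-to-document code, such as the code described in the quoted message, would take over. A production crawl would also need robots.txt handling, politeness delays, and content-type checks, none of which are shown here.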