Date: Tue, 24 Jul 2012 11:41:16 -0400
Subject: Re: Crawler output transformation before indexing into Solr
From: Karl Wright
To: user@manifoldcf.apache.org

Solr Cell is what you want to use here. It's a Tika pipeline that you can
configure to modify the data as you need.

Karl

On Tue, Jul 24, 2012 at 11:35 AM, Arcadius Ahouansou wrote:
>
> Hello.
>
> I am currently using ManifoldCF 0.6 to crawl and index into Solr 4.
>
> I need to extract data such as locations from the documents into a separate
> field before I index into Solr.
>
> - Is there a way this can be done with ManifoldCF?
> - If not, is there an output connector that can store the content in a
> database? Then I could do the transformation on the DB before indexing.
>
> Thank you very much.
>
> Arcadius.