Subject: Re: Extracting Content from Web Crawler using the new PipeLine
From: Shinichiro Abe
Date: Thu, 23 Oct 2014 15:03:41 +0900
To: user@manifoldcf.apache.org

Hi Arcadius,

> - use Tika's BoilerPipe to get cleaner content from web sites?

Yes, the Tika extractor removes the HTML tags and sends the content and metadata to the downstream pipeline/output connection.

> - What about extracting specific HTML tags such as all h1 or h2 and map them to a Solr field?

No, currently it can map only the metadata extracted by Tika to Solr fields.
The Tika extractor doesn't capture h1, h2, p tags, etc., and doesn't treat them as metadata.
For now, to capture these tags and map them to fields, we have to use Solr's ExtractingRequestHandler (CAPTURE_ELEMENTS param).
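
As a rough sketch of the ExtractingRequestHandler route: CAPTURE_ELEMENTS corresponds to the "capture" request parameter, and "fmap.<element>" maps a captured element to a Solr field. The Solr URL, core name, document id, and field names below are assumptions made up for illustration, and how these arguments get passed from a ManifoldCF output connection is not shown.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust host, port, and core to your own setup.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/collection1/update/extract"


def build_extract_url(doc_id, tag_to_field):
    """Build an ExtractingRequestHandler URL that captures the given HTML tags.

    tag_to_field maps an element name (e.g. "h1") to a Solr field name.
    The field names are placeholders and must exist in your schema.
    """
    params = [("literal.id", doc_id)]
    for tag, field in tag_to_field.items():
        # "capture" asks Solr Cell to index the text of this element separately.
        params.append(("capture", tag))
        # "fmap.<tag>" renames the captured element to the target Solr field.
        params.append(("fmap." + tag, field))
    return SOLR_EXTRACT_URL + "?" + urlencode(params)


url = build_extract_url("doc1", {"h1": "heading1_txt", "h2": "heading2_txt"})
print(url)
```

The document itself would then be posted to this URL as the request body, so that Solr (not the Tika extractor in the pipeline) does the element capture.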

Regards,
Shinichiro Abe

On 2014/10/23, at 10:21, Arcadius Ahouansou wrote:

> Hello.
>
> Given that we now have pipelines in ManifoldCF, How feasible is it to:
>
> - use Tika's BoilerPipe to get cleaner content from web sites?
> - What about extracting specific HTML tags such as all h1 or h2 and map them to a Solr field?
>
> Thank you very much.
>
> Arcadius.