Date: Fri, 14 Sep 2012 20:04:07 +1100 (NCT)
From: "Markus Jelsma (JIRA)"
To: dev@lucene.apache.org
Message-ID: <1099103071.79569.1347613447925.JavaMail.jiratomcat@arcas>
In-Reply-To: <1415551249.49408.1347012067658.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (SOLR-3808) Extraction contrib to utilize Boilerpipe

    [ https://issues.apache.org/jira/browse/SOLR-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455680#comment-13455680 ]

Markus Jelsma commented on SOLR-3808:
-------------------------------------

Hi - in Apache Nutch I keep the loaded extractors in a static HashMap. The content handlers have to be wrapped in a BoilerpipeContentHandler, and the extractor implementation has to be passed to the BoilerpipeContentHandler constructor; it doesn't use configuration to find an extractor.
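
The caching idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Nutch or Solr code: the class name ExtractorCache and its get method are hypothetical, and the sketch assumes each extractor class has a public no-argument constructor. It uses only the JDK so it can be read and run without Tika or Boilerpipe on the classpath.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (an assumption for illustration, not taken from Nutch
 * or Solr): keeps one instance per extractor class name in a static map.
 */
final class ExtractorCache {
    // Static map shared by all callers, keyed by fully qualified class name.
    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    private ExtractorCache() {}

    /** Returns the cached instance for className, creating it on first use. */
    static <T> T get(String className, Class<T> type) {
        Object instance = CACHE.computeIfAbsent(className, ExtractorCache::instantiate);
        // Throws ClassCastException if the loaded class is not of the expected type.
        return type.cast(instance);
    }

    // Assumes the class has an accessible no-argument constructor.
    private static Object instantiate(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load extractor: " + className, e);
        }
    }
}
```

With Tika and Boilerpipe available, a caller would presumably look up an extractor by name, for example ExtractorCache.get("de.l3s.boilerpipe.extractors.ArticleExtractor", BoilerpipeExtractor.class), and pass the result together with the wrapped handler to the BoilerpipeContentHandler constructor. Which extractor name to use would have to come from the caller, since the handler does not read it from configuration.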

> Extraction contrib to utilize Boilerpipe
> ----------------------------------------
>
>                 Key: SOLR-3808
>                 URL: https://issues.apache.org/jira/browse/SOLR-3808
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Markus Jelsma
>            Priority: Minor
>         Attachments: SOLR-3808-trunk-1.patch
>
>
> Solr's extraction contrib uses Tika for document parsing and should be able to use Boilerpipe. Tika comes with Boilerpipe, a library capable of removing boilerplate text from HTML pages.