From: "Wunderlich, Tobias"
To: "solr-user@lucene.apache.org"
Subject: Creating custom Filter / Tokenizer / Request Handler for integration of NER-Framework
Date: Thu, 24 May 2012 12:36:59 +0000

Hey guys,

I am currently working on a project to integrate a named-entity recognition (NER) framework into an existing search platform based on Solr. The platform uses ManifoldCF (MCF) to automatically gather content from various repositories. The NER framework creates annotations/metadata from the given content, which I then want to integrate into the search platform as metadata to use for faceting. Since MCF handles all content gathering, I need a way to integrate the NER framework directly into Solr. The goal is to get all annotations per document into a multivalued field.

My first thought was to create a custom filter that takes the content and gives back only the annotations. But as I understand it, a filter only processes tokens that the tokenizer has already produced, which is useless for my purpose, since the NER framework needs to process the whole content of a document.

What about a custom tokenizer? Would it be possible to process the whole text and give back only the annotations as tokens?

A third thought was to modify the ExtractingRequestHandler (Solr Cell) used by MCF to somehow add the annotations as metadata when the content and metadata are distributed to the different fields.

I hope my problem description is sufficient. Does anybody have any thoughts on that subject?

Best regards,
Tobias
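
The core of the tokenizer idea in the message (hand the whole document text to the NER framework and keep only the annotations as values for a multivalued field) can be sketched in plain Java. This is only a rough illustration under stated assumptions, not Solr or Lucene code: `EntityRecognizer`, `CapitalizedWordRecognizer`, and `AnnotationTokenSketch` are hypothetical names, the toy recognizer merely stands in for the real NER framework (which the message does not name), and removing duplicate annotations is an assumption made here for faceting, not something the message asks for.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the real NER framework, which the message does not name. */
interface EntityRecognizer {
    /** Receives the complete document text and returns the annotations found in it. */
    List<String> annotate(String fullText);
}

/** Toy recognizer for this demo only: treats runs of capitalized words as entities. */
class CapitalizedWordRecognizer implements EntityRecognizer {
    private static final Pattern ENTITY =
            Pattern.compile("\\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\\b");

    @Override
    public List<String> annotate(String fullText) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(fullText);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}

class AnnotationTokenSketch {
    /**
     * Passes the whole text to the recognizer and returns the distinct
     * annotations in first-seen order, i.e. the values that would go into
     * the multivalued field. De-duplication is an assumption of this sketch.
     */
    static List<String> annotationValues(String fullText, EntityRecognizer recognizer) {
        Set<String> distinct = new LinkedHashSet<>(recognizer.annotate(fullText));
        return new ArrayList<>(distinct);
    }

    public static void main(String[] args) {
        String content = "Alice met Bob in Berlin. Alice left.";
        // prints [Alice, Bob, Berlin]
        System.out.println(annotationValues(content, new CapitalizedWordRecognizer()));
    }
}
```

The point of the sketch is that the recognizer sees the full text in one call rather than one token at a time, which is the limitation of a filter described in the message. Where such a step would be hooked into Solr is left open here.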