Date: Wed, 23 Nov 2011 17:19:31 -0800 (PST)
From: Chris Hostetter
To: general@lucene.apache.org
Subject: Re: Populating a custom Solr field with text extracted from document
In-Reply-To: <1321508548770-3514857.post@n3.nabble.com>

: I am a new Solr user, and would like to create a new custom field that is
: then populated with text extracted from each document when I crawl my file
: system.

What are you using to do the crawling?

Typically people feed Solr structured data -- there are some things in Solr
(like the ExtractingRequestHandler) that help you pull structure out of
unstructured or semi-structured files, and there are things like DIH
(DataImportHandler) that can help you pull data from structured (or
semi-structured) sources, but those aren't end-all-be-all solutions to all
problems -- they aim to meet the 80/20 rule of simple common cases.

If you have special requirements about parsing special files...

: text text text... Received : 04 Jan 2002 17:31:40 ...text text text

...you'll need to write your own special code for parsing those files to
extract the structure you want.

Where/how you use your custom code depends on your use case -- maybe you
write a custom extractor for Tika and then use ExtractingRequestHandler,
maybe you write a custom EntityProcessor and then use DataImportHandler, or
maybe you just parse the files in the client language of your choice and
POST the result to Solr over HTTP ... it all depends on your use case and
what you are comfortable with.

BTW: Since you definitely seem to be interested in using Solr, you should
consider sending subsequent questions to the solr-user@lucene mailing list
(general@lucene is generally for discussions about the overall Lucene
project, and/or questions when people really have no idea what they want to
use).

-Hoss
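
For illustration only, here is a rough sketch of the last option mentioned above (parse on the client, then POST to Solr over HTTP), written in Python. Everything specific in it is an assumption rather than something from this thread: the field names `id`, `text` and `received_dt` are hypothetical and would have to exist in your schema, the timestamp is treated as UTC because the sample line carries no time zone, and the update URL and JSON update format depend on your Solr version and setup.

```python
import json
import re
from datetime import datetime
from urllib import request

# Matches the pattern from the question, e.g. "Received : 04 Jan 2002 17:31:40".
RECEIVED_RE = re.compile(
    r"Received\s*:\s*(\d{1,2} [A-Za-z]{3} \d{4} \d{2}:\d{2}:\d{2})"
)


def extract_received(text):
    """Return the first 'Received' timestamp as a Solr-style date string, or None.

    Assumption: the timestamp is UTC (the sample line has no time zone).
    Note: %b (month abbreviation) parsing depends on the process locale.
    """
    match = RECEIVED_RE.search(text)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%d %b %Y %H:%M:%S")
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def build_doc(doc_id, text):
    """Build one document dict; 'received_dt' is a hypothetical custom field."""
    doc = {"id": doc_id, "text": text}
    received = extract_received(text)
    if received is not None:
        doc["received_dt"] = received
    return doc


def post_to_solr(docs, update_url):
    """POST a list of documents as JSON to the given (assumed) update URL."""
    body = json.dumps(docs).encode("utf-8")
    req = request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status
```

A crawler would call `build_doc()` once per file and pass the batch to `post_to_solr()` with whatever update URL your installation exposes. The same regex-and-reformat logic could equally live inside a custom Tika parser or a DIH EntityProcessor if one of those routes suits you better.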