From: Apache Wiki
To: solr-commits@lucene.apache.org
Reply-To: solr-dev@lucene.apache.org
Date: Mon, 31 Mar 2008 09:37:22 -0000
Message-ID: <20080331093722.15476.43137@eos.apache.org>
Subject: [Solr Wiki] Update of "DataImportHandler" by NoblePaul

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by NoblePaul:
http://wiki.apache.org/solr/DataImportHandler

------------------------------------------------------------------------------
  }}}
- This data-config is the interesting part. If you read the structure of the Slashdot RSS, it has a few header elements such as title, link and subject. Those are mapped to the SOLR fields source, source-link and subject respectively using xpath syntax. The feed also has multiple ''item'' elements which contain the actual news items. So what we wish to do is create a document in SOLR for each 'item'.
+ This data-config is where the action is. If you read the structure of the Slashdot RSS, it has a few header elements such as title, link and subject. Those are mapped to the SOLR fields source, source-link and subject respectively using xpath syntax. The feed also has multiple ''item'' elements which contain the actual news items. So what we wish to do is create a document in SOLR for each 'item'.
  The X!PathEntityProcessor is designed to stream the xml, row by row (think of a row as the set of fields in an xml element). It uses the ''forEach'' attribute to identify a 'row'. In this example forEach has the value `'/RDF/channel | /RDF/item'`. This says that this xml has two types of rows (this uses the xpath syntax for OR, and there can be more than one type of row). After it encounters a row, it tries to read as many fields as there are in the field declarations. So in this case, when it reads the row `'/RDF/channel'` it may get 3 fields: 'source', 'source-link' and 'source-subject'. After it processes the row it realizes that it does not have any value for the 'pk' field, so it does not try to create a SOLR document for this row (even if it tried, the attempt might fail in SOLR). But all these 3 fields are marked as `commonField="true"`, so it keeps the values handy for subsequent rows.
@@ -382, +382 @@
  What about this ''transformer=!DateFormatTransformer'' attribute in the entity?
This is an inbuilt utility transformer that helps the user parse date strings in a custom format into 'Date' objects. Note the field `` . The transformer only applies to a field which has the attribute 'dateTimeFormat', and it uses the syntax of java's [http://java.sun.com/j2se/1.4.2/docs/api/java/text/SimpleDateFormat.html SimpleDateFormat].
- You can use this feature for indexing from REST APIs such as rss/atom feeds, XML data feeds, other SOLR servers or even well formed xhtml documents. Our XPath support has its limitations, but we have tried to make sure that common use-cases are covered, and since it's based on a streaming parser, it is extremely fast and consumes a constant amount of memory even for large XMLs. Easy, isn't it? And you didn't need to write one line of code! Enjoy :)
+ You can use this feature for indexing from REST APIs such as rss/atom feeds, XML data feeds, other SOLR servers or even well formed xhtml documents. Our XPath support has its limitations, but we have tried to make sure that common use-cases are covered, and since it's based on a streaming parser, it is extremely fast and consumes a constant amount of memory even for large XMLs. It does not support namespaces, but it can handle xmls with namespaces. When you provide the xpath, just drop the namespace and give the rest (e.g. if the tag is `''` the mapping should just contain `'subject'`). Easy, isn't it? And you didn't need to write one line of code! Enjoy :)
  = Extending the tool with APIs =
  The examples we explored are, admittedly, trivial. It is not possible to meet all user needs with an xml configuration alone. So we expose a few interfaces which can be implemented by the user to enhance the functionality.
@@ -418, +418 @@
  So there is no compile-time dependency on the !DataImportHandler API
- The configuration has a 'flexible' schema. It lets the user provide arbitrary attributes in an 'entity' tag and 'field' tags.
The tool reads the data and hands it over to the implementation class as it is. If the 'Transformer' needs extra information to be provided on a per entity/field basis it can do so. The values can be obtained from the Context.
+ The configuration has a 'flexible' schema. It lets the user provide arbitrary attributes in an 'entity' tag and 'field' tags. The tool reads the data and hands it over to the implementation class as it is. If the 'Transformer' needs extra information to be provided on a per entity/field basis it can get it from the context.
+
+ === RegexTransformer ===
  There is an inbuilt transformer called '!RegexTransformer' provided with the tool itself. It helps in extracting values from fields (from the db) using regular expressions. The actual class name is `org.apache.solr.handler.dataimport.RegexTransformer`. But as it belongs to the default package, the package name can be omitted
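
The kind of extraction described above can be sketched in standalone Java. This is only an illustration of pulling a value out of a field with a regular expression; it is not the source of `RegexTransformer` and does not use the !DataImportHandler API. The class name `RegexExtractSketch`, the method `extract`, the column names and the sample row are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the actual RegexTransformer implementation. */
public class RegexExtractSketch {

    /**
     * Returns a copy of the row in which targetCol holds the first capture
     * group of regex, matched against the value of sourceCol. If the source
     * column is missing or the pattern does not match, the copy is unchanged.
     */
    static Map<String, Object> extract(Map<String, Object> row,
                                       String sourceCol,
                                       String targetCol,
                                       String regex) {
        Map<String, Object> out = new HashMap<>(row);
        Object value = row.get(sourceCol);
        if (value != null) {
            Matcher m = Pattern.compile(regex).matcher(value.toString());
            // Only copy a value when the pattern matched and defines a group.
            if (m.find() && m.groupCount() >= 1) {
                out.put(targetCol, m.group(1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample row, as it might arrive from a db query.
        Map<String, Object> row = new HashMap<>();
        row.put("full_name", "Mr John Smith");

        Map<String, Object> result =
                extract(row, "full_name", "firstName", "Mr (\\w+)");
        System.out.println(result.get("firstName")); // prints "John"
    }
}
```

The sketch returns a new map instead of modifying the input row, which keeps the example easy to reason about; whether the real transformer mutates the row in place is not stated here.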