lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DataImportHandler" by NoblePaul
Date Mon, 31 Mar 2008 07:07:21 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by NoblePaul:
http://wiki.apache.org/solr/DataImportHandler

------------------------------------------------------------------------------
  </dataConfig>
  }}}
  
- This data-config is the interesting part. If you read the structure of the Slashdot RSS,
it has a few header elements such as title, link and subject. Those are mapped to the SOLR
fields source, source-link and subject respectively using xpath syntax. The feed also has
multiple ''item'' elements which contain the actual news items.
+ This data-config is the interesting part. If you read the structure of the Slashdot RSS,
it has a few header elements such as title, link and subject. Those are mapped to the SOLR
fields source, source-link and subject respectively using xpath syntax. The feed also has
multiple ''item'' elements which contain the actual news items. So what we wish to do is
create a document in SOLR for each 'item'.
  
- The ''forEach'' attribute in the slashdot ''entity'' contains xpath which tells DataImportHandler
"What are the records that need to be converted into SOLR documents?". As you can see in the
data-config, the forEach="/RDF/channel | /RDF/item" specifies two kinds of records separated
by '|' (OR in standard xpath lexicon). The first one says "Create a SOLR document for each
''channel'' element". The second one says "Create a SOLR document for each ''item'' element".
+ The X!PathEntityProcessor is designed to stream the xml row by row (think of a row as the
set of fields in an xml element). It uses the ''forEach'' attribute to identify a 'row'. In this
example forEach has the value `'/RDF/channel | /RDF/item'`. This says that the xml has two
types of rows (this uses the xpath syntax for OR, and there can be more than one type of row).
After it encounters a row, it tries to read as many fields as there are in the field declarations.
So in this case, when it reads the row `'/RDF/channel'` it may get 3 fields: 'source', 'source-link'
and 'source-subject'. After it processes the row it finds that it has no value
for the 'pk' field, so it does not try to create a SOLR document for this row (even if it tried,
the add might fail in SOLR). But all 3 of these fields are marked as `commonField="true"`, so it keeps
their values handy for subsequent rows.
  
- But ofcourse, it doesn't make sense to create a SOLR document containing only the header
elements, right? That's what we thought too, therefore we have the ''pk'' attribute in the
slashdot ''entity''. The ''pk=link'' says to DataImportHandler that only if the ''link'' field
is present in the record, then only create a SOLR document for that record. Otherwise, just
move on to the next one. The Slashdot RSS feed has only one ''/RDF/channel'' element present,
therefore there is only record containing the source, source-link and subject fields. Since
this record does not contain the ''link'' field (our pk), no SOLR document is created for
this record and the !EntityProcessor just moves on.
+ It moves ahead, encounters `/RDF/item` and processes the rows one by one. It gets the
values for all the fields except for the 3 fields in the header. But as they were marked as
common fields, the processor puts those fields into the record just before creating the document.
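
The row handling described above can be sketched in plain Java. This is only an illustration of the behaviour, not the DataImportHandler source: the class name `CommonFieldSketch`, the method `toDocuments` and the sample values are hypothetical, and rows are modelled as simple string maps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not part of DataImportHandler.
public class CommonFieldSketch {

    /**
     * Turns streamed rows into documents. A row without a value for the pk
     * column produces no document, but any common-field values it carries
     * are remembered and merged into the documents created afterwards.
     */
    static List<Map<String, String>> toDocuments(List<Map<String, String>> rows,
                                                 String pk,
                                                 Set<String> commonFields) {
        Map<String, String> remembered = new LinkedHashMap<>();
        List<Map<String, String>> docs = new ArrayList<>();
        for (Map<String, String> row : rows) {
            // Keep the latest value seen for every common field.
            for (String name : commonFields) {
                if (row.containsKey(name)) {
                    remembered.put(name, row.get(name));
                }
            }
            // No pk value: skip document creation for this row.
            if (!row.containsKey(pk)) {
                continue;
            }
            // Start from the remembered common fields, then add the row's own.
            Map<String, String> doc = new LinkedHashMap<>(remembered);
            doc.putAll(row);
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        // Sample values are made up for the example.
        Map<String, String> channel = new LinkedHashMap<>();
        channel.put("source", "Slashdot");
        channel.put("source-link", "http://slashdot.org/");

        Map<String, String> item = new LinkedHashMap<>();
        item.put("link", "http://example.org/story/1");
        item.put("title", "A news item");

        Set<String> common = new HashSet<>(Arrays.asList("source", "source-link"));
        List<Map<String, String>> docs =
                toDocuments(Arrays.asList(channel, item), "link", common);
        System.out.println(docs);
    }
}
```

With the sample rows, only the 'item' row yields a document, and that document also carries the two header values taken from the 'channel' row.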
  
- But, we did want to store those header fields, right? Yes, we can do that by adding ''commonField=true''
attribute to the header fields (source, source-link and subject). The ''commonField=true''
says that "store the values for these fields and add them to each SOLR document created".
Therefore, when the processor comes to records of ''/RDF/item'' elements which contain our
pk, it creates a SOLR document for them and adds the header fields to each such document.
+ What about this ''transformer=!DateFormatTransformer'' attribute in the entity? This is
a built-in utility transformer that helps the user parse date strings in a custom format into 'Date'
objects. Note the field `<field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"
/>`. The transformer applies only to fields which have the 'dateTimeFormat' attribute,
and it uses the syntax of [http://java.sun.com/j2se/1.4.2/docs/api/java/text/SimpleDateFormat.html
Java's !SimpleDateFormat].
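
To see what such a pattern does, here is a minimal sketch that parses a timestamp directly with `java.text.SimpleDateFormat`. It does not show the transformer's own code; the class name `DateFormatSketch`, the sample timestamp and the choice of UTC are assumptions made for the example. Note that in !SimpleDateFormat `hh` means the 12-hour clock and `HH` the 24-hour clock; the sketch uses `HH` on the assumption that the feed's timestamps are 24-hour.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical illustration; not part of DataImportHandler.
public class DateFormatSketch {

    /** Parses a date string with the given SimpleDateFormat pattern. */
    static Date parse(String pattern, String value) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        // Assumption for the example: interpret the timestamp as UTC.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + value, e);
        }
    }

    public static void main(String[] args) {
        // 'T' is quoted in the pattern because it is a literal character.
        Date date = parse("yyyy-MM-dd'T'HH:mm:ss", "2008-03-31T07:07:21");
        System.out.println(date.getTime()); // milliseconds since the epoch
    }
}
```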
  
- What about this ''transformer=!DateFormatTransformer'' attribute in the entity? Date representation
is always a problem when getting data. Each data source decides to use it's own format for
representing dates but you need to parse it and convert it into a java.util.Date object for
SOLR to index into a date field. Therefore, we supply a transformer called !DateFormatTransformer
which needs you to supply the input format for the date string and we'll do the rest. It uses
java.text.!SimpleDateFormat class internally, so the syntax for dateTimeFormat attribute is
the same as you'd write if you were using !SimpleDateFormat class.
  
+ You can use this feature for indexing from REST APIs such as RSS/Atom feeds, XML data feeds,
other SOLR servers or even well-formed XHTML documents. Our XPath support has its limitations,
but we have tried to make sure that common use-cases are covered, and since it's based on a
streaming parser, it is extremely fast and consumes a constant amount of memory even for large
XML files. Easy, isn't it? And you didn't need to write one line of code! Enjoy :)
- You can use this feature for indexing from REST API's such as RSS/Atom feeds, other SOLR
servers, XML data feeds or Last.FM user profiles! The possibilities are endless. Our XPath
support has its limitations but we have tried to make sure that common use-cases are covered
and since it's based on a streaming parser, it is extremely fast and consumes constant amount
of memory even for large XMLs. Easy, isn't it? And you didn't need to write one line of code!
Enjoy :)
- 
  = Extending the tool with APIs =
  The examples we explored are, admittedly, trivial. It is not possible to have all user needs
met by an xml configuration alone. So we expose a few interfaces which can be implemented
by the user to enhance the functionality.
  
