lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: How to Index Custom XML structure
Date Tue, 28 Feb 2012 02:56:19 GMT
You might be able to do something with the XSL Transformer step in DIH.

It might also be easier to just write a SolrJ program to parse the XML and
construct a SolrInputDocument to send to Solr. It's really pretty
straightforward.
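
A minimal sketch of that parse-and-build idea, written in Python with only the
standard library rather than SolrJ. It reads a trimmed copy of the RECORD
sample from the question and emits Solr's <add><doc> update XML. The function
name record_to_solr_add and the FIELD_MAP target names (id, abstract, subject)
are assumptions for illustration; the real field names have to match whatever
the Solr schema defines.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the custom record format shown in the question.
RECORD_XML = """<?xml version="1.0" encoding="UTF-8"?>
<RECORD>
<doc_id>1</doc_id>
<abstract>This is an abstract.</abstract>
<subject>Text Subject</subject>
<availability />
</RECORD>"""

# Source element -> Solr field name. Hypothetical mapping: adjust the
# right-hand names to the fields actually declared in the schema.
FIELD_MAP = {"doc_id": "id", "abstract": "abstract", "subject": "subject"}


def record_to_solr_add(record_xml: str) -> str:
    """Convert one <RECORD> document into a Solr <add><doc> message."""
    # Parse from bytes so the encoding declaration in the prolog is honored.
    root = ET.fromstring(record_xml.encode("utf-8"))

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    for source_tag, solr_field in FIELD_MAP.items():
        # findall handles repeated elements (multi-valued fields).
        for elem in root.findall(source_tag):
            text = (elem.text or "").strip()
            if text:  # skip empty elements such as <availability />
                field = ET.SubElement(doc, "field", name=solr_field)
                field.text = text

    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(record_to_solr_add(RECORD_XML))
```

The resulting string can then be posted to Solr's XML update handler (typically
the /update path of the core, assuming a default setup), followed by a commit.
A SolrJ version would follow the same shape: parse the record, copy the wanted
elements into a SolrInputDocument, and send it.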

Best
Erick

On Sun, Feb 26, 2012 at 11:31 PM, Anupam Bhattacharya
<anupamb82@gmail.com> wrote:
> Hi,
>
> I am using ManifoldCF to crawl data from a Documentum repository. I am able
> to successfully read the metadata/properties for the defined document types
> in Documentum using the out-of-the-box Documentum Connector in ManifoldCF.
> Unfortunately, there is also one XML file with a custom XML structure. I
> need to read its element values and add them for indexing in Lucene
> through Solr.
>
> Is there any mechanism to index an arbitrary XML document structure in Solr?
>
> I checked the Solr Cell framework, which supports the structure below:
>
> <add>
>  <doc>
>    <field name="id">9885A004</field>
>    <field name="name">Canon PowerShot SD500</field>
>    <field name="category">camera</field>
>    <field name="features">3x optical zoom</field>
>    <field name="features">aluminum case</field>
>    <field name="weight">6.4</field>
>    <field name="price">329.95</field>
>  </doc>
>  <doc>
>    <field name="id">9885A003</field>
>    <field name="name">Canon PowerShot SD504</field>
>    <field name="category">camera1</field>
>    <field name="features">3x optical zoom1</field>
>    <field name="features">aluminum case1</field>
>    <field name="weight">6.41</field>
>    <field name="price">329.956</field>
>  </doc>
> </add>
>
> My custom XML structure has the following format, from which I need to
> read the *subject* and *abstract* fields for indexing. I checked the Tika
> project but couldn't find anything useful.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <RECORD>
> <doc_id>1</doc_id>
> <abstract>This is an abstract.</abstract>
> <subject>Text Subject</subject>
> <availability />
> <indexing>
> <index_group></index_group>
> <keyterms></keyterms>
> <keyterms></keyterms>
> </indexing>
> <publication_date></publication_date>
> <physical_storage />
> <log_entry />
> <legal_category />
> <legal_category_notes />
> <citation_only></citation_only>
> <citation_only_desc />
> <export_control />
> <export_control_desc />
> </RECORD>
>
> Appreciate any help on this.
>
> Regards
> Anupam
