lucene-solr-user mailing list archives

From Walter Underwood <wunderw...@netflix.com>
Subject Re: Indexing XML
Date Fri, 05 Oct 2007 14:22:17 GMT
Solr is not an XML engine (or a MARC engine). It uses XML as an input format
for fielded data. It does not index or search arbitrary XML. You need to
convert your XML into Solr's format.

I would recommend expressing MARC in a Solr schema, then working on the
input XML. The input XML depends on the schema.
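
A minimal sketch of that conversion in Python, assuming the field names
title, author, and originalRecord from the schema quoted below. The helper
names (subfield, marc_to_solr_doc) are hypothetical, and stripping the
trailing " /" from the title is an assumption about the desired output. The
full record is passed as escaped text, not as nested markup, so the stored
field holds it as a plain string.

```python
import xml.etree.ElementTree as ET

# MARCXML namespace, as declared on the <record> element in the question.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Shortened copy of the record from the question, used only for illustration.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield ind1="1" ind2=" " tag="100">
    <subfield code="a">van Wetten, J. W.</subfield>
  </datafield>
  <datafield ind1="1" ind2="3" tag="245">
    <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
  </datafield>
</record>"""


def subfield(record, tag, code):
    """Return the text of the first matching MARC subfield, or None."""
    node = record.find(
        f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']", MARC_NS
    )
    return node.text.strip() if node is not None and node.text else None


def marc_to_solr_doc(marc_xml):
    """Build a Solr <add><doc> message from one MARCXML record string."""
    record = ET.fromstring(marc_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def add_field(name, value):
        # Skip fields that are missing from the record.
        if value is not None:
            ET.SubElement(doc, "field", name=name).text = value

    title = subfield(record, "245", "a")
    if title:
        # Assumption: drop the trailing " /" punctuation from the title.
        title = title.rstrip(" /")
    add_field("title", title)
    add_field("author", subfield(record, "100", "a"))
    # Assigning the raw XML as element text makes the serializer escape
    # it (&lt;record ...), so Solr receives an ordinary string value.
    add_field("originalRecord", marc_xml)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(marc_to_solr_doc(SAMPLE))
```

The printed string is the kind of message that would be posted to Solr's XML
update handler. The presentation layer can parse the stored originalRecord
value back into MARCXML when it needs the full record.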

If you need an XML engine, I'd recommend MarkLogic (commercial), a very
good product.

wunder

On 10/5/07 12:44 AM, "PAUWELS  Benoit" <Benoit.Pauwels@ulb.ac.be> wrote:

> Hi,
> 
> I wish to index well-formed XML documents as they are.
> 
> I have a database filled with MARCXML records. An example of these looks like
> this:
> 
>         <record
>             ns0:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
>             xmlns="http://www.loc.gov/MARC21/slim"
>             xmlns:ns0="http://www.w3.org/2001/XMLSchema-instance">
>             <leader>00000nam  22      a 4500</leader>
>             <controlfield tag="001">000500000</controlfield>
>             <controlfield tag="005">20050826220257.0</controlfield>
>             <controlfield tag="008">000710s1998    xx      r     000 0 dut d</controlfield>
>             <datafield ind1=" " ind2=" " tag="040">
>                 <subfield code="a">Univ</subfield>
>             </datafield>
>             <datafield ind1="1" ind2=" " tag="100">
>                 <subfield code="a">van Wetten, J. W.</subfield>
>             </datafield>
>             <datafield ind1="1" ind2="3" tag="245">
>                 <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
>                 <subfield code="c">J.W. van Wetten, N. Dijkhof, F. Heide.</subfield>
>             </datafield>
>         </record>
> 
> The idea is to create Lucene indexes on specific MARC fields and store the
> complete MARC record in Lucene 'as is'. In the presentation layer of my
> application I would then have this complete MARC record at hand, and as such
> have full flexibility on which MARC fields to display. So I want to create the
> following record through XSLT and feed this to SOLR.
> 
> <doc>
> <field name="title">De positie van vrouwen in de asielprocedure</field>
> <field name="author">van Wetten, J. W.</field>
> ...
> <field name="originalRecord">
>   <record
>             ns0:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
>             xmlns="http://www.loc.gov/MARC21/slim"
>             xmlns:ns0="http://www.w3.org/2001/XMLSchema-instance">
>             <leader>00000nam  22      a 4500</leader>
>             <controlfield tag="001">000500000</controlfield>
>             <controlfield tag="005">20050826220257.0</controlfield>
>             <controlfield tag="008">000710s1998    xx      r     000 0 dut d</controlfield>
>             <datafield ind1=" " ind2=" " tag="040">
>                 <subfield code="a">UGent</subfield>
>             </datafield>
>             <datafield ind1="1" ind2=" " tag="100">
>                 <subfield code="a">van Wetten, J. W.</subfield>
>             </datafield>
>             <datafield ind1="1" ind2="3" tag="245">
>                 <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
>                 <subfield code="c">J.W. van Wetten, N. Dijkhof, F. Heide.</subfield>
>             </datafield>
>         </record>
> </field>
> </doc>
> 
> I have the following in my schema.xml:
> 
> <field name="author" type="text" indexed="true" stored="true" termVectors="true"/>
> <field name="title" type="text" indexed="true" stored="true" termVectors="true"/>
> <field name="originalRecord" type="text" indexed="false" stored="true"/>
> 
> Solr, of course, has a problem with the raw XML inside the 'originalRecord' field.
> 
> Is there a solution to this? Has anyone done this before?
> 
> Thanks a lot.
> 
> Benoit.
> 
> =============================
> PAUWELS Benoit
> Université Libre de Bruxelles - Libraries
> Head of Automation
> Av. F.D. Roosevelt 50, CP 180
> 1050 BRUSSELS
> Belgium
> Tel: + 32 2 650 23 91
> Fax: + 32 2 650 23 91
> =============================

