lucene-solr-user mailing list archives

From Shalin Shekhar Mangar <shalinman...@gmail.com>
Subject Re: Question on modifying solr behavior on indexing xml files..
Date Fri, 02 Oct 2009 12:49:02 GMT
On Thu, Oct 1, 2009 at 3:10 PM, Thung, Peter C CIV SPAWARSYSCEN-PACIFIC,
56340 <peter.thung@navy.mil> wrote:

> 1.  In my playing around with
> sending in an XML document within an XML CDATA tag,
> with termVectors="true"
>
> I noticed the following behavior:
> <person>peter</person>
> collapses to the term
> personpeterperson
> instead of
> person
> and
> peter separately.
>
> I realize I could try and do a search and replace of characters like
> <>"=  to a space so that the default parser/indexer can preserve element
> names.
> However, I'm wondering if someone could point me to where one might do
> this within
> the solr or apache lucene code as a proper plug in with maybe an example
> that I could use
> as a template.  Also where in the solrconfig.xml file I would want to
> change to reference the new parser.
>
>
Solr is agnostic of the content in a schema field. It does not know that it
is XML and hence it will do blind tokenization/filtering as defined for the
field type in schema.xml (not solrconfig.xml).

If all you want is a full-text search on words found somewhere in that
XML, then your approach of replacing <>"= with a space will work fine. You can
use the PatternReplaceFilter and specify a regex which matches these special
characters and replaces them with a space. Note that it is a token filter, so
it rewrites tokens after the tokenizer has run and does not split them into
new tokens.

<filter class="solr.PatternReplaceFilterFactory" pattern="([&lt;&gt;=&quot;])"
replacement=" " replace="all"/>

Or you can use the MappingCharFilter (a Solr 1.4 feature) and specify a
mapping file which has these special characters mapped to a space.

<charFilter class="solr.MappingCharFilterFactory"
mapping="special-xml-symbols.txt"/>

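The file should contain one mapping per line, with both sides in double
quotes:
"characterToBeReplaced" => "replacementChar"
For example, the line "<" => " " maps the opening angle bracket to a space.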
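
As for where this goes: analysis is configured per field type in schema.xml.
Below is a minimal sketch of such a field type, not a tested configuration.
The names text_xml and raw_xml are made up for illustration, and the choice of
tokenizer and lowercase filter is an assumption. The point it shows is that a
char filter runs before the tokenizer, so the spaces it inserts become token
boundaries. With a whitespace tokenizer you would probably also want to map
"/" to a space, otherwise a closing tag yields the token /person.

```xml
<!-- schema.xml fragment. "text_xml" and "raw_xml" are hypothetical names. -->
<fieldType name="text_xml" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Runs on the raw character stream, before tokenization -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="special-xml-symbols.txt"/>
    <!-- Splits on the spaces the char filter has just introduced -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Optional: lowercase so PERSON and person count as the same term -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Field that holds the raw XML text, with term vectors enabled -->
<field name="raw_xml" type="text_xml" indexed="true" stored="true"
       termVectors="true"/>
```

With something like this in place, <person>peter</person> should come out as
separate person and peter terms rather than one collapsed term.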

However, if you want to preserve the structure of the XML document, it is
best to parse it yourself and put the contents into separate Solr fields
before sending it to Solr. You may also want to look at DataImportHandler and
XPathEntityProcessor, which are commonly used for importing XML files.

http://wiki.apache.org/solr/DataImportHandler
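
To give an idea of what that looks like, here is a rough sketch of a
DataImportHandler config using XPathEntityProcessor. It assumes an input file
shaped like <people><person><name>peter</name></person></people>. The file
path, the element names and the Solr field name are all placeholders for
illustration, so adjust them to your own documents and schema.

```xml
<!-- data-config.xml (sketch). Path, XPaths and field names are hypothetical. -->
<dataConfig>
  <!-- Reads the XML from the local file system -->
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- forEach selects the element that becomes one Solr document -->
    <entity name="person"
            processor="XPathEntityProcessor"
            url="/path/to/people.xml"
            forEach="/people/person">
      <!-- Each field copies the text of one element into a Solr field -->
      <field column="name" xpath="/people/person/name"/>
    </entity>
  </document>
</dataConfig>

<!-- solrconfig.xml: register the handler and point it at the config above -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

Each element you care about then ends up in its own field, so the structure
is kept in the index instead of being flattened into a single text blob.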


> 2.  My other question would also be if this technique would work for XML
> type messages embedded
> in Microsoft Excel, or Powerpoint presentations where I would like to
> preserve knowing xml element term frequencies
> where I would try and leverage the component that automatically indexes
> microsoft documents.
> Would I need to modify that component and customize it?
>
>
Perhaps somebody who knows about Solr Cell can answer this but I think it
should work.

-- 
Regards,
Shalin Shekhar Mangar.
