lucene-dev mailing list archives

From "Mike Sokolov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-1758) schema definition for configuration files
Date Tue, 24 May 2011 01:50:47 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038355#comment-13038355
] 

Mike Sokolov commented on SOLR-1758:
------------------------------------

This was originally reported in the context of DIH, but as the OP said, it applies equally
well to all configuration.

The attached patch (config-validation-20110523.patch) changes Config so that it validates every
XML configuration file loaded there.  The patch includes a schema with rules for <config/>,
<schema/>, <solr/>, <elevate/> and <root/> (used in tests).  It could be extended to cover
other files as well.  With the change, Config looks in solr.home for a file called config.xsd.
If found, it is loaded and used to validate whatever configuration file is being loaded.
If a validation error occurs, an exception is raised (and logged? this seemed to be the way
it was done before, although it seemed odd to me - I'd have thought exception logging would
be handled at an outermost layer).
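
As a rough illustration of the lookup-and-validate step described above (this is not the
code from the patch), the behaviour could be sketched with the JDK's javax.xml.validation
API.  The class name ConfigValidationSketch and the method validateIfSchemaPresent are
hypothetical; only the file name config.xsd and the "validate if found" rule come from the
description.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Solr: shows one way to validate a
// configuration file against an optional config.xsd found in solr.home.
class ConfigValidationSketch {

    /**
     * Validates configFile against solrHome/config.xsd when that schema exists.
     * Returns true if validation ran, false if no schema was found.
     * Throws SAXException on the first validation error (the default
     * behaviour of Validator when no ErrorHandler is installed).
     */
    static boolean validateIfSchemaPresent(File solrHome, File configFile)
            throws SAXException, IOException {
        File xsd = new File(solrHome, "config.xsd");
        if (!xsd.isFile()) {
            return false; // no schema present: configuration is loaded unvalidated
        }
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(xsd);
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(configFile));
        return true;
    }
}
```

In this sketch the exception simply propagates to the caller, which leaves the decision
about logging to whatever outer layer catches it.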

Solr's XML usage seems to be very flexible in practice, so the schema attempts to
allow a fair amount of flexibility: for elements marked as "plugins" in the Wiki documentation,
I've allowed pretty much arbitrary child content. The wildcards in the schema are "lax", which
means that they allow any element, even unknown elements, but when known elements are found,
they are validated against the model in the schema (e.g. <str> is not allowed to have
any child elements).
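
To make the "lax" behaviour concrete, here is a minimal sketch using the JDK validator.
The tiny schema embedded below is an assumption written for illustration (a <plugin>
element with a lax wildcard and a globally declared <str>); it is not taken from the
patch's config.xsd, and the class name LaxWildcardSketch is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class: a lax wildcard accepts unknown elements but still
// validates elements that have a global declaration in the schema.
class LaxWildcardSketch {

    // Illustrative schema (an assumption, not the patch's config.xsd).
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      // <str> has simple (text-only) content plus a name attribute.
      + "<xs:element name='str'><xs:complexType><xs:simpleContent>"
      + "<xs:extension base='xs:string'>"
      + "<xs:attribute name='name' type='xs:string'/>"
      + "</xs:extension></xs:simpleContent></xs:complexType></xs:element>"
      // <plugin> allows any children and attributes, processed laxly.
      + "<xs:element name='plugin'><xs:complexType><xs:sequence>"
      + "<xs:any processContents='lax' minOccurs='0' maxOccurs='unbounded'/>"
      + "</xs:sequence><xs:anyAttribute processContents='lax'/>"
      + "</xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true if xml validates against XSD, false on a validation error. */
    static boolean isValid(String xml) throws IOException {
        try {
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Unknown element inside the wildcard: accepted.
        System.out.println(isValid("<plugin><whatever x='1'><deep/></whatever></plugin>"));
        // Known element used correctly: accepted.
        System.out.println(isValid("<plugin><str name='a'>v</str></plugin>"));
        // Known element with a child element: rejected.
        System.out.println(isValid("<plugin><str name='a'><child/></str></plugin>"));
    }
}
```

With processContents="strict" the first document would be rejected as well, and with
"skip" the third would be accepted; "lax" sits between the two.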

All the Solr tests but one pass with the patch, which means that the configuration in the
solr example, as well as the various test configurations in solr/src/test-files/solr/conf,
are all valid according to the schema.  The exception is one solrconfig.xml with a
luceneMatchVersion=4.0; I think this should be LUCENE_40?  The patch also includes one new test
of an invalid schema; it probably should have a few more.

However, my knowledge of Solr configuration options is far from encyclopedic - I spent a while
with the documentation and examples - and there are almost certainly additional configuration
options in use that the "standard" schema should account for, e.g. some elements that should
accept any attribute but currently don't.

In general I expect the schema could be evolved to be looser in some areas and perhaps tighter
in others.

To help with that, I created some ant rules to convert the schema from Relax NG Compact syntax
to XML Schema.  I find Relax NG easier to maintain, but including runtime validation support
for it would require a large jar to be added to solr.  The patch adds dev-tools/schema, which
contains config.rnc (the source schema) and a build.xml that compiles config.xsd from it
using the trang.jar library and copies the result into a few places in the solr source tree.

Some TODOs:

It might be better to have separate schema files for separate configuration documents - this
way the decision to validate could be made on a per-file basis, rather than globally for all
configuration.

There is no model for <highlighting> in the schema - it's just a big wildcard right
now.


> schema definition for configuration files
> -----------------------------------------
>
>                 Key: SOLR-1758
>                 URL: https://issues.apache.org/jira/browse/SOLR-1758
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Jorg Heymans
>         Attachments: config-validation-20110523.patch
>
>
> A schema definition would be able to spot the subtle error in below config 
> {code}
>     <dataSource name="ora" driver="oracle.jdbc.OracleDriver" url="...." />
>     <datasource name="orablob" type="FieldStreamDataSource" />
>     <document name="mydoc">
>         <entity dataSource="ora" name="meta" query="select id, filename, bytes from documents" >
>             <field column="ID" name="id" />
>             <field column="FILENAME" name="filename" />
>             <entity dataSource="orablob" processor="TikaEntityProcessor" url="bytes" dataField="meta.BYTES">
>               <field column="text" name="mainDocument"/>
>             </entity>
>          </entity>
>      </document>
> {code}
> Also, many xml editors support auto completion based on schema definition so it would
> be easier to create configuration without constantly having to refer to javadoc or samples
> from the distribution.
> This applies equally to schema.xml and solr-config.xml
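
The subtle error in the quoted config is the lower-case <datasource> element: XML names
are case-sensitive, so it is not the same element as <dataSource>.  The sketch below shows
how a closed content model would flag it.  The schema is a deliberately tiny assumption
(a <dataConfig> root that allows only <dataSource> and <document> children), not the
schema from the patch, and the class name MisspelledElementSketch is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class: a schema without a wildcard at this level rejects
// the misspelled <datasource> element.
class MisspelledElementSketch {

    // Illustrative schema (an assumption): attributes and the contents of
    // <document> are left unchecked to keep the example short.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='dataConfig'><xs:complexType><xs:sequence>"
      + "<xs:element name='dataSource' minOccurs='0' maxOccurs='unbounded'>"
      + "<xs:complexType><xs:anyAttribute processContents='skip'/></xs:complexType>"
      + "</xs:element>"
      + "<xs:element name='document' minOccurs='0'><xs:complexType>"
      + "<xs:sequence>"
      + "<xs:any processContents='skip' minOccurs='0' maxOccurs='unbounded'/>"
      + "</xs:sequence><xs:anyAttribute processContents='skip'/>"
      + "</xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns the first validation error message, or null if xml is valid. */
    static String firstError(String xml) throws IOException {
        try {
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        String bad = "<dataConfig>"
                + "<dataSource name='ora' driver='oracle.jdbc.OracleDriver' url='....'/>"
                + "<datasource name='orablob' type='FieldStreamDataSource'/>"
                + "<document name='mydoc'/>"
                + "</dataConfig>";
        // Prints the validator's complaint about the unexpected element.
        System.out.println(firstError(bad));
    }
}
```

Note that this only works where the content model is closed; inside a lax wildcard an
unknown element such as <datasource> would be accepted silently.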

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

