manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-956) Field names are URL encoded
Date Thu, 18 Sep 2014 07:36:34 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138642#comment-14138642 ]

Karl Wright commented on CONNECTORS-956:
----------------------------------------

As for what encoding to use, rather than UTF-8, please read this:

http://grokbase.com/t/lucene/solr-user/135bk8zyzp/solr-4-2-1-behavior-with-field-names-that-use-|-character

The rule is that field names in Lucene/Solr may use the characters allowed in Java identifiers,
plus embedded "." and "-", and may not start with a "$".  This is not enforced, but only such
names are guaranteed to work.  For the actual Java identifier spec, read this:

http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.8

NOTE WELL that this EXCLUDES field names that include most punctuation, such as ":".
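
To make the rule above concrete, here is a minimal sketch of a check for it. The class
and method names are hypothetical and are not part of ManifoldCF or Solr. It also assumes
that "embedded" means "anywhere after the first character", which is one reading of the
rule rather than something Solr itself defines or enforces.

```java
final class FieldNameCheck {
    private FieldNameCheck() {}

    /**
     * Returns true if the name uses only Java identifier characters plus
     * '.' and '-' after the first character, and does not start with '$'.
     * Assumption: "embedded" is read as "not in the first position".
     */
    static boolean isSafeFieldName(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        char first = name.charAt(0);
        // Must start the way a Java identifier starts, but never with '$'.
        if (first == '$' || !Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            char c = name.charAt(i);
            // Java identifier characters, plus embedded '.' and '-'.
            if (!Character.isJavaIdentifierPart(c) && c != '.' && c != '-') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSafeFieldName("store_identifier")); // true
        System.out.println(isSafeFieldName("cm:name"));          // false, ':' is not allowed
    }
}
```

A check like this only reports whether a name falls inside the guaranteed-safe set; what to
do with names that fail it (reject, rename, or pass through) is a separate decision.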

Now, the problem is, should the Solr Connector enforce this in some way, or should we just
let the documents get posted to Solr and let them crash and burn there?  People can filter
fields out using a document transformer now, but for some connectors (e.g. CMIS) it would
be quite a pain to get the field mapping set up correctly.  Looking for ideas on how to make
this work best.



> Field names are URL encoded
> ---------------------------
>
>                 Key: CONNECTORS-956
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-956
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Lucene/SOLR connector
>    Affects Versions: ManifoldCF 1.6.1
>            Reporter: Piergiorgio Lucidi
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The field names provided by some repositories such as Alfresco are based on a URI similar to:
> {code}
> {http://www.alfresco.org/model/system}store_identifier
> {code}
> But in Solr we found the following field name:
> {code}
> http_3a_2f_2fwww_alfresco_org_2fmodel_2fsystem_2f1_0_7dstore_identifier
> {code}
> The code involved in the Solr connector is the following:
> {code}
> protected static String preEncode(String fieldName)
> {
>   return URLEncoder.encode(fieldName);
> }
> {code}
> We should probably solve this by removing the preEncode invocation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
