manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-956) Field names are URL encoded
Date Thu, 18 Sep 2014 07:17:34 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138630#comment-14138630
] 

Karl Wright commented on CONNECTORS-956:
----------------------------------------

I did some research into what SolrJ currently does.

The key method is SolrServer.request(ContentStreamUpdateRequest cs). In the non-Solr-Cloud
case, that method is implemented by ModifiedHttpSolrServer, our own override (written to fix
other bugs) of the SolrJ class org.apache.solr.client.solrj.impl.HttpSolrServer.  What it
does for Get, Post, and multipart Post is as follows:

Get:
{code}
             method = new HttpGet( baseUrl + path + ClientUtils.toQueryString( params, false ) );
{code}
Post:
{code}
                    if (isMultipart) {
                      parts.add(new FormBodyPart(p, new StringBody(v, StandardCharsets.UTF_8)));
                    } else {
                      postParams.add(new BasicNameValuePair(p, v));
                    }
{code}
Multipart:
{code}
                post.setEntity(new UrlEncodedFormEntity(postParams, StandardCharsets.UTF_8));
                ModifiedMultipartEntity entity = new ModifiedMultipartEntity(HttpMultipartMode.STRICT, null, StandardCharsets.UTF_8);
                for(FormBodyPart p: parts) {
                  entity.addPart(p);
                }
                post.setEntity(entity);
{code}
Not multipart:
{code}
                post.setEntity(new UrlEncodedFormEntity(postParams, StandardCharsets.UTF_8));
{code}

I believe multipart post and post are therefore safe against illegal parameter name characters.
However, ClientUtils.toQueryString( params, false ) is NOT safe, because it appends each
parameter name (key) verbatim and URL-encodes only the values:

{code}
public static String toQueryString( SolrParams params, boolean xml ) {
    StringBuilder sb = new StringBuilder(128);
    try {
      String amp = xml ? "&amp;" : "&";
      boolean first=true;
      Iterator<String> names = params.getParameterNamesIterator();
      while( names.hasNext() ) {
        String key = names.next();
        String[] valarr = params.getParams( key );
        if( valarr == null ) {
          sb.append( first?"?":amp );
          sb.append(key);
          first=false;
        }
        else {
          for (String val : valarr) {
            sb.append( first? "?":amp );
            sb.append(key);
            if( val != null ) {
              sb.append('=');
              sb.append( URLEncoder.encode( val, "UTF-8" ) );
            }
            first=false;
          }
        }
      }
    }
    catch (IOException e) {throw new RuntimeException(e);}  // can't happen
    return sb.toString();
  }
{code}

I can't override that method, because it is static and is called from multiple places.  The
best I can do is override the solr server classes that make use of it.  That may or may not
work; the derivation of (say) org.apache.solr.client.solrj.impl.CloudSolrServer is complex.
The concern is that we don't control that flow, for the most part, although posts, gets, and
multipart posts *do* still go through our ModifiedHttpSolrServer class.
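
For illustration, a variant of toQueryString that encodes parameter names as well as values
might look like the sketch below.  This is not SolrJ code: the class name SafeQueryString and
the plain Map standing in for SolrParams are assumptions made only to keep the example
self-contained.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of SolrJ or ManifoldCF.
public class SafeQueryString {

  // Builds "?name=value&name=value", URL-encoding names as well as values.
  // A Map<String, String[]> stands in for SolrParams (assumption).
  public static String toQueryString(Map<String, String[]> params) {
    StringBuilder sb = new StringBuilder(128);
    boolean first = true;
    try {
      for (Map.Entry<String, String[]> entry : params.entrySet()) {
        // The difference from the method quoted above: the name is encoded too.
        String key = URLEncoder.encode(entry.getKey(), "UTF-8");
        String[] values = entry.getValue();
        if (values == null) {
          sb.append(first ? '?' : '&').append(key);
          first = false;
          continue;
        }
        for (String val : values) {
          sb.append(first ? '?' : '&').append(key);
          if (val != null) {
            sb.append('=').append(URLEncoder.encode(val, "UTF-8"));
          }
          first = false;
        }
      }
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always supported
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String[]> params = new LinkedHashMap<>();
    params.put("{http://www.alfresco.org/model/system}store_identifier", new String[] {"a b"});
    params.put("commit", new String[] {"true"});
    System.out.println(toQueryString(params));
  }
}
```

With name encoding done at this level, the caller would no longer need to pre-encode field
names before handing them to the request.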

What I propose to do is to break backwards compatibility in trunk, since it's ManifoldCF 2.0
anyway and that is allowed.  If the change seems to work there, we can talk about adding a
switch in the dev_1x branch.


> Field names are URL encoded
> ---------------------------
>
>                 Key: CONNECTORS-956
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-956
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Lucene/SOLR connector
>    Affects Versions: ManifoldCF 1.6.1
>            Reporter: Piergiorgio Lucidi
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The field names provided by some repositories such as Alfresco are based on a URI similar to:
> {code}
> {http://www.alfresco.org/model/system}store_identifier
> {code}
> But in Solr we found the following field name:
> {code}
> http_3a_2f_2fwww_alfresco_org_2fmodel_2fsystem_2f1_0_7dstore_identifier
> {code}
> The code involved in the Solr connector is the following:
> {code}
> protected static String preEncode(String fieldName)
>   {
>       return URLEncoder.encode(fieldName);
>   }
> {code}
> Probably we should try to solve it by removing the preEncode invocation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
