manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Indexing JDBC data
Date Fri, 19 Sep 2014 08:57:03 GMT
I've looked at the ElasticSearch connector code.  Bear in mind that this
connector is a contribution from folks who know more about ElasticSearch
than I do. The base64 encoding is apparently part of the design, because it
is presumed that transmission of binary documents to ES is necessary:

>>>>>>
        pw.print("{");
        Iterator<String> i = document.getFields();
        boolean needComma = false;
        while (i.hasNext()){
          String fieldName = i.next();
          String[] fieldValues = document.getFieldAsStrings(fieldName);
          needComma = writeField(pw, needComma, fieldName, fieldValues);
        }

        needComma = writeACLs(pw, needComma, "document", acls, denyAcls);
        needComma = writeACLs(pw, needComma, "share", shareAcls, shareDenyAcls);
        needComma = writeACLs(pw, needComma, "parent", parentAcls, parentDenyAcls);

        if(inputStream!=null){
          if(needComma){
            pw.print(",");
          }
          // I'm told this is not necessary: see CONNECTORS-690
          //pw.print("\"type\" : \"attachment\",");
          pw.print("\"file\" : {");
          String contentType = document.getMimeType();
          if (contentType != null)
            pw.print("\"_content_type\" : "+jsonStringEscape(contentType)+",");
          String fileName = document.getFileName();
          if (fileName != null)
            pw.print("\"_name\" : "+jsonStringEscape(fileName)+",");
          pw.print(" \"content\" : \"");
          Base64 base64 = new Base64();
          base64.encodeStream(inputStream, pw);
          pw.print("\"}");
        }

        pw.print("}");
<<<<<<
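For orientation, the JSON body the code above builds looks roughly like the following (a sketch; the top-level field names come from the repository document, and all values here are hypothetical):

```json
{
  "url" : "http://my.base.url/show.html?record=1",
  "file" : {
    "_content_type" : "text/plain",
    "_name" : "record-1",
    "content" : "aGVsbG8gd29ybGQ="
  }
}
```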

If you think this is incorrect, please let me know.  We *could*, for
example, require that the ES connector only be handed text documents, so
that all extraction of binary content would have to be done with a Tika
transformation connection in the pipeline.
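As a quick way to verify on the receiving end what was actually indexed, one could round-trip a value through the standard java.util.Base64 API (a sketch; the sample string is hypothetical, standing in for a base64 value copied out of the ES document's content field):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DecodeCheck {
  public static void main(String[] args) {
    // Hypothetical value, standing in for what the connector sent to ES.
    String sample = "http://my.base.url/show.html?record=42";

    // Encode the same way the connector does before transmission.
    String stored = Base64.getEncoder()
        .encodeToString(sample.getBytes(StandardCharsets.UTF_8));

    // Decoding the stored field shows which column's value was indexed.
    String decoded = new String(Base64.getDecoder().decode(stored),
        StandardCharsets.UTF_8);
    System.out.println(decoded);
  }
}
```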

I don't see any way, though, that this connector can confuse one field with
another.  I'd go back to your original query to see if that's possible at
that level.

Thanks!
Karl


On Fri, Sep 19, 2014 at 4:43 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Jens,
>
> The queries look correct.
> If you try indexing some small amount of file content through the same
> output connection (using, say, the Filesystem connector), do you see the
> same thing?  I would bet so; if that's the case, then either something is
> wrong with how your Elasticsearch connector is configured, or there's a
> bug.
>
> Karl
>
>
> On Fri, Sep 19, 2014 at 3:26 AM, Jens Jahnke <jens@wegtam.com> wrote:
>
>> Hi,
>>
>> I'm new to ManifoldCF and I'm trying to index some data from a MySQL
>> database via JDBC into an Elasticsearch index.
>>
>> So far I've used the simple mapping example from the user docs for
>> Elasticsearch, and I have the following query for data collection:
>>
>> SELECT id AS $(IDCOLUMN),
>> CONCAT("http://my.base.url/show.html?record=", id) AS $(URLCOLUMN),
>> CONCAT(name, " ", description, " ", what_ever) AS $(DATACOLUMN)
>> FROM accounts WHERE id IN $(IDLIST)
>>
>> If I run the indexing job, the data is fetched from the db and stored
>> in Elasticsearch. But I've noticed two things:
>>
>> 1. The actual content field in the mapping is base64-encoded and
>> therefore not searchable?
>>
>> 2. If I base64-decode the content field, I see that it contains the
>> value from URLCOLUMN, not the one from DATACOLUMN.
>>
>> Can anyone shed some light on this?
>>
>> Regards,
>>
>> Jens
>>
>> --
>> 19 September 2014, 09:15
>> Homepage : http://www.wegtam.com
>>
>> Integrity has no need for rules.
>>
>
>
