manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Unnecessary Newline Characters and Metadata at Content
Date Fri, 25 Nov 2016 12:01:40 GMT
Hi Furkan,

The following code is used to set up a SolrJ object that is then later
converted to a post request:

>>>>>>
    private void buildExtractUpdateHandlerRequest( long length, InputStream is, String contentType,
      String contentName,
      ContentStreamUpdateRequest contentStreamUpdateRequest )
      throws IOException
    {
      ModifiableSolrParams out = new ModifiableSolrParams();

      // Write the id field
      writeField(out,LITERAL+idAttributeName,documentURI);
      // Write the rest of the attributes
      if (originalSizeAttributeName != null)
      {
        Long size = document.getOriginalSize();
        if (size != null)
          // Write value
          writeField(out,LITERAL+originalSizeAttributeName,size.toString());
      }
      if (modifiedDateAttributeName != null)
      {
        Date date = document.getModifiedDate();
        if (date != null)
          // Write value
          writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
      }
      if (createdDateAttributeName != null)
      {
        Date date = document.getCreatedDate();
        if (date != null)
          // Write value
          writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
      }
      if (indexedDateAttributeName != null)
      {
        Date date = document.getIndexingDate();
        if (date != null)
          // Write value
          writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
      }
      if (fileNameAttributeName != null)
      {
        String fileName = document.getFileName();
        if (!StringUtils.isBlank(fileName))
          writeField(out,LITERAL+fileNameAttributeName,fileName);
      }
      if (mimeTypeAttributeName != null)
      {
        String mimeType = document.getMimeType();
        if (!StringUtils.isBlank(mimeType))
          writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
      }

      // Write the access token information
      // Both maps have the same keys.
      Iterator<String> typeIterator = aclsMap.keySet().iterator();
      while (typeIterator.hasNext())
      {
        String aclType = typeIterator.next();
        writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
      }

      // Write the arguments
      for (String name : arguments.keySet())
      {
        List<String> values = arguments.get(name);
        writeField(out,name,values);
      }

      // Write the metadata, each in a field by itself
      buildSolrParamsFromMetadata(out);

      // These are unnecessary now in the case of non-solrcloud setups,
      // because we overrode the SolrJ posting method to use multipart.
      //writeField(out,LITERAL+"stream_size",String.valueOf(length));
      //writeField(out,LITERAL+"stream_name",document.getFileName());

      // General hint for Tika
      if (!StringUtils.isBlank(document.getFileName()))
        writeField(out,"resource.name",document.getFileName());

      // Write the commitWithin parameter
      if (commitWithin != null)
        writeField(out,COMMITWITHIN_METADATA,commitWithin);

      contentStreamUpdateRequest.setParams(out);

      contentStreamUpdateRequest.addContentStream(new RepositoryDocumentStream(is,length,contentType,contentName));
    }
<<<<<<
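
As a rough illustration of the pattern in the method above, here is a minimal, self-contained sketch that collects parameters into a plain map instead of a ModifiableSolrParams. The class name, the map-based writeField() stand-in, the sample field names, and the assumption that LITERAL has the value "literal." are all hypothetical and are not taken from the ManifoldCF source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LiteralParamsSketch {
  // Assumption: "literal." is the prefix used for literal field values.
  static final String LITERAL = "literal.";

  // Hypothetical stand-in for writeField(): append one value under a parameter name.
  static void writeField(Map<String, List<String>> out, String name, String value) {
    out.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
  }

  // Mirrors a small part of the method above: the id field, the optional
  // file name field, and the "resource.name" hint for Tika.
  static Map<String, List<String>> buildParams(String idAttributeName, String documentURI,
      String fileNameAttributeName, String fileName) {
    Map<String, List<String>> out = new LinkedHashMap<>();
    writeField(out, LITERAL + idAttributeName, documentURI);
    boolean hasFileName = fileName != null && !fileName.trim().isEmpty();
    if (fileNameAttributeName != null && hasFileName)
      writeField(out, LITERAL + fileNameAttributeName, fileName);
    if (hasFileName)
      writeField(out, "resource.name", fileName);
    return out;
  }

  public static void main(String[] args) {
    // Sample values are made up for illustration only.
    Map<String, List<String>> params =
        buildParams("id", "file:/docs/example.rtf", "file_name", "example.rtf");
    params.forEach((name, values) -> System.out.println(name + "=" + values));
  }
}
```

The point is only that every document attribute travels as a request parameter next to the content stream, so the parameters and the stream can be inspected separately in a capture.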

The ContentStreamUpdateRequest object is defined within SolrJ.  Normally
this would be the end of ManifoldCF's involvement, but we have also needed to
override some SolrJ classes because of bugs.  So it is possible that we
could fix this behavior if the problem is within the code we have changed.
That said, I am not sure that the differences you report are significant
in any way.  The W3C spec for multipart HTTP requests is what you'd want
to look at for that.

Please see ModifiedHttpMultipart.java for more details.
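
For comparing captures, a generic multipart/form-data body can be sketched as follows. This is only an illustration of the general layout (boundary lines, one part per parameter, one part for the file); it is not the output of ModifiedHttpMultipart.java, and the class name, method name, boundary, and field names are all made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MultipartLayoutSketch {
  // Builds a text-only multipart/form-data body: one part per form field,
  // then one file part, then the closing boundary.
  static String buildBody(String boundary, Map<String, String> fields,
      String fileField, String fileName, String contentType, String fileText) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : fields.entrySet()) {
      sb.append("--").append(boundary).append("\r\n");
      sb.append("Content-Disposition: form-data; name=\"")
        .append(e.getKey()).append("\"\r\n");
      sb.append("\r\n");                       // blank line ends the part headers
      sb.append(e.getValue()).append("\r\n");
    }
    sb.append("--").append(boundary).append("\r\n");
    sb.append("Content-Disposition: form-data; name=\"").append(fileField)
      .append("\"; filename=\"").append(fileName).append("\"\r\n");
    sb.append("Content-Type: ").append(contentType).append("\r\n");
    sb.append("\r\n");
    sb.append(fileText).append("\r\n");
    sb.append("--").append(boundary).append("--\r\n");  // closing boundary
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("literal.id", "doc-1");          // hypothetical parameter
    System.out.print(buildBody("XyZ", fields, "myfile", "example.rtf",
        "application/rtf", "file text here"));
  }
}
```

Differences in whitespace or header order between two such bodies are usually harmless; what matters is which part each value ends up in.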

Thanks,
Karl


On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI <furkankamaci@gmail.com>
wrote:

> Hi Karl,
>
> I used the default values for Solr. In my Solr output connector, "Use the
> Extract Update Handler" is checked. The update handler is defined as
> "/update/extract". There is no Tika content extractor defined in the job
> pipeline.
>
> I have WireShark captures and logs from both ManifoldCF and Solr. I can
> share them if you want.
>
> Kind Regards,
> Furkan KAMACI
>
> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Is this being indexed via the extracting update handler?  What does your
>> pipeline look like?  Is the tika extractor in the pipeline?
>>
>>
>> Karl
>>
>>
>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <furkankamaci@gmail.com>
>> wrote:
>>
>>> I've indexed a file via ManifoldCF to Solr whose content starts
>>> with:
>>>
>>> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire"
>>> directed by Elia Kazan, 1951*
>>>
>>> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
>>> Elia Kazan, 1951*
>>>
>>> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
>>> Elia Kazan, 1951*
>>>
>>> However, when I check Solr, I see this in the content field:
>>>
>>> * " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
>>> application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
>>> \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n  \n
>>> \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
>>> directed by Elia Kazan \n"*
>>>
>>> There are 2 problems here.
>>>
>>> 1) There are unnecessary newline characters.
>>>
>>> 2) Metadata is prepended to the content field, which should not happen.
>>>
>>> So the problem may be in Solr or in ManifoldCF (related to
>>> Tika). When I index the same document to Solr via cURL, no newline
>>> characters or metadata are prepended.
>>>
>>> What do you think would be a solution?
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>>
>>
>
