manifoldcf-user mailing list archives

From Furkan KAMACI <furkankam...@gmail.com>
Subject Re: Unnecessary Newline Characters and Metadata at Content
Date Fri, 25 Nov 2016 14:13:12 GMT
Hi Karl,

By the way, I've tried different versions of Solr, and either I couldn't get
any content or I got it in the form I explained earlier. When I check out the
MCF trunk, which uses Solr 6.3.0, and use Solr 6.3.0 as the output connector,
I can see that documents are indexed, but I cannot see the "content" field at all.

Kind Regards,
Furkan KAMACI

On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Furkan,
>
> The following code is used to set up a SolrJ object that is then later
> converted to a post request:
>
> >>>>>>
>     private void buildExtractUpdateHandlerRequest( long length, InputStream is, String contentType,
>       String contentName,
>       ContentStreamUpdateRequest contentStreamUpdateRequest )
>       throws IOException
>     {
>       ModifiableSolrParams out = new ModifiableSolrParams();
>
>       // Write the id field
>       writeField(out,LITERAL+idAttributeName,documentURI);
>       // Write the rest of the attributes
>       if (originalSizeAttributeName != null)
>       {
>         Long size = document.getOriginalSize();
>         if (size != null)
>           // Write value
>           writeField(out,LITERAL+originalSizeAttributeName,size.toString());
>       }
>       if (modifiedDateAttributeName != null)
>       {
>         Date date = document.getModifiedDate();
>         if (date != null)
>           // Write value
>           writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
>       }
>       if (createdDateAttributeName != null)
>       {
>         Date date = document.getCreatedDate();
>         if (date != null)
>           // Write value
>           writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
>       }
>       if (indexedDateAttributeName != null)
>       {
>         Date date = document.getIndexingDate();
>         if (date != null)
>           // Write value
>           writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
>       }
>       if (fileNameAttributeName != null)
>       {
>         String fileName = document.getFileName();
>         if (!StringUtils.isBlank(fileName))
>           writeField(out,LITERAL+fileNameAttributeName,fileName);
>       }
>       if (mimeTypeAttributeName != null)
>       {
>         String mimeType = document.getMimeType();
>         if (!StringUtils.isBlank(mimeType))
>           writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
>       }
>
>       // Write the access token information
>       // Both maps have the same keys.
>       Iterator<String> typeIterator = aclsMap.keySet().iterator();
>       while (typeIterator.hasNext())
>       {
>         String aclType = typeIterator.next();
>         writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
>       }
>
>       // Write the arguments
>       for (String name : arguments.keySet())
>       {
>         List<String> values = arguments.get(name);
>         writeField(out,name,values);
>       }
>
>       // Write the metadata, each in a field by itself
>       buildSolrParamsFromMetadata(out);
>
>       // These are unnecessary now in the case of non-solrcloud setups,
>       // because we overrode the SolrJ posting method to use multipart.
>       //writeField(out,LITERAL+"stream_size",String.valueOf(length));
>       //writeField(out,LITERAL+"stream_name",document.getFileName());
>
>       // General hint for Tika
>       if (!StringUtils.isBlank(document.getFileName()))
>         writeField(out,"resource.name",document.getFileName());
>
>       // Write the commitWithin parameter
>       if (commitWithin != null)
>         writeField(out,COMMITWITHIN_METADATA,commitWithin);
>
>       contentStreamUpdateRequest.setParams(out);
>
>       contentStreamUpdateRequest.addContentStream(new RepositoryDocumentStream(is,length,contentType,contentName));
>     }
> <<<<<<
>
> The ContentStreamUpdateRequest object is defined within SolrJ.  Normally
> this would be the end of ManifoldCF involvement, but we have also needed to
> override some SolrJ classes because of bugs.  So it is possible that we
> could fix this behavior if the problem is within the code we have changed.
> However, having said that, I am not sure that the differences you report
> are significant in any way. The W3C spec for multipart HTTP requests is
> what you'd want to look at for that.
>
> Please see ModifiedHttpMultipart.java for more details.
>
> Thanks,
> Karl
>
>
> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI <furkankamaci@gmail.com>
> wrote:
>
>> Hi Karl,
>>
>> I used the default values for Solr. In my Solr output connector, "Use the
>> Extract Update Handler" is checked, and the update handler is defined as
>> "/update/extract". There is no Tika content extractor defined in the job
>> pipeline.
>>
>> I have Wireshark captures and logs from both ManifoldCF and Solr. I can
>> share them if you want.
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Is this being indexed via the extracting update handler?  What does your
>>> pipeline look like?  Is the tika extractor in the pipeline?
>>>
>>>
>>> Karl
>>>
>>>
>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <furkankamaci@gmail.com>
>>> wrote:
>>>
>>>> I've indexed a file into Solr via ManifoldCF. Its content starts
>>>> with:
>>>>
>>>> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire"
>>>> directed by Elia Kazan, 1951*
>>>>
>>>> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed
>>>> by Elia Kazan, 1951*
>>>>
>>>> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed
>>>> by Elia Kazan, 1951*
>>>>
>>>> However, when I check Solr, I see this in the content field:
>>>>
>>>> * " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
>>>> application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
>>>> \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n  \n
>>>> \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
>>>> directed by Elia Kazan \n"*
>>>>
>>>> There are two problems here.
>>>>
>>>> 1) There are unnecessary newline characters.
>>>>
>>>> 2) Metadata is prepended to the content field, which should not happen.
>>>>
>>>> So the problem may be in Solr or in ManifoldCF (related to
>>>> Tika). When I index the same document to Solr via cURL, no newline
>>>> characters or metadata are prepended.
>>>>
>>>> What do you think would be a solution?
>>>>
>>>> Kind Regards,
>>>> Furkan KAMACI
>>>>
>>>>
>>>
>>
>
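
For readers following the quoted method, here is a minimal sketch of the parameter set it appears to build before the multipart post. It is not the ManifoldCF implementation: it uses a plain map instead of ModifiableSolrParams, the class name ExtractParamsSketch and the field names "id", "original_size", "file_name" and "mime_type" are hypothetical, and the "literal." prefix is assumed to be the value of the LITERAL constant, following the extracting handler's literal.<fieldname> convention. Only "resource.name" is taken directly from the quoted code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExtractParamsSketch {
    // Assumed value of LITERAL: the extracting handler's prefix for caller-supplied fields.
    static final String LITERAL = "literal.";

    // Hypothetical stand-in for writeField(out, name, value): adds one value to a multi-valued parameter.
    static void writeField(Map<String, List<String>> out, String name, String value) {
        out.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Builds the request parameters for one document; null or blank inputs are skipped,
    // mirroring the null and isBlank checks in the quoted method.
    static Map<String, List<String>> buildParams(String documentUri, String fileName,
                                                 String mimeType, Long originalSize) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        writeField(out, LITERAL + "id", documentUri);
        if (originalSize != null) {
            writeField(out, LITERAL + "original_size", originalSize.toString());
        }
        if (fileName != null && !fileName.trim().isEmpty()) {
            writeField(out, LITERAL + "file_name", fileName);
            // General hint for Tika, as in the quoted code.
            writeField(out, "resource.name", fileName);
        }
        if (mimeType != null && !mimeType.trim().isEmpty()) {
            writeField(out, LITERAL + "mime_type", mimeType);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> params =
            buildParams("file:/docs/MARLON%20BRANDO.rtf", "MARLON BRANDO.rtf", "application/rtf", 13580L);
        params.forEach((name, values) -> System.out.println(name + " = " + values));
    }
}
```

In this sketch, none of the parameters carry stream_size or stream_name, matching the two writeField calls that are commented out in the quoted method. That may help when comparing the parameters against the Wireshark captures mentioned earlier in the thread.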
