manifoldcf-user mailing list archives

From Furkan KAMACI <furkankam...@gmail.com>
Subject Re: Unnecessary Newline Characters and Metadata at Content
Date Sat, 26 Nov 2016 12:26:39 GMT
Hi Shinichiro,

Yes, I can see the content that way. However, besides the newline
characters, metadata is prepended to the content. Everything is fine
when you send the data directly to Solr without MFC.

For example, the content of one of my documents starts with:

*\n \n stream_size 298979  \n pdf:PDFVersion 1.4  \n X-Parsed-By
org.apache.tika.parser.DefaultParser  \n X-Parsed-By
org.apache.tika.parser.pdf.PDFParser  \n xmp:CreatorTool Google  \n
stream_content_type application/pdf  \n
access_permission:modify_annotations true  \n
access_permission:can_print_degraded true*

I suspect the way MFC sends data to Solr is the cause. Could you
also check it?

Kind Regards,
Furkan KAMACI
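
The quoted reply below suggests removing the newline characters on the Solr side with a char filter or an update processor (Solr ships, for example, PatternReplaceCharFilterFactory and RegexReplaceProcessorFactory for this kind of regex replacement). As a rough illustration of what such a rule does to the field value, here is a minimal Java sketch using only java.util.regex. The class and method names are hypothetical and are not part of ManifoldCF or Solr, and the sketch only collapses whitespace; it does not remove the prepended metadata text.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of ManifoldCF or Solr: shows the effect of
// a "collapse whitespace" regex rule on an extracted content string.
final class ContentWhitespaceCleaner {
    // Matches any run of whitespace, including \n, \r and tabs.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    private ContentWhitespaceCleaner() {}

    // Replaces every whitespace run with one space and trims both ends.
    static String clean(String content) {
        if (content == null) {
            return null;
        }
        return WHITESPACE_RUN.matcher(content).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Sample value modeled on the content reported in this thread.
        String raw = " \n \n  1. Vivien Leigh and Marlon Brando in "
            + "\"A Streetcar Named Desire\"\ndirected by Elia Kazan \n";
        System.out.println(clean(raw));
    }
}
```

In Solr itself the same pattern ("\s+" replaced by a single space) would be configured on the field type or in an update processor chain rather than written in client code.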

On Sat, Nov 26, 2016 at 2:52 AM, Shinichiro Abe <shinichiro.abe.1@gmail.com>
wrote:

> Hi Furkan,
>
> Please see the previous mail [1], which may describe the same issue.
> As far as I know, the newline characters appear with any Tika
> version, and you can see them in Solr's JSON output. If you want to
> remove them, please use a char filter or an update processor in Solr.
> Searching works even when fields contain newline characters, so I
> don't think it is an issue with MCF's SolrJ code.
>
> [1] http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201610.mbox/%3CCA%2BeTv_UO5DKgza%2Bo0bVQF_i%2B8wtHdz61gP51XHu2gF3rKLn%2BMg%40mail.gmail.com%3E
>
> Shinichiro Abe
>
> 2016-11-26 4:11 GMT+09:00 Karl Wright <daddywri@gmail.com>:
> > I am on vacation today and have other responsibilities.  However, I
> > believe Shinichiro Abe might be able to test this out.  He redid the
> > Solr integration for SolrJ 6.3.
> >
> > Thanks,
> > Karl
> >
> >
> > On Fri, Nov 25, 2016 at 1:54 PM, Furkan KAMACI <furkankamaci@gmail.com>
> > wrote:
> >>
> >> Hi Karl,
> >>
> >> Could you test MFC with Solr? I cannot see the content field with
> >> either Windows Shares or File System on Solr 4.x, 5.x, or 6.x. Only
> >> Solr 4.x has content, and it is as I described. The code that sends
> >> content as a stream may have a problem.
> >>
> >> Kind Regards,
> >> Furkan KAMACI
> >>
> >>
> >> On Fri, Nov 25, 2016 at 4:13 PM, Furkan KAMACI <furkankamaci@gmail.com>
> >> wrote:
> >>>
> >>> Hi Karl,
> >>>
> >>> By the way, I've tried different versions of Solr and either got no
> >>> content or got the content I described. When I check out the MFC trunk,
> >>> which uses Solr 6.3.0, and use Solr 6.3.0 as the output connector, I can
> >>> see that documents are indexed, but I cannot even see the "content" field.
> >>>
> >>> Kind Regards,
> >>> Furkan KAMACI
> >>>
> >>> On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <daddywri@gmail.com>
> >>> wrote:
> >>>>
> >>>> Hi Furkan,
> >>>>
> >>>> The following code is used to set up a SolrJ object that is then later
> >>>> converted to a post request:
> >>>>
> >>>> >>>>>>
> >>>>     private void buildExtractUpdateHandlerRequest( long length, InputStream is, String contentType,
> >>>>       String contentName,
> >>>>       ContentStreamUpdateRequest contentStreamUpdateRequest )
> >>>>       throws IOException
> >>>>     {
> >>>>       ModifiableSolrParams out = new ModifiableSolrParams();
> >>>>
> >>>>       // Write the id field
> >>>>       writeField(out,LITERAL+idAttributeName,documentURI);
> >>>>       // Write the rest of the attributes
> >>>>       if (originalSizeAttributeName != null)
> >>>>       {
> >>>>         Long size = document.getOriginalSize();
> >>>>         if (size != null)
> >>>>           // Write value
> >>>>           writeField(out,LITERAL+originalSizeAttributeName,size.toString());
> >>>>       }
> >>>>       if (modifiedDateAttributeName != null)
> >>>>       {
> >>>>         Date date = document.getModifiedDate();
> >>>>         if (date != null)
> >>>>           // Write value
> >>>>           writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
> >>>>       }
> >>>>       if (createdDateAttributeName != null)
> >>>>       {
> >>>>         Date date = document.getCreatedDate();
> >>>>         if (date != null)
> >>>>           // Write value
> >>>>           writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
> >>>>       }
> >>>>       if (indexedDateAttributeName != null)
> >>>>       {
> >>>>         Date date = document.getIndexingDate();
> >>>>         if (date != null)
> >>>>           // Write value
> >>>>           writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
> >>>>       }
> >>>>       if (fileNameAttributeName != null)
> >>>>       {
> >>>>         String fileName = document.getFileName();
> >>>>         if (!StringUtils.isBlank(fileName))
> >>>>           writeField(out,LITERAL+fileNameAttributeName,fileName);
> >>>>       }
> >>>>       if (mimeTypeAttributeName != null)
> >>>>       {
> >>>>         String mimeType = document.getMimeType();
> >>>>         if (!StringUtils.isBlank(mimeType))
> >>>>           writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
> >>>>       }
> >>>>
> >>>>       // Write the access token information
> >>>>       // Both maps have the same keys.
> >>>>       Iterator<String> typeIterator = aclsMap.keySet().iterator();
> >>>>       while (typeIterator.hasNext())
> >>>>       {
> >>>>         String aclType = typeIterator.next();
> >>>>         writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
> >>>>       }
> >>>>
> >>>>       // Write the arguments
> >>>>       for (String name : arguments.keySet())
> >>>>       {
> >>>>         List<String> values = arguments.get(name);
> >>>>         writeField(out,name,values);
> >>>>       }
> >>>>
> >>>>       // Write the metadata, each in a field by itself
> >>>>       buildSolrParamsFromMetadata(out);
> >>>>
> >>>>       // These are unnecessary now in the case of non-solrcloud setups,
> >>>>       // because we overrode the SolrJ posting method to use multipart.
> >>>>       //writeField(out,LITERAL+"stream_size",String.valueOf(length));
> >>>>       //writeField(out,LITERAL+"stream_name",document.getFileName());
> >>>>
> >>>>       // General hint for Tika
> >>>>       if (!StringUtils.isBlank(document.getFileName()))
> >>>>         writeField(out,"resource.name",document.getFileName());
> >>>>
> >>>>       // Write the commitWithin parameter
> >>>>       if (commitWithin != null)
> >>>>         writeField(out,COMMITWITHIN_METADATA,commitWithin);
> >>>>
> >>>>       contentStreamUpdateRequest.setParams(out);
> >>>>
> >>>>       contentStreamUpdateRequest.addContentStream(
> >>>>         new RepositoryDocumentStream(is,length,contentType,contentName));
> >>>>     }
> >>>> <<<<<<
> >>>>
> >>>> The ContentStreamUpdateRequest object is defined within SolrJ.  Normally
> >>>> this would be the end of ManifoldCF involvement, but we have also needed to
> >>>> override some SolrJ classes because of bugs.  So it is possible that we
> >>>> could fix this behavior if the problem is within the code we have changed.
> >>>> However, having said that, I am not sure that the differences you report are
> >>>> significant in any way. The w3c spec for multipart HTTP requests is what
> >>>> you'd want to look at for that.
> >>>>
> >>>> Please see ModifiedHttpMultipart.java for more details.
> >>>>
> >>>> Thanks,
> >>>> Karl
> >>>>
> >>>>
> >>>> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI <furkankamaci@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Hi Karl,
> >>>>>
> >>>>> I used the default values for Solr. In my Solr output connector, "Use the
> >>>>> Extract Update Handler" is checked, and the update handler is set to
> >>>>> "/update/extract". No Tika content extractor is defined in the job
> >>>>> pipeline.
> >>>>>
> >>>>> I have Wireshark captures and logs from both ManifoldCF and Solr. I can
> >>>>> share them if you want.
> >>>>>
> >>>>> Kind Regards,
> >>>>> Furkan KAMACI
> >>>>>
> >>>>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <daddywri@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Is this being indexed via the extracting update handler?  What does
> >>>>>> your pipeline look like?  Is the tika extractor in the pipeline?
> >>>>>>
> >>>>>>
> >>>>>> Karl
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI
> >>>>>> <furkankamaci@gmail.com> wrote:
> >>>>>>>
> >>>>>>> I've indexed a file to Solr via ManifoldCF. Its content starts
> >>>>>>> with:
> >>>>>>>
> >>>>>>> 1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire"
> >>>>>>> directed by Elia Kazan, 1951
> >>>>>>>
> >>>>>>> 2. Portrait of Marlon Brando for "A Streetcar Named Desire"
> >>>>>>> directed by Elia Kazan, 1951
> >>>>>>>
> >>>>>>> 3. Portrait of Marlon Brando for "A Streetcar Named Desire"
> >>>>>>> directed by Elia Kazan, 1951
> >>>>>>>
> >>>>>>> However when I check Solr I see that at content:
> >>>>>>>
> >>>>>>>  " \n \nstream_source_info MARLON BRANDO.rtf  \nstream_content_type
> >>>>>>> application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
> >>>>>>> \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf  \n  \n
> >>>>>>> \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
> >>>>>>> directed by Elia Kazan \n"
> >>>>>>>
> >>>>>>> There are two problems here.
> >>>>>>>
> >>>>>>> 1) There are unnecessary newline characters.
> >>>>>>>
> >>>>>>> 2) Metadata is prepended to the content field, which should not
> >>>>>>> happen.
> >>>>>>>
> >>>>>>> So the problem may be in Solr or in ManifoldCF (related to Tika).
> >>>>>>> When I index the same document to Solr via cURL, no newline
> >>>>>>> characters or metadata are prepended.
> >>>>>>>
> >>>>>>> What would you suggest as a solution?
> >>>>>>>
> >>>>>>> Kind Regards,
> >>>>>>> Furkan KAMACI
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
