lucene-dev mailing list archives

From "Julien Massiera (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
Date Fri, 28 Sep 2018 09:36:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631608#comment-16631608
] 

Julien Massiera commented on SOLR-12798:
----------------------------------------

Suppose I have a PDF file that contains an image that can be OCRed. I have a process
that sends the PDF to a Tika server, which extracts the metadata of the PDF file plus the text
extracted from the image by Tesseract. At the end of the Tika job, the process retrieves
two elements: a list of metadata as an ArrayList and a file containing the text extracted
from the image inside the PDF file. Now, to the metadata list I add the ACLs of the PDF file
(which are huge), and I need the metadata and the file to be sent as one document to Solr
for indexing.
What are your recommendations in terms of code to do this in the most efficient way (in terms
of memory consumption and performance, of course), using SolrJ? And which handler would
you use on the Solr side?
I will test it and see whether I run into the URL limit issue.

> Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-12798
>                 URL: https://issues.apache.org/jira/browse/SOLR-12798
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrJ
>    Affects Versions: 7.4
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: HOT Balloon Trip_Ultra HD.jpg, SOLR-12798-approach.patch, SOLR-12798-reproducer.patch,
SOLR-12798-workaround.patch, no params in url.png, solr-update-request.txt
>
>
> Project ManifoldCF uses SolrJ to post documents to Solr.  When upgrading from SolrJ 7.0.x
to SolrJ 7.4, we encountered significant structural changes to SolrJ's HttpSolrClient class
that seemingly disable any use of multipart post.  This is critical because ManifoldCF's documents
often contain metadata in excess of 4K that therefore cannot be stuffed into a URL.
> The changes in question seem to have been performed by Paul Noble on 10/31/2017, with
the introduction of the RequestWriter mechanism.  Basically, if a request has a RequestWriter,
it is used exclusively to write the request, and that overrides the stream mechanism completely.
 I haven't chased it back to a specific ticket.
> ManifoldCF's usage of SolrJ involves the creation of ContentStreamUpdateRequests for
all posts meant for Solr Cell, and the creation of UpdateRequests for posts not meant for
Solr Cell (as well as for delete and commit requests).  For our release cycle that is taking
place right now, we're shipping a modified version of HttpSolrClient that ignores the RequestWriter
when dealing with ContentStreamUpdateRequests.  We apparently cannot use multipart for all
requests because, when we do, we get "pfountz Should not get here!" errors on the Solr
side, which generate HTTP error code 500 responses.  That should not happen either,
in my opinion.
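
To illustrate the size problem described in the issue, here is a minimal, standalone sketch
(plain Java, no SolrJ) that compares the same parameters encoded as a URL query string and
as a multipart/form-data body. It is not SolrJ's implementation. The class name, the 4096
limit, the "literal.acl" parameter name, the "file" part name and the file name are all
assumptions chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of SolrJ.
class MultipartSketch {
    // Assumed limit for illustration only; real limits depend on the server and any proxies.
    static final int ASSUMED_URL_LIMIT = 4096;

    /** Encodes the parameters as a URL query string, as they would travel in the URL. */
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Builds a multipart/form-data body: one part per parameter, then one file part. */
    static byte[] toMultipartBody(Map<String, String> params, String fileName,
                                  byte[] fileContent, String boundary) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, String> e : params.entrySet()) {
            put(out, "--" + boundary + "\r\n");
            put(out, "Content-Disposition: form-data; name=\"" + e.getKey() + "\"\r\n\r\n");
            put(out, e.getValue() + "\r\n");
        }
        put(out, "--" + boundary + "\r\n");
        put(out, "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n");
        put(out, "Content-Type: application/octet-stream\r\n\r\n");
        out.write(fileContent, 0, fileContent.length);
        put(out, "\r\n--" + boundary + "--\r\n"); // closing delimiter
        return out.toByteArray();
    }

    private static void put(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        out.write(b, 0, b.length);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        // Simulated large ACL metadata value (made-up data).
        params.put("literal.acl", "group-token;".repeat(500));
        String query = toQueryString(params);
        byte[] body = toMultipartBody(params, "ocr-text.txt",
                "extracted text".getBytes(StandardCharsets.UTF_8), "sketch-boundary");
        System.out.println("query string length: " + query.length()
                + " (over assumed limit: " + (query.length() > ASSUMED_URL_LIMIT) + ")");
        System.out.println("multipart body length: " + body.length);
    }
}
```

The point of the sketch is only that a request body has no comparable length limit, so large
metadata values fit in a multipart post where they would not fit in the URL.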



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

