lucene-dev mailing list archives

From "Dirk Rudolph (JIRA)" <>
Subject [jira] [Commented] (SOLR-11869) Remote streaming UpdateRequestProcessor
Date Thu, 18 Jan 2018 14:15:00 GMT


Dirk Rudolph commented on SOLR-11869:

I see. So I will start without taking care of the document being fully read into memory or

Anyway, would that kind of UpdateRequestProcessor be interesting for Solr, or am I the only
one facing that use case?

> Remote streaming UpdateRequestProcessor
> ---------------------------------------
>                 Key: SOLR-11869
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: UpdateRequestProcessors
>            Reporter: Dirk Rudolph
>            Priority: Minor
> When indexing documents from content management systems (or digital asset management
systems), the documents usually have metadata fields supplied by an editor. In the case of PDFs,
DOCX, or other text formats, they may also carry the binary content, which can be
parsed to plain text using Tika. This is what the ExtractingRequestHandler currently supports.
> We are now facing situations where we index batches of documents using the UpdateRequestHandler
and want to send the binary content of the documents mentioned above as part of the single
request to the UpdateRequestHandler. As those documents might be of unknown size and it is
difficult to send streams along the wire with javax.json APIs, I thought about sending the
URL of the document itself, letting Solr fetch the document and have it parsed by Tika,
using a RemoteStreamingUpdateRequestProcessor.
> Example:
> {code:json}
> { 
>  "add": { "id": "doc1", "meta": "foo", "meta": "bar", "text": "Short text" },
>  "add": { "id": "doc2", "meta": "will become long", "text_ref": "http://..." }
> }
> {code}
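
To make the proposal a bit more concrete, the field rewriting could look roughly like the sketch below. It is plain Java and deliberately does not use any Solr or Tika classes. The class name RemoteRefResolver, the "_ref" suffix convention (inferred from "text_ref" in the example) and the fetchAndExtract function are all assumptions made for illustration. A real processor would presumably hook into the UpdateRequestProcessor chain and delegate the fetching and parsing to Tika, which is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Sketch only: resolves "*_ref" fields of one document into plain-text fields.
 * The fetch-and-extract step is passed in as a function so the idea can be
 * shown without Solr or Tika on the classpath.
 */
public class RemoteRefResolver {

    // Assumed naming convention, inferred from "text_ref" in the example.
    static final String REF_SUFFIX = "_ref";

    /**
     * Returns a copy of the document in which every string field named
     * "x_ref" is replaced by a field "x" holding whatever fetchAndExtract
     * returns for the referenced URL. Other fields are copied unchanged.
     */
    public static Map<String, Object> resolve(Map<String, Object> doc,
                                              Function<String, String> fetchAndExtract) {
        Map<String, Object> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : doc.entrySet()) {
            String name = field.getKey();
            Object value = field.getValue();
            boolean isRef = name.endsWith(REF_SUFFIX)
                    && name.length() > REF_SUFFIX.length()
                    && value instanceof String;
            if (isRef) {
                // "text_ref" becomes "text"
                String target = name.substring(0, name.length() - REF_SUFFIX.length());
                resolved.put(target, fetchAndExtract.apply((String) value));
            } else {
                resolved.put(name, value);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "doc2");
        doc.put("meta", "will become long");
        // Placeholder URL, not taken from the issue.
        doc.put("text_ref", "http://example.invalid/doc2.pdf");

        // Stand-in for "fetch the URL and let Tika parse it".
        Function<String, String> fakeExtractor = url -> "extracted text of " + url;

        System.out.println(resolve(doc, fakeExtractor));
    }
}
```

Passing the extractor in as a function keeps the open question from the comment above (whether the document is fully read into memory or streamed) out of the rewriting logic, so that part can be decided separately.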
