manifoldcf-dev mailing list archives

From "Alessandro Benedetti (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-981) Solr Connector - classic Solrj SolrInputDocument support
Date Tue, 24 Jun 2014 18:50:24 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042512#comment-14042512 ]

Alessandro Benedetti commented on CONNECTORS-981:
-------------------------------------------------

So I did some research: through SolrJ, using SolrInputDocument it is not possible to
send the stream, because if you send the stream, the only thing you will index is the
toString of that stream.

So the only way is to transform the stream into a String and send it within the SolrInputDocument.
I understand your concern, but considering the real use case, this is what will happen:

1) The Tika connector will parse the file and get the decoded stream in UTF-8.
2) If it succeeds, it stores the textual content in a field of the RepositoryDocument; if not,
the content remains a stream (which means we don't want to index it in Solr).
3) The Solr connector will build the SolrInputDocument from the RepositoryDocument and index it.
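The stream-to-String conversion this implies can be sketched in plain Java. This is only an illustration of the decoding step, not actual connector code; the class and method names (`StreamToString`, `streamToString`) are hypothetical, and no SolrJ dependency is used:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamToString {

  // Read the whole stream into memory and decode it as UTF-8.
  // The resulting String is what would be set as a field value
  // on the SolrInputDocument instead of the raw stream.
  public static String streamToString(InputStream in) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk)) != -1) {
      buffer.write(chunk, 0, n);
    }
    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    InputStream in = new ByteArrayInputStream(
        "extracted content".getBytes(StandardCharsets.UTF_8));
    System.out.println(streamToString(in));
  }
}
```

Note that sending the raw InputStream as a field value would index only its default `Object.toString()` (e.g. `java.io.ByteArrayInputStream@...`), which is why the explicit decoding step is unavoidable here.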

Talking about memory, the user will be aware that working this way (TikaProcessor + Solr
connector in operation mode 2, without the extract update) costs more memory.

But let's analyse: how much more memory?
Of course, with enormous textual files the memory consumption will be higher, but at that point
the user simply has to configure the JVM properly, since he knows he is going to index a
big amount of data.
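As a rough way to reason about the extra cost (a sketch under assumptions, not a precise heap measurement): Java stores String data as UTF-16 chars, so holding a decoded document in memory costs at most about two bytes per character on top of the original stream; JVMs with compact strings may use less for Latin-1 text. The helper name `worstCaseHeapBytes` below is hypothetical:

```java
public class MemoryEstimate {

  // Rough upper bound on the heap needed to hold extracted text as a
  // Java String: chars occupy at most 2 bytes each (UTF-16), so the
  // cost is at most ~2x the character count (object overhead aside).
  public static long worstCaseHeapBytes(String extractedText) {
    return 2L * extractedText.length();
  }

  public static void main(String[] args) {
    String text = "some extracted document text";
    System.out.println(worstCaseHeapBytes(text) + " bytes upper bound");
  }
}
```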
Furthermore, how much parallelism do we have in document ingestion in the output connector?
So far I have measured almost sequential processing and indexing.

In the end, this will be an alternative operation mode, for when the user wants to first get the content
and maybe do some transformation, or doesn't want to send the metadata extracted from the
document via GET.
He will agree to pay with more memory.
What do you think?




> Solr Connector - classic Solrj SolrInputDocument support
> --------------------------------------------------------
>
>                 Key: CONNECTORS-981
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-981
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Lucene/SOLR connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Alessandro Benedetti
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>         Attachments: CONNECTORS-981.patch
>
>
> The Solr connector, in accordance with the development of the Tika connector processor, should
be able to operate in 2 ways:
> 1) as usual
> 2) using the classic SolrJ SolrInputDocument approach with already-extracted metadata
> To allow the choice, a flag will be added in the UI in the mapping tab (as it's related
to how the fields will be processed)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
