lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Tika analyzers
Date Wed, 30 Jul 2014 15:22:14 GMT
Solr effectively supports only one binary per indexed document.
This is because you are not actually indexing the binary itself. You
are extracting metadata (e.g. Author) and content fields out of it and
mapping them to the "Solr document". So it makes no sense to have two
binary fields, because their metadata output would overlap. The
"binary" itself is not stored, and storing it is not recommended
either, for performance reasons.
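
To make the overlap concrete, here is a minimal sketch in plain Python.
It does not call Tika or Solr. The dictionaries stand in for whatever an
extractor might return, and the names map_to_solr_doc and
merge_second_binary, as well as the field names, are hypothetical.

```python
# Hypothetical stand-ins for what an extractor might return per binary.
# No Tika or Solr call is made here; the field names are illustrative.
extracted_a = {"author": "Alice", "content": "text of first file"}
extracted_b = {"author": "Bob", "content": "text of second file"}


def map_to_solr_doc(doc_id, extracted):
    """Copy extracted metadata and text into one flat document.

    Only the extracted values are kept; the raw bytes are not.
    """
    doc = {"id": doc_id}
    doc.update(extracted)
    return doc


def merge_second_binary(doc, extracted):
    """Naively map a second binary into the same flat document.

    Returns the merged document and the sorted names of the fields
    whose values were overwritten by the second binary.
    """
    clashes = sorted(
        key for key in extracted if key in doc and doc[key] != extracted[key]
    )
    merged = dict(doc)
    merged.update(extracted)
    return merged, clashes


doc = map_to_solr_doc("doc-1", extracted_a)
merged, clashes = merge_second_binary(doc, extracted_b)
print(clashes)
```

Under these assumptions, both binaries produce the same field names, so
the second one silently replaces the first one's values.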

You may want to think backwards from what you want to find and then
figure out where that data is coming from. Depending on your real
scenario, you may then end up with multi-valued fields, child
documents, flattened documents (e.g. repeated common metadata), etc.
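
As a rough illustration of those three shapes, the sketch below builds
each one from the same input using plain Python dictionaries. The
function names and field names are hypothetical, and the
"_childDocuments_" key is assumed to match Solr's JSON nested-document
format, so check it against your Solr version before relying on it.

```python
# Two attachments belonging to one logical record (illustrative data).
attachments = [
    {"author": "Alice", "content": "text of first file"},
    {"author": "Bob", "content": "text of second file"},
]


def as_multivalued(doc_id, title, items):
    """One document; each attachment property becomes a list of values."""
    return {
        "id": doc_id,
        "title": title,
        "attachment_author": [item["author"] for item in items],
        "attachment_content": [item["content"] for item in items],
    }


def as_child_documents(doc_id, title, items):
    """One parent document with one nested child per attachment."""
    children = [
        {"id": f"{doc_id}-{n}", **item} for n, item in enumerate(items, start=1)
    ]
    # Assumed key name for nested documents in Solr's JSON update format.
    return {"id": doc_id, "title": title, "_childDocuments_": children}


def as_flattened(doc_id, title, items):
    """One document per attachment, repeating the common metadata."""
    return [
        {"id": f"{doc_id}-{n}", "title": title, **item}
        for n, item in enumerate(items, start=1)
    ]


multi = as_multivalued("rec-1", "a nice doc", attachments)
nested = as_child_documents("rec-1", "a nice doc", attachments)
flat = as_flattened("rec-1", "a nice doc", attachments)
```

Which shape fits depends on the queries: multi-valued fields lose the
pairing between an author and its content, while child and flattened
documents keep it at the cost of more documents.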

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Wed, Jul 30, 2014 at 8:00 PM, Tommaso Teofili
<tommaso.teofili@gmail.com> wrote:
> Hi all,
>
> while SolrCell works nicely when in need of indexing binary documents, I am
> wondering about the possibility of having Lucene / Solr documents that have
> binaries in specific Lucene fields, e.g. title="a nice doc",
> name="blabla.doc", binary="0x1234...".
>
> In that case the "binary" field should have an indexing analyzer which can
> extract the text from the binary and index it.
>
> Would it make sense to create a Tika based analyzer for that purpose?
>
> Regards,
> Tommaso
