lucene-solr-user mailing list archives

From "Ramkumar R. Aiyengar" <andyetitmo...@gmail.com>
Subject Re: Can I reconstruct text from tokens?
Date Wed, 16 Apr 2014 15:59:28 GMT
Logically, if you tokenize the text and put the results in a multivalued
field, shouldn't you be able to get all the values back in sequence?
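
A minimal sketch of that idea in plain Python, with no Solr involved. The
field name text_tokens and the helpers tokenize and reconstruct are
hypothetical, made up for illustration, and the sketch assumes a
multivalued stored field returns its values in insertion order.

```python
def tokenize(text):
    # Whitespace-only tokenization with no filters, as in the question.
    return text.split()


def reconstruct(values):
    # Join the multivalued field's values back together in order.
    return " ".join(values)


# Hypothetical document: each token becomes one value of a multivalued field.
doc = {"id": "1", "text_tokens": tokenize("This is a test")}

print(reconstruct(doc["text_tokens"]))
```

The round trip is exact only when tokens are separated by single spaces.
Runs of spaces, tabs and newlines all collapse to one space on the way
back.
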
On 16 Apr 2014 16:51, "Alexandre Rafalovitch" <arafalov@gmail.com> wrote:

> Hello,
>
> If I use very basic tokenizers, e.g. space based and no filters, can I
> reconstruct the text from the tokenized form?
>
> So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
>
> I know we store enough information, but I don't know the internal API
> well enough to know what I should be looking at for a reconstruction
> algorithm.
>
> Any hints?
>
> The XY problem is that I want to store a large amount of very repetitive
> text in Solr. I want the index to be as small as possible, so I
> thought that if I just pre-tokenized, my dictionary would be quite small.
> And I will be reconstructing some final form anyway.
>
> The other option is to just use compression on the stored field, but
> I assume that does not take cross-document efficiencies into account.
> And, it will be a read-only index after build, so I don't care about
> updates messing things up.
>
> Regards,
>    Alex
>
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
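
To illustrate the "small dictionary" intuition from the quoted message,
here is a rough sketch of dictionary-encoding tokens across documents. It
is plain Python and says nothing about how Solr or Lucene lay out an index
internally. The names encode and decode are hypothetical.

```python
def encode(docs):
    # One dictionary shared by all documents: token -> integer id.
    dictionary = {}
    encoded = []
    for text in docs:
        ids = []
        for tok in text.split():
            # Assign the next free id the first time a token is seen.
            ids.append(dictionary.setdefault(tok, len(dictionary)))
        encoded.append(ids)
    return dictionary, encoded


def decode(dictionary, ids):
    # Invert the dictionary and join the tokens with single spaces.
    by_id = {i: tok for tok, i in dictionary.items()}
    return " ".join(by_id[i] for i in ids)


docs = ["This is a test", "This is a second test"]
dictionary, encoded = encode(docs)
print(len(dictionary), encoded)
print(decode(dictionary, encoded[1]))
```

Repeated tokens cost one dictionary entry in total, plus one id per
occurrence, which is where the cross-document saving would come from.
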
