lucene-solr-user mailing list archives

From "Ramkumar R. Aiyengar" <>
Subject Re: Can I reconstruct text from tokens?
Date Wed, 16 Apr 2014 15:59:28 GMT
Logically, if you tokenize the text and put the resulting tokens into a
multivalued field, you should be able to get all the values back in sequence?
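
A minimal sketch of that idea in plain Python, not the Solr API: the
field name text_tokens and both helper functions are hypothetical, and
the document is modeled as a dict whose multivalued field is an ordered
list. It assumes the original text separates tokens with single spaces;
runs of whitespace, tabs and newlines would be collapsed and could not
be recovered.

```python
def tokenize(text):
    # Whitespace-only tokenization with no filters (a stand-in for a
    # basic whitespace tokenizer, not a real Solr/Lucene class).
    return text.split()


def reconstruct(values):
    # Join the multivalued field's values in stored order.
    return " ".join(values)


# Hypothetical document: "text_tokens" plays the role of the multivalued field.
doc = {"id": "1", "text_tokens": tokenize("This is a test")}
original = reconstruct(doc["text_tokens"])
```

The round trip only works if the values come back in insertion order
and no token has been altered or dropped along the way.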
On 16 Apr 2014 16:51, "Alexandre Rafalovitch" <> wrote:

> Hello,
> If I use very basic tokenizers, e.g. space based and no filters, can I
> reconstruct the text from the tokenized form?
> So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
> I know we store enough information, but I don't know the internal API
> well enough to know what I should be looking at for a reconstruction
> algorithm.
> Any hints?
> The XY problem is that I want to store a large amount of very repetitive
> text in Solr. I want the index to be as small as possible, so I
> thought if I just pre-tokenized, my dictionary would be quite small.
> And I will be reconstructing some final form anyway.
> The other option is to just use compression on the stored field, but
> I assume that does not take cross-document efficiencies into account.
> And, it will be a read-only index after build, so I don't care about
> updates messing things up.
> Regards,
>    Alex
> Personal website:
> Current project: - Accelerating your Solr
> proficiency
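
On the reconstruction algorithm asked about in the quoted message: a
toy sketch, assuming "enough information" means that each term's
positions are kept, as in a positional index or term vectors with
positions. This is plain Python over a dict, not Lucene's internal API;
build_postings and rebuild are hypothetical names. Each distinct term
is stored once with a list of positions, and the text is rebuilt by
placing every term back into its slots.

```python
from collections import defaultdict


def build_postings(tokens):
    # Map each distinct term to the list of positions where it occurs.
    postings = defaultdict(list)
    for position, term in enumerate(tokens):
        postings[term].append(position)
    return dict(postings)


def rebuild(postings):
    # Invert the mapping: position -> term, then emit terms in position order.
    slots = {}
    for term, positions in postings.items():
        for position in positions:
            slots[position] = term
    return " ".join(slots[p] for p in sorted(slots))


postings = build_postings("a rose is a rose".split())
text = rebuild(postings)
```

As with any token-based round trip, this only recovers the token
sequence; original whitespace and anything a filter removed are gone.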
