lucene-solr-user mailing list archives

From "Ramkumar R. Aiyengar" <andyetitmo...@gmail.com>
Subject Re: Can I reconstruct text from tokens?
Date Fri, 18 Apr 2014 19:36:32 GMT
Sorry, I didn't think this through. You're right, it's still the same problem.
On 16 Apr 2014 17:40, "Alexandre Rafalovitch" <arafalov@gmail.com> wrote:

> Why? I want stored=false, at which point a multivalued field is just offset
> values in the dictionary. I would still have to reconstruct from offsets.
>
> Or am I missing something?
>
> Regards,
>      Alex
> On 16/04/2014 10:59 pm, "Ramkumar R. Aiyengar" <andyetitmoves@gmail.com>
> wrote:
>
> > Logically, if you tokenize and put the results in a multivalued field, you
> > should be able to get all values in sequence?
> > On 16 Apr 2014 16:51, "Alexandre Rafalovitch" <arafalov@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > If I use very basic tokenizers, e.g. space based and no filters, can I
> > > reconstruct the text from the tokenized form?
> > >
> > > So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
> > >
> > > I know we store enough information, but I don't know the internal API
> > > well enough to know what I should be looking at for a reconstruction
> > > algorithm.
> > >
> > > Any hints?
> > >
> > > The XY problem is that I want to store a large amount of highly
> > > repetitive text in Solr. I want the index to be as small as possible,
> > > so I thought that if I just pre-tokenized, my dictionary would be quite
> > > small. And I will be reconstructing some final form anyway.
> > >
> > > The other option is to just use compressed stored fields, but I assume
> > > that does not take cross-document efficiencies into account. And it
> > > will be a read-only index after build, so I don't care about updates
> > > messing things up.
> > >
> > > Regards,
> > >    Alex
> > >
> > > Personal website: http://www.outerthoughts.com/
> > > Current project: http://www.solr-start.com/ - Accelerating your Solr
> > > proficiency
> > >
> >
>
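
For what it's worth, the "reconstruct from offsets" idea discussed in this thread can be sketched with a toy in-memory model. This is not Lucene's or Solr's API; the names build_index and reconstruct are hypothetical, and the sketch assumes plain whitespace tokenization with no filters, as in the original question. Each term is stored once in a shared dictionary, with per-document token positions attached, and a document is rebuilt by sorting its (position, term) pairs.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]} using whitespace tokenization."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def reconstruct(index, doc_id):
    """Rebuild one document by sorting its (position, term) pairs."""
    pairs = []
    # Toy model only: this walks the whole term dictionary for a single document.
    for term, postings in index.items():
        for position in postings.get(doc_id, []):
            pairs.append((position, term))
    return " ".join(term for _, term in sorted(pairs))


# Hypothetical sample data; doc 1 is the example from the original question.
docs = {1: "This is a test", 2: "This is a second test"}
index = build_index(docs)
print(reconstruct(index, 1))  # prints "This is a test"
```

Note that the round trip is only exact when tokens were separated by single spaces, since runs of whitespace are normalized. Also, in this sketch rebuilding one document means scanning every term in the dictionary, which illustrates why reconstructing from a term-oriented structure is the hard part.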
