Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CANGii8fagEXiJqdkK76haycjJON+3nhqt0YX=uGr_6EEPovpoA@mail.gmail.com>
References: 
 <CAEnqZEXMV1NxfnnRR2CgdVs6ZCV6LbG99R5nmoSNKVKJNkUvfw@mail.gmail.com>
	<CAN4YXvdyox5fZ+bk8d-xL6UyRf9oZZT_HTD-8vjXzfPRav=4AQ@mail.gmail.com>
	<CAEnqZEXMAJG6AFwMjWm99UsagGtD8BYp6DJwPhbCiuiT-3GVkw@mail.gmail.com>
	<CANGii8fagEXiJqdkK76haycjJON+3nhqt0YX=uGr_6EEPovpoA@mail.gmail.com>
Date: Wed, 21 Oct 2015 15:52:35 +0200
Message-ID: 
 <CAEnqZEUpDwfXPU9eNz21qyKFAaaQCOXPRhe0CdPZ_ssaw1R37A@mail.gmail.com>
Subject: Re: Efficiency of integer storage/use
From: Robert Krüger <krueger@lesspain.de>
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=001a11c38720d945c505229db00e

--001a11c38720d945c505229db00e
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Thanks, everyone, for your answers. I will probably write a simple parametric
test that fills a Solr index with those integers of very limited range and
then sorts by vector distance to see what the performance characteristics
are.
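
A rough sketch of what such a test harness could start from, in Python. It
only generates the test documents and builds the query parameters; posting
them to Solr and timing the requests is left out. The field names (m0_i and
so on, assuming a "*_i" dynamic integer field), the field count, and the
helper names are all placeholders for illustration. sqedist() is Solr's
squared Euclidean distance function query, which takes the coordinates of
the first point followed by those of the second.

```python
import json
import random

NUM_FIELDS = 4  # placeholder; the real data would have 50-100 metrics per image
FIELD_NAMES = [f"m{i}_i" for i in range(NUM_FIELDS)]  # assumed "*_i" int fields


def make_docs(count, seed=0):
    """Generate documents whose metric fields hold random values in 0..255."""
    rng = random.Random(seed)
    docs = []
    for doc_id in range(count):
        doc = {"id": str(doc_id)}
        for name in FIELD_NAMES:
            doc[name] = rng.randint(0, 255)
        docs.append(doc)
    return docs


def distance_sort(reference):
    """Build a sort clause ordering by squared distance to a reference vector."""
    args = FIELD_NAMES + [str(value) for value in reference]
    return "sqedist({}) asc".format(",".join(args))


def query_params(reference, range_field, low, high, rows=10):
    """Pre-select with a range filter, then sort the subset by distance."""
    return {
        "q": "*:*",
        "fq": f"{range_field}:[{low} TO {high}]",
        "sort": distance_sort(reference),
        "rows": rows,
    }


if __name__ == "__main__":
    print(json.dumps(make_docs(2), indent=2))
    print(query_params([10, 20, 30, 40], "m0_i", 0, 63))
```

Varying NUM_FIELDS and the document count would give the parametric part of
the test.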

On Sun, Oct 18, 2015 at 9:08 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Robert,
> From what I know, both the inverted index and docValues compress their
> content heavily, and stored fields are compressed too. So I think you have
> a good chance of experimenting successfully. You might need to tweak the
> schema to disable storing unnecessary information in the index.
>
> On Sat, Oct 17, 2015 at 1:15 AM, Robert Krüger <krueger@lesspain.de>
> wrote:
>
> > Thanks for the feedback.
> >
> > What I am trying to do is to "abuse" integers to store 8-bit (or even
> > lower) values of metrics I use for content-based image/video search (such
> > as statistical values regarding color distribution) and then implement
> > similarity calculations based on formulas using vector distances. The
> > index can become large (tens of millions of documents, each with, say,
> > 50-100 integers describing the image metrics). I am looking at using some
> > of those metrics for selecting a subset of images using range queries and
> > then more of them for sorting the result set by relevance.
> >
> > I was first looking at implementing those metrics as binary fields (see
> > other posting) and then using a custom function for the distance
> > calculation, but so far I got the impression that this approach is not
> > supported very well by Solr. Base64 encoding/decoding would kill
> > performance, and implementing a custom field type with everything that is
> > probably required for it to work properly is currently beyond my Solr
> > knowledge. Besides, using built-in Solr features makes it easier to
> > fine-tune and experiment with different approaches, because I can just
> > play around with different queries and see what works best without
> > adjusting a custom function each time.
> >
> > I hope that provides a better picture of what I am trying to achieve.
> >
> > Best,
> >
> > Robert
> >
> > On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> > > Under the covers, Lucene stores ints in a packed format, so I'd just
> > > count on that for a first pass.
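
To put rough numbers on the "packed" point, here is a back-of-the-envelope
calculation in Python. It is only an estimate under simplifying assumptions:
it counts raw value bits and ignores Lucene's actual encoding, block
overhead, and every other index structure. The function names are made up
for this illustration, and the document and field counts are the figures
mentioned elsewhere in this thread.

```python
def bits_required(max_value):
    """Minimum number of bits needed to represent values in 0..max_value."""
    return max(1, max_value.bit_length())


def estimated_bytes(num_docs, fields_per_doc, bits_per_value):
    """Raw size of the values alone, rounded up to whole bytes."""
    total_bits = num_docs * fields_per_doc * bits_per_value
    return (total_bits + 7) // 8


if __name__ == "__main__":
    docs, fields = 10_000_000, 100
    packed_bits = bits_required(255)
    print("bits per value when packed:", packed_bits)
    print("unpacked 32-bit bytes:", estimated_bytes(docs, fields, 32))
    print("packed bytes:", estimated_bytes(docs, fields, packed_bits))
```
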
> > >
> > > What is "a lot of integer values"? Hundreds of millions? Billions?
> > > Trillions?
> > >
> > > Unless you give us some indication of scale, it's hard to say anything
> > > helpful. But unless you have some evidence that you're going to blow out
> > > memory, I'd just ignore the "wasted" bits, especially if you can use
> > > docValues; that option holds much of the underlying data in
> > > MMapDirectory, which uses swappable OS memory.
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krueger@lesspain.de>
> > > wrote:
> > > > Hi,
> > > >
> > > > I have a data model where I would store and index a lot of integer
> > > > values with a very restricted range (e.g. 0-255), so theoretically the
> > > > 32 bits of Solr's integer fields are complete overkill. I want to be
> > > > able to do things like vector distance calculations on those fields.
> > > > Should I worry about the "wasted" bits, or will Solr compress/organize
> > > > the index in a way that compensates for this if there are only 256 (or
> > > > even fewer) distinct values?
> > > >
> > > > Any recommendations on how my fields should be defined to make things
> > > > like numeric functions work as fast as technically possible?
> > > >
> > > > Thanks in advance,
> > > >
> > > > Robert
> > >
> >
> >
> >
> > --
> > Robert Krüger
> > Managing Partner
> > Lesspain GmbH & Co. KG
> >
> > www.lesspain-software.com
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <mkhludnev@griddynamics.com>
>


-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com

--001a11c38720d945c505229db00e--