Subject: Re: Efficiency of integer storage/use
From: Robert Krüger
To: solr-user@lucene.apache.org
Date: Wed, 21 Oct 2015 15:52:35 +0200

Thanks, everyone, for your answers. I will probably run a simple
parametric test: fill a Solr index with those integers of very limited
range, then sort by vector distance to see what the performance
characteristics are.

On Sun, Oct 18, 2015 at 9:08 PM, Mikhail Khludnev
<mkhludnev@griddynamics.com> wrote:

> Robert,
> From what I know, both the inverted index and docValues compress
> content considerably, and even stored fields are compressed. So I
> think you have a good chance of experimenting successfully. You might
> need to tweak the schema to disable storing unnecessary info in the
> index.
>
> On Sat, Oct 17, 2015 at 1:15 AM, Robert Krüger wrote:
>
> > Thanks for the feedback.
> >
> > What I am trying to do is to "abuse" integers to store 8-bit (or
> > even smaller) values of metrics I use for content-based image/video
> > search (such as statistical values regarding color distribution)
> > and then implement similarity calculations based on formulas using
> > vector distances. The index can become large (tens of millions of
> > documents, each with, say, 50-100 integers describing the image
> > metrics). I am looking at using some of those metrics for selecting
> > a subset of images with range queries, and then more of them for
> > sorting the result set by relevance.
> >
> > I first looked at implementing those metrics as binary fields (see
> > other posting) and then using a custom function for the distance
> > calculation, but so far I have the impression that this approach is
> > not supported very well by Solr. Base64 encoding/decoding would
> > kill performance, and implementing a custom field type with
> > everything that is probably required for it to work properly is
> > currently beyond my Solr knowledge. Besides, using built-in Solr
> > features makes it easier to fine-tune and experiment with different
> > approaches, because I can just play around with different queries
> > and see what works best, without adjusting a custom function each
> > time.
> >
> > I hope that provides a better picture of what I am trying to
> > achieve.
> >
> > Best,
> >
> > Robert
> >
> > On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson wrote:
> >
> > > Under the covers, Lucene stores ints in a packed format, so I'd
> > > just count on that for a first pass.
> > >
> > > What is "a lot of integer values"? Hundreds of millions?
> > > Billions? Trillions?
> > >
> > > Unless you give us some indication of scale, it's hard to say
> > > anything helpful. But unless you have some evidence that you're
> > > going to blow out memory, I'd just ignore the "wasted" bits.
> > > Especially if you can use docValues, since that option holds
> > > much of the underlying data in MMapDirectory, which uses
> > > swappable OS memory...
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger wrote:
> > > > Hi,
> > > >
> > > > I have a data model where I would store and index a lot of
> > > > integer values with a very restricted range (e.g. 0-255), so
> > > > theoretically the 32 bits of Solr's integer fields are complete
> > > > overkill. I want to be able to do things like vector distance
> > > > calculations on those fields. Should I worry about the "wasted"
> > > > bits, or will Solr compress/organize the index in a way that
> > > > compensates for this if there are only 256 (or even fewer)
> > > > distinct values?
> > > >
> > > > Any recommendations on how my fields should be defined to make
> > > > things like numeric functions work as fast as technically
> > > > possible?
> > > >
> > > > Thanks in advance,
> > > >
> > > > Robert
> >
> > --
> > Robert Krüger
> > Managing Partner
> > Lesspain GmbH & Co. KG
> >
> > www.lesspain-software.com
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics

-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
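
For anyone who wants a rough baseline before running the parametric test discussed in this thread, here is a minimal sketch in plain Python. It is only back-of-envelope arithmetic plus a brute-force ranking, assuming values in 0-255 and squared Euclidean distance as the metric. It does not reproduce Lucene's actual packed encoding and does not talk to Solr, and all function names are hypothetical.

```python
import math


def bits_per_value(max_value: int) -> int:
    """Bits needed to represent any integer in 0..max_value (at least 1)."""
    return max(1, max_value.bit_length())


def packed_bytes(num_docs: int, values_per_doc: int, max_value: int) -> int:
    """Idealized size if every value took exactly bits_per_value(max_value) bits.

    Assumption: a perfect bit-packing with no per-block or per-field overhead.
    Real index sizes will differ.
    """
    total_bits = num_docs * values_per_doc * bits_per_value(max_value)
    return math.ceil(total_bits / 8)


def squared_distance(a, b) -> int:
    """Squared Euclidean distance between two equal-length metric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_by_distance(query, docs):
    """Return document ids ordered by ascending squared distance to query.

    docs maps a (hypothetical) document id to its metric vector.
    """
    return sorted(docs, key=lambda doc_id: squared_distance(query, docs[doc_id]))


if __name__ == "__main__":
    # 10 million documents with 100 metrics each, values in 0..255.
    print(packed_bytes(10_000_000, 100, 255))  # idealized 8-bit packing
    print(10_000_000 * 100 * 4)                # plain 32-bit ints, for comparison
    print(rank_by_distance([0, 0], {"a": [3, 4], "b": [1, 1], "c": [10, 0]}))
```

Under these assumptions, 10 million documents with 100 values each come to 1,000,000,000 bytes at 8 bits per value versus 4,000,000,000 bytes at a full 32 bits, which gives a feel for the upper bound of what the "wasted" bits could cost if nothing were packed.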