From: Adrien Grand
Date: Tue, 15 Mar 2016 16:41:20 +0000
Subject: Re: Canonicalize stored fields (small set of possible values)
To: java-user@lucene.apache.org

On Tue, Mar 15, 2016 at 17:33, Andreas Sewe wrote:

> I am afraid I don't understand. Do you suggest using IntFields as ID
> instead of StringFields, as they are presumably stored more efficiently?

Exactly. Integers are stored using zig-zag encoding followed by
variable-length byte encoding. So numbers between -64 and 63 use 1 byte,
numbers between -8192 and 8191 use 2 bytes, etc.

> > Otherwise, even without doing anything, things
> > should not be too bad thanks to stored fields compression.
>
> AFAICT, the fields are not compressed on disk right now. At least, "grep
> -c" finds my field over and over in the index files.
>
> So, how do I enable stored fields compression? Googling turned up
> Store.COMPRESS, but that doesn't exist in 5.2.1.

Compression is on by default, but we split the stored fields file into
blocks of 16KB and compress each block individually. So each 16KB block
still needs to store each value at least once before the compression
algorithm can make references to it, which is why grep still finds your
field value many times in the index files.

If you want to enable stronger compression, you can do
`indexWriterConfig.setCodec(new Lucene54Codec(Mode.BEST_COMPRESSION))`,
which will use DEFLATE instead of LZ4 to compress blocks. In addition to
removing duplicates like LZ4 does, DEFLATE also applies Huffman coding, so
you should see better compression if your field values use some symbols
much more frequently than others.
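
The byte counts quoted above for zig-zag plus variable-length byte encoding can be checked with a small standalone sketch. This is not Lucene's own code: the class `ZigZagVarInt` and its methods are hypothetical names chosen for illustration, and the sketch assumes the usual scheme of 7 payload bits per byte with the high bit as a continuation flag.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only; names are hypothetical, not Lucene API.
class ZigZagVarInt {

    // Zig-zag: interleave negatives and positives so that small magnitudes
    // map to small unsigned codes (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...).
    static int zigZagEncode(int value) {
        return (value << 1) ^ (value >> 31);
    }

    // Inverse of zigZagEncode.
    static int zigZagDecode(int encoded) {
        return (encoded >>> 1) ^ -(encoded & 1);
    }

    // Variable-length bytes: emit 7 bits at a time, low-order group first;
    // a set high bit means that more bytes follow.
    static byte[] encode(int value) {
        int v = zigZagEncode(value);
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
        return out.toByteArray();
    }

    // Number of bytes the encoded form of value occupies.
    static int encodedLength(int value) {
        return encode(value).length;
    }

    public static void main(String[] args) {
        int[] samples = {0, 63, -64, 64, -65, 8191, -8192, 8192};
        for (int s : samples) {
            System.out.println(s + " -> " + encodedLength(s) + " byte(s)");
        }
    }
}
```

One byte carries 7 payload bits, so it covers zig-zag codes 0 to 127, which corresponds to -64 through 63. Two bytes carry 14 bits, covering -8192 through 8191. Small integer IDs therefore cost far fewer bytes than an equivalent string value.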