Return-Path: X-Original-To: apmail-lucene-java-user-archive@www.apache.org Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1B8C510524 for ; Mon, 23 Sep 2013 11:57:25 +0000 (UTC) Received: (qmail 90357 invoked by uid 500); 23 Sep 2013 11:56:38 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 90200 invoked by uid 500); 23 Sep 2013 11:56:18 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 90155 invoked by uid 99); 23 Sep 2013 11:56:08 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 23 Sep 2013 11:56:08 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy includes SPF record at spf.trusted-forwarder.org) Received: from [209.85.220.178] (HELO mail-vc0-f178.google.com) (209.85.220.178) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 23 Sep 2013 11:56:02 +0000 Received: by mail-vc0-f178.google.com with SMTP id ha12so2004365vcb.23 for ; Mon, 23 Sep 2013 04:55:40 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc:content-type; bh=90QXgXUNPKmVL9+/uhXqkJGQN1tJKnpXpwrofzQX2Vk=; b=at1I+5m9HWlyhBHkJU2ORzAt9OLxqHyw/DraJsa8PIAiX8P/PkB6/7YFotYxffTIjy 7UFAHgFAor1B3svEbHTbRks378SGdefNVhlrquRtirONs9hW4r3hrmT0Bi1vlEamRRCy 4X/Ji1PhfddzGh3Uph5bLAVF9fNhlkZ0rWBIndkWYtX0IwrWNZ1OxDC7tphk2wecMlcU DiPImGxDuLdukvJkqBE9dWoDYStD31fxf6TIBJU3Ajza7IshknZbCWArSCaXP98+Pzp9 JZkeMvqiagEWIepDqmK4HrivbQ35dB7FLU/PbMABzkLKJWyDojJa/qJUSFXhJTVOf/NB 0JMA== X-Gm-Message-State: ALoCoQnnjD/VPG+S0p9LnISgSbzMR9Gl/8lvaYxc4hNZJmKqRIfZwtqHqr2X1ApAzbiHVQisEmSo X-Received: by 10.221.40.10 with SMTP id to10mr1329307vcb.22.1379937340773; Mon, 23 Sep 2013 04:55:40 -0700 (PDT) MIME-Version: 1.0 Received: by 10.220.72.131 with HTTP; Mon, 23 Sep 2013 04:55:20 -0700 (PDT) In-Reply-To: References: From: Michael McCandless Date: Mon, 23 Sep 2013 07:55:20 -0400 Message-ID: Subject: Re: Understanding FST Prefix & CheckIndex output To: Lucene Users Cc: manuel.lenormand@gmail.com Content-Type: text/plain; charset=ISO-8859-1 X-Virus-Checked: Checked by ClamAV on apache.org On Sun, Sep 22, 2013 at 9:34 AM, Manuel Le Normand wrote: > Hi there, > I try to deep dive into the inner LucenePostingFormat to check what might I > do for improving query performance. I'm curious about the termBlock stats > that I get from checkIndex -verbose. > > What does the followong mean: > index FST bytes - the FST size, which is the field's partition of the .tip > file? Yes. > num of terms - written 2M, although Luke interface shows me 8M, how come? This should be the total unique term count for that field; not sure why Luke disagrees. Maybe Luke is printing Terms.sumTotalTermFreq = total number of term occurrences (not unique terms). > term / index FST bytes - summing up all my fields bytes doesn't get me > close to the .tim / tip file, how come? The .tim file also contains the postings metadata (file pointers into .doc, .pos, skip length, etc.), so .tim file size will be larger than the terms + index bytes as reported by CheckIndex. > blocks - these are the SUFFIX blocks (.tim files), which are implemented as > Burst Tries, right? Yes; this is basically the number of nodes in the prefix trie, where each node has a block of terms' suffixes, plus their metadata (docFreq, totalTermFreq) plus the postings metadata. > block types - where can I get the info about these different types? A terms-only block is a "leaf" node in the prefix trie: it has no child sub-blocks, so all of its entries are terms. A sub-clock only block is a non-leaf node that has only pointers to other blocks, i.e. it contains no terms itself. A mixed block is all other blocks: they contain at least one term and at least one sub-block. The total number of blocks is equal to the sum of these three counts. A floor block is orthogonal to the above breakdown, and is used when a single block would contain too many entries. This is necessary because the problem of dividing up the terms into blocks of size 25 - 48 terms or sub-blocks is over-constrained. Since the labels are bytes, in the worst case you could have 255 entries in a block; when this happens, we split the block into an initial floor block, followed by a number of "sub-floor" blocks. When seeking, the FST points to each of these floor / sub-floor blocks so e.g. if you are trying to get to byte 255 at this point, you don't have to scan through the first 255 entries to get there. > As background, my main performance issue is (random?) read miss IO while > looking up terms in the BlockTreeTerm (tim files, right?) on heavy-termed > queries, so my optimization is avoiding IO's. Maybe try the new in-memory terms dict? It holds all terms + metadata in RAM so there's no seeking when looking up the term. But the postings are still on disk, so seeking is necessary there. This is currently trunk only, FSTTerms* under lucene/codec. You could also try MemoryPostingsFormat, which holds all terms + postings in RAM. But you need to have enough RAM vs your index size for the fields, for these :) > That said, is there any > reason getting the right block will require more than IO > (of 4kB)? In the worst case it's 2 seeks per term: first the FST tells us which block to load from the .tim file, so we seek + scan to it. Then, assuming you need to visit the documents for this term, it's another seek to get to the right block for the postings, unless there was only 1 doc for this term and you indexed DOCS_ONLY, in which case that docID is inlined into the term metadata (so only one seek). If you also need positions, then that will be another seek; if payloads/offsets, another seek. > Does a certain distribution of prefix length of block types should alarm me > in some way? > > field "text_txt" > index FST: > 18300 nodes > 45779 arc > 583438 bytes > term: > 2053393 terms > 25597203 bytes (12.5 bytes/term) > blocks: > 66086 blocks > 51870 terms-only blocks > 47 sub-block-only blocks > 14169 mixed blocks > 13599 floor blocks > 22862 non-floor blocks > 43224 floor sub-blcoks > 18289568 term suffix bytes (276.8 suffix-bytes/block) > 4174480 term stas bytes (63.2 stats-bytes/block) > 7632796 other bytes (115.5 stats-bytes/block) > by prefix length: > 0: 1 > 1: 683 > 2: 10782 > 3. 17133 > > etc... There are term distributions that are "better", but there should be no "adversary" case that cripples the performance. E.g. if you use UIDs, the best performance should be 0-prefixed base-N ints, where N is over 25 and well below 48, is best, and they are sequentially assigned. Random UIDs are worst! But, no distribution by prefix length should be "alarming" ... Mike McCandless http://blog.mikemccandless.com --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org