Subject: Re: capturing field length into a stored document field
From: Grant Ingersoll
Date: Fri, 4 Sep 2009 17:27:23 -0500
To: solr-user@lucene.apache.org
In-Reply-To: <25297690.post@talk.nabble.com>

The Similarity.lengthNorm() is a callback from Lucene that gives
you the information you seek. Of course, the trick is still how to use
it. Perhaps you can describe a bit more about why you need that length.

On Sep 4, 2009, at 11:34 AM, mike.schultz wrote:

> For various statistics I collect from an index, it's important for me to
> know the length (measured in tokens) of a document field. I can get that
> information to some degree from the "norms" for the field, but a) the
> resolution isn't that great, and b) more importantly, if boosts are used
> it's almost impossible to get lengths from this.
>
> Here are two ideas I was thinking about that maybe someone can comment on.
>
> 1) Use copyField to copy the field in question, fieldA, to an additional
> field, fieldALength, which has an extra filter that just counts the
> tokens and only outputs a token representing the length of the field.
> This has the disadvantage of retokenizing basically the whole document
> (because the field in question is basically the body). Plus, I would
> think littering the term space with these tokens might be bad for
> performance; I'm not sure.
>
> 2) Add a filter to the field in question which again counts the tokens.
> This filter allows the regular tokens to be indexed as usual but somehow
> manages to get the token count into a stored field of the document. This
> has the advantage of not having to retokenize the field, and instead of
> littering the token space, the count becomes docdata for each doc. Can
> this be done? Maybe using a ThreadLocal to temporarily store the count?
>
> Thanks.
>
> --
> View this message in context: http://www.nabble.com/capturing-field-length-into-a-stored-document-field-tp25297690p25297690.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search
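
To make the boost problem from the quoted message concrete, here is a rough sketch in plain Python (not Lucene code). It assumes the default-style length norm of 1/sqrt(number of tokens) with the boost multiplied in at index time. The function names are made up for illustration, and the one-byte encoding that causes the resolution loss is not modeled.

```python
import math


def length_norm(num_tokens: int) -> float:
    """Default-style length norm: 1 / sqrt(number of tokens)."""
    return 1.0 / math.sqrt(num_tokens)


def stored_norm(num_tokens: int, boost: float = 1.0) -> float:
    """Value folded into the norm at index time: boost * length norm.

    A real index also squeezes this into a single byte, which is where
    the resolution is lost; that step is left out of this sketch.
    """
    return boost * length_norm(num_tokens)


def length_from_norm(norm: float, assumed_boost: float = 1.0) -> int:
    """Invert a norm back to a token count, given a guess at the boost."""
    return round((assumed_boost / norm) ** 2)


unboosted = stored_norm(400)
boosted = stored_norm(400, boost=2.0)

# With no boost, the length comes back as expected.
print(length_from_norm(unboosted))

# With an unknown boost of 2.0, the same 400-token field looks like 100 tokens.
print(length_from_norm(boosted))

# Only if the boost is known can the length be recovered.
print(length_from_norm(boosted, assumed_boost=2.0))
```

Since the boost is not kept separately from the norm, the second case is the usual one, which is why capturing the raw token count at index time (for example in a lengthNorm() callback or a counting filter) looks more promising than decoding norms after the fact.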