Date: Thu, 19 Feb 2015 08:44:08 -0500
Subject: Re: High frequency terms in results document....
From: Shouvik Bardhan
To: java-user@lucene.apache.org

Thanks for your input, Uchida. I will try that out. I wonder what the magic
sauce is in Luke's set of calls that lets it produce, say, the top 100 terms
even from an index with 100 million docs (small docs, though, in my case). It
looks like it goes through every term, puts them in a priority queue, and
takes the top N.

Regards.

On Thu, Feb 19, 2015 at 2:10 AM, Tomoko Uchida wrote:

> Hi,
>
> I'm afraid there is no easy or straightforward way to meet your requirement.
> I would try creating a tiny temporary index from the search results on the
> fly, in memory, and getting the top N terms from it with HighFreqTerms.
>
> http://lucene.apache.org/core/4_10_3/misc/org/apache/lucene/misc/HighFreqTerms.html
> (The logic is almost the same as Luke's top N terms feature.)
>
> I have not tried it and am not sure whether this approach is practical in
> terms of performance; it's just an idea...
>
> Hope this helps,
> Tomoko
>
> 2015-02-16 1:58 GMT+09:00 Shouvik Bardhan:
>
> > Apologies if I missed it in prior discussions, but I looked all over. I
> > looked at the Luke code, and it does find high frequency terms across the
> > entire index. I am trying to get the top N high frequency terms in the
> > documents returned from a search result. I came across something called
> > FilterIndexReader, but I don't think it is part of the 4.X codebase. Any
> > pointer is appreciated.
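
For illustration, here is a rough sketch of the "go through every term, put
them in a priority queue, take the top N" idea mentioned above, applied to the
documents of a result set. It is plain Java and does not use any Lucene API:
the class TopTerms, the methods docFreqs and topN, and the representation of
each result document as an already-tokenized list of strings are all
assumptions made up for this example, not Luke's or HighFreqTerms' actual
code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper class; the names here are not part of Lucene or Luke.
final class TopTerms {

    // Counts, for each term, how many of the given documents contain it.
    // Each document is assumed to be already tokenized into terms.
    public static Map<String, Long> docFreqs(List<List<String>> docs) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate so a term counts once per document.
            for (String term : new HashSet<>(doc)) {
                counts.merge(term, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Returns the n entries with the highest counts, highest first.
    public static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        Comparator<Map.Entry<String, Long>> byCount = Map.Entry.comparingByValue();
        // Min-heap on count: the head is the weakest of the current candidates.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the smallest so at most n entries remain
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(byCount.reversed());
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the tokenized field values of three search hits.
        List<List<String>> hits = Arrays.asList(
            Arrays.asList("lucene", "index", "lucene"),
            Arrays.asList("lucene", "index"),
            Arrays.asList("lucene", "query"));
        for (Map.Entry<String, Long> e : topN(docFreqs(hits), 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Because the heap never holds more than N entries, memory for the selection
step stays proportional to N rather than to the number of distinct terms,
which is presumably why this kind of scan remains workable on a large index.
The counting step here still keeps every distinct term of the result set in
a map.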