Subject: Re: Getting the frequencies by corresponding order of documents were indexed
From: Erick Erickson
To: java-user@lucene.apache.org
Date: Mon, 14 May 2012 07:30:43 -0400

In general you can't rely on anything like this. I admit the merge
stuff isn't my area of expertise, but when segments are merged,
there's no guarantee that they're merged in order.

In general the internal Lucene doc ID should be treated as
predictable only for closed segments.

Your solution of using your own unique ID is much better.

Best
Erick

On Fri, May 11, 2012 at 8:50 AM, Ian Lea wrote:
> What version of Lucene are you using?  If not the latest, try that.
> If you really think there is a Lucene bug, post a small self-contained
> test case that demonstrates the problem.
>
>
> --
> Ian.
>
>
> On Fri, May 11, 2012 at 12:35 PM, Kasun Perera wrote:
>> On Fri, May 11, 2012 at 4:52 PM, Ian Lea wrote:
>>
>>> Can't spot anything obviously wrong in your code and what you are
>>> trying to do should work.  Are you positive that what you think is the
>>> second doc is really being added second?  You only show one doc being
>>> added.  Are there already 7 docs in the index before you start?
>>>
>>>
>>>
>> Hi Ian
>>
>> Yes, I'm sure the 2nd doc is added second, and I used the debugger several
>> times to confirm it. If I index 10 documents, I get 10 term frequency
>> vectors, but their positions are changed. I gave doc #2 as an example. #5th
>> term frequency vector corresponds to doc and so on.
>>
>> I figured out a way to overcome this, but it may not be efficient. I stored
>> another field at indexing time; based on the content of the new field I'm
>> able to map the doc to its term frequency vector. Is there any other
>> efficient way? Might this be a bug in Lucene?
>>
>> Thanks
>>
>>> --
>>> Ian.
>>>
>>>
>>> On Fri, May 11, 2012 at 8:58 AM, Kasun Perera wrote:
>>> > I have a collection of documents (say 10 documents) and I'm indexing
>>> > them this way, by storing the term vector
>>> >
>>> > StringReader strRdElt = new StringReader(content);
>>> >
>>> >
>>> >    Document doc = new Document();
>>> >
>>> >    String docname=docNames[docNo];
>>> >
>>> >    doc.add(new Field("doccontent", strRdElt, Field.TermVector.YES));
>>> >
>>> >    IndexWriter iW;
>>> >    try {
>>> >
>>> >        NIOFSDirectory dir = new NIOFSDirectory(new File(pathToIndex)) ;
>>> >
>>> >        iW = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_35,
>>> >
>>> >                new StandardAnalyzer(Version.LUCENE_35)));
>>> >
>>> >        iW.addDocument(doc);
>>> >        iW.close();
>>> >
>>> >    }
>>> >
>>> > After indexing all the documents, I'm getting the term frequencies of
>>> > each document this way
>>> >
>>> >
>>> > IndexReader re = IndexReader.open(FSDirectory.open(new
>>> > File(pathToIndex)), true) ;
>>> > TermFreqVector termsFreq[];
>>> > for(int i=0;i
>>> >        termsFreq[i] = re.getTermFreqVector(i, "doccontent");
>>> >
>>> >      }
>>> >
>>> > My problem is that I'm not getting the term frequency vectors in the
>>> > corresponding order. Say, for the 2nd document that I have indexed, I'm
>>> > getting its corresponding term frequencies and terms at "termsFreq[9]".
>>> >
>>> > What is the reason for that? How can I get the corresponding
>>> > term frequencies in the order that I have indexed the documents?
>>> >
>>> >
>>> > --
>>> > Regards
>>> >
>>> > Kasun Perera
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>>
>> --
>> Regards
>>
>> Kasun Perera
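
The advice in this thread (key results by your own unique ID instead of by the internal doc ID) can be illustrated with a small sketch. This is plain Java with no Lucene calls, and every name in it (`TermVectorsById`, `Entry`, `byDocName`) is hypothetical and not taken from the original messages. The `docName` values stand in for the extra stored field mentioned in the thread, and the term-frequency maps stand in for whatever the reader returns for each document. It assumes the stored names are unique.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Lucene-free sketch: key each term-frequency vector by a stored ID, not by position. */
final class TermVectorsById {

    /** Hypothetical stand-in for one index entry: a stored ID plus its term frequencies. */
    static final class Entry {
        final String docName;
        final Map<String, Integer> termFreqs;

        Entry(String docName, Map<String, Integer> termFreqs) {
            this.docName = docName;
            this.termFreqs = termFreqs;
        }
    }

    /**
     * Builds a docName -> term frequencies map. The input may arrive in any
     * order (for example, the order of internal doc IDs), because lookups
     * afterwards go through the stored name and never through the position.
     */
    static Map<String, Map<String, Integer>> byDocName(List<Entry> entriesInReaderOrder) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        for (Entry e : entriesInReaderOrder) {
            result.put(e.docName, e.termFreqs);
        }
        return result;
    }
}
```

With a map like this, the caller asks for the vector of a named document and no longer depends on which slot of an array it landed in.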