From: Soeren Pekrul
To: java-user@lucene.apache.org
Date: Thu, 14 Dec 2006 20:16:56 +0100
Subject: Re: Lucene & LSA

mariolone wrote:
> They are successful to extract the matrix.
> But with collections of large documents is not one too much expensive
> solution?

I have a fairly small collection: 14,960 documents and 29,828 unique
terms. If I remember correctly, it took a few minutes on an ordinary
laptop to iterate over the terms and documents. I stored the matrix in
MySQL:

    CREATE TABLE term_document_matrix (
        term     VARCHAR(32) NOT NULL,
        document INT         NOT NULL,
        weight   DOUBLE      NOT NULL DEFAULT 0,
        PRIMARY KEY (term, document)
    );

As you can see, it is not a real matrix, just an ordinary table in the
relational model. I stored only the weights greater than 0, so I have
far fewer entries than the 14,960 x 29,828 = 446,226,880 cells of the
full matrix (in my case 159,407).

> it is possible to extract the matrix from the indexing file?

I don't know of any API that extracts the matrix from the index file
directly.

Sören
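
The sparse layout described above can be illustrated with a minimal sketch. It is not the code used for the collection in this message: it assumes an in-memory SQLite database in place of MySQL, a hypothetical three-document toy corpus (DOCUMENTS), whitespace tokenization, and raw term counts as the weight, since the message does not say how the weights were computed. The helper names build_term_document_table and matrix_stats are made up for the example.

```python
import sqlite3
from collections import Counter

# Hypothetical toy corpus standing in for the real document collection.
DOCUMENTS = {
    1: "lucene builds an inverted index",
    2: "latent semantic analysis needs a term document matrix",
    3: "the index maps each term to its documents",
}


def build_term_document_table(documents):
    """Store only the non-zero cells of the term-document matrix."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of MySQL
    conn.execute(
        """
        CREATE TABLE term_document_matrix (
            term     VARCHAR(32) NOT NULL,
            document INT         NOT NULL,
            weight   DOUBLE      NOT NULL DEFAULT 0,
            PRIMARY KEY (term, document)
        )
        """
    )
    for doc_id, text in documents.items():
        # Assumption: weight is the raw count of the term in the document.
        counts = Counter(text.lower().split())
        conn.executemany(
            "INSERT INTO term_document_matrix (term, document, weight) "
            "VALUES (?, ?, ?)",
            [(term, doc_id, float(n)) for term, n in counts.items()],
        )
    conn.commit()
    return conn


def matrix_stats(conn, n_documents):
    """Return (stored_rows, full_matrix_cells, density)."""
    stored = conn.execute(
        "SELECT COUNT(*) FROM term_document_matrix"
    ).fetchone()[0]
    n_terms = conn.execute(
        "SELECT COUNT(DISTINCT term) FROM term_document_matrix"
    ).fetchone()[0]
    full = n_terms * n_documents
    return stored, full, stored / full


conn = build_term_document_table(DOCUMENTS)
stored, full, density = matrix_stats(conn, len(DOCUMENTS))
print(f"{stored} stored rows instead of {full} cells ({density:.1%})")
```

Applying the same ratio to the figures in the message, 159,407 stored rows out of 446,226,880 cells is about 0.036% of the full matrix, which is why the row-per-non-zero-cell table stays small.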