From: "Robichaud, Jean-Philippe"
To: java-user@lucene.apache.org
Subject: RE: "Catalog" backend for document stored fields?
Date: Mon, 6 Nov 2006 11:50:37 -0500

[Sorry for the long delay in answering; we are having some issues with our mail server...]

Thanks for your comment. Yes, it would make sense if the log files were not so big. In fact, I'm only indexing a subset of the log information. Because I store the information in Lucene, it is easier and faster to retrieve. For example, generating a report by reading the logs themselves takes ~18 hours. Converting the logs to XML and indexing them takes ~12 hours, and each report then takes <20 minutes to generate. Since we often generate (different) reports, the Lucene approach is much faster.
[Note that I wrote both classes, the one that converts the logs to XML and the one that indexes them. The log2xml step is quite useful because the XML really has all the log information, and compressing the XML with xmlppm reduces the size of the logs by 97%. This way I can archive the log.xml without wasting much space.]

As another comparison point: the logs take 100 GB per week, while the indices take 35 GB per month! This design is already better than the first approach, but I'm trying to improve it further.

I really do think that this dictionary/catalog approach could benefit other Lucene users. I'm not against the idea of doing it myself. I just need some pointers and guidelines from all the gurus out there!

Thanks for all your help!

Jp

-----Original Message-----
From: Doron Cohen [mailto:DORONC@il.ibm.com]
Sent: Tuesday, October 24, 2006 1:50 AM
To: java-user@lucene.apache.org
Subject: Re: "Catalog" backend for document stored fields?

> I'm indexing logs from a transaction-based application.
> ...
> millions documents per month, the size of the indices is ~35 gigs per month
> (that's the lower bound). I have no choice but to 'store' each field values
> (as well as indexing/tokenizing them) because I'll need to retrieve them in
> order to create various reports. Also, I have a backlog of ~2 years of logs
> to index!
> ...
> 1- is there someone out there that already wrote an extension to
> Lucene so that 'stored' string for each document/field is in fact stored in
> a centralized repository? Meaning, only an 'index' is actually stored in the
> document and the real data is put somewhere else.

Do you gain anything from storing the document fields within Lucene? If not, especially if the log files are kept somewhere, you could make all 'content' fields unstored (which reduces the index size) and add a stored, non-indexed ID field. It can also be a POINTER field - e.g. . At search time, for the documents found, you can retrieve this ID/POINTER field and then fetch the document from the (original) log file.
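The pointer idea above can be sketched in plain Java, without touching Lucene itself. This is only an illustration under assumptions: the class name LogPointer, the "offset:length" encoding, and the one-record-per-line log layout are all hypothetical and are not taken from the thread. The encoded string is what one might put in the stored, non-indexed field, while the content fields stay indexed but unstored.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pointer into a log file: byte offset plus record length. */
final class LogPointer {
    final long offset;
    final int length;

    LogPointer(long offset, int length) {
        this.offset = offset;
        this.length = length;
    }

    /** Encodes the pointer as "offset:length" (an assumed format) for a stored field. */
    String encode() {
        return offset + ":" + length;
    }

    /** Parses a string produced by encode(). */
    static LogPointer decode(String s) {
        int sep = s.indexOf(':');
        return new LogPointer(Long.parseLong(s.substring(0, sep)),
                              Integer.parseInt(s.substring(sep + 1)));
    }

    /**
     * Returns one pointer per non-empty line, assuming '\n' line endings.
     * Reads the whole file for brevity; an indexer for very large logs
     * would record offsets while streaming instead.
     */
    static List<LogPointer> scan(Path log) throws IOException {
        byte[] bytes = Files.readAllBytes(log);
        List<LogPointer> pointers = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n') {
                if (i > start) {
                    pointers.add(new LogPointer(start, i - start));
                }
                start = i + 1;
            }
        }
        if (start < bytes.length) {
            // Last record without a trailing newline.
            pointers.add(new LogPointer(start, bytes.length - start));
        }
        return pointers;
    }

    /** Fetches the record a pointer refers to by seeking directly into the log. */
    static String fetch(Path log, LogPointer p) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(log.toFile(), "r")) {
            byte[] buf = new byte[p.length];
            raf.seek(p.offset);
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}
```

At search time, the report code would read the pointer string from each hit, call LogPointer.decode on it, and pass the result to LogPointer.fetch. The trade-off is that the original log files must stay available and unchanged, since every pointer is only valid against the exact bytes it was computed from.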
Makes sense?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org