From: "Martin Oberhuber (Commented) (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Date: Wed, 30 Nov 2011 11:21:41 +0000 (UTC)
Message-ID: <292421437.25760.1322652101076.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <311676468.18760.1322504979934.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-3607) Lucene Index files can not be reproduced faithfully (due to timestamps embedded)
Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm

[ 
https://issues.apache.org/jira/browse/LUCENE-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159983#comment-13159983 ]

Martin Oberhuber commented on LUCENE-3607:
------------------------------------------

Hi all,

thanks for the many comments. I understand that there's no desire to change behavior that has been working (and documented!) for years.

What about a different approach: would it be possible to write a small Java "main" that normalizes an index, very much like "stripping" an EXE? That way I could postprocess my indexes (which are meant for distribution with our product), while Lucene at its core could continue working as it does today.

Regarding some other comments:

- Our main reason for shipping a pre-built index is "initial search" performance. In a large Eclipse-based product, generating the docs index on initial search can take approx. 4 minutes on a decent computer. With everything pre-indexed, initial search can proceed after 10 seconds. That's an important usability issue for our help system. Another reason is the desire to find any index building errors at build time (where we can investigate them) rather than at runtime.
- We do have both the build environment and the deployment environment under full control (same Lucene version, same JVM version, same ICU version, all our content is en_US).
- Regarding heuristics: sure, the search is heuristic at runtime, but that's a very different thing from having a heuristic build environment. Having identical input produce identical output is still desirable.
- The issue of different analyzers being used at index generation time vs. runtime has indeed bitten us in the past (see [[https://bugs.eclipse.org/bugs/show_bug.cgi?id=219928#c16]]). In my personal opinion, the choice of analyzer should be bound to the content, and not to the search environment, since in many cases the language of the search string will not be known, but the language of the documents / index is known.
  Right now, the best workaround for this at Eclipse is launching Eclipse with a "-nl en_US" argument to force the US locale when I know all the docs are US English. But that won't work at all in an environment where some docs are English and others are German, a very common scenario with software products on Eclipse (the main product may be localized but some plugins are not).

  Is that "analyzer binding to content vs. binding to search" issue already known and discussed at Lucene? I.e., is it possible to have parts of the index (the US one) searched with a US analyzer but other parts (the German one) with a German analyzer? And why does the German analyzer truncate words at "." while the US one does not (see [[https://bugs.eclipse.org/bugs/show_bug.cgi?id=219928#c18]])?

> Lucene Index files can not be reproduced faithfully (due to timestamps embedded)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3607
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 2.9.1
>         Environment: Eclipse 3.7
>            Reporter: Martin Oberhuber
>            Assignee: Michael McCandless
>
> Eclipse 3.7 uses Lucene 2.9.1 for indexing online help content. A pre-generated help index can be shipped together with online content. As per
> [[https://bugs.eclipse.org/bugs/show_bug.cgi?id=364979]]
> it turns out that the help index can not be faithfully reproduced during a build, because there are timestamps embedded in the index files, and the "NameCounter" field in segments_2 contains different contents on every build.
> Not being able to faithfully reproduce the index from identical source bits undermines trust in the index (and software delivery) being correct.
> I'm wondering whether this is a known issue and/or has been addressed in a newer Lucene version already?

--
This message is automatically generated by JIRA.
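
Editorial sketch (not part of the original message): the comment above argues that identical input should produce identical output. One way to check that property at build time, without knowing anything about the Lucene file format, might be to fingerprint the index directory produced by two builds and compare the results. The class name IndexFingerprint and its methods are hypothetical and are not a Lucene API. The sketch assumes a flat index directory (no subdirectories) and uses only the Java standard library. It detects differences such as embedded timestamps; it does not normalize or remove them.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper (not part of Lucene): fingerprints a flat directory of build output. */
public class IndexFingerprint {

    /** Returns a hex SHA-256 over the names, sizes and contents of the regular files in dir. */
    public static String fingerprint(Path dir) throws IOException, NoSuchAlgorithmException {
        List<Path> files = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dir)) {
            entries.filter(Files::isRegularFile).forEach(files::add);
        }
        // Sort by file name so the result does not depend on directory listing order.
        files.sort(Comparator.comparing((Path p) -> p.getFileName().toString()));

        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (Path file : files) {
            byte[] content = Files.readAllBytes(file);
            sha.update(file.getFileName().toString().getBytes(StandardCharsets.UTF_8));
            sha.update((byte) 0); // separator between name and size
            sha.update(ByteBuffer.allocate(Long.BYTES).putLong(content.length).array());
            sha.update(content);
        }

        StringBuilder hex = new StringBuilder();
        for (byte b : sha.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Compares two directories; exit code 0 means identical, 1 means different. */
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java IndexFingerprint <indexDirA> <indexDirB>");
            System.exit(2);
        }
        String a = fingerprint(Paths.get(args[0]));
        String b = fingerprint(Paths.get(args[1]));
        System.out.println(args[0] + "  " + a);
        System.out.println(args[1] + "  " + b);
        System.exit(a.equals(b) ? 0 : 1);
    }
}
```

Run against the index directories from two builds of the same sources, a non-zero exit code would indicate that the outputs differ byte for byte, which is the situation described in the issue.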