lucene-dev mailing list archives

From "Martin Oberhuber (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3607) Lucene Index files can not be reproduced faithfully (due to timestamps embedded)
Date Wed, 30 Nov 2011 11:21:41 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159983#comment-13159983 ]

Martin Oberhuber commented on LUCENE-3607:
------------------------------------------

Hi all,

thanks for the many comments. I understand that there's no desire to change behavior that's
been working (and documented!) for years.

What about a different approach: would it be possible to write a small Java "main" that
normalizes an index, very much like "stripping" an EXE? That way I could post-process my indexes
(which are meant for distribution with our product), while at its core Lucene could continue
working as it does today.
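
As a first step towards such a normalizer, it would help to know exactly which bytes change between two builds. The following is only a minimal sketch, assuming Java 8 or later; the class name IndexFileDiff and its method are hypothetical and not part of Lucene. It lists the byte offsets at which two builds of the same index file (for example segments_2) differ, and it does not interpret the Lucene file format in any way.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Lucene API): reports where two files differ. */
class IndexFileDiff {

    /**
     * Returns the offsets at which the two files differ. If the lengths
     * differ, the length of the shorter file is appended as a final offset.
     */
    static List<Long> differingOffsets(Path a, Path b) throws IOException {
        byte[] x = Files.readAllBytes(a);
        byte[] y = Files.readAllBytes(b);
        List<Long> offsets = new ArrayList<>();
        int common = Math.min(x.length, y.length);
        for (int i = 0; i < common; i++) {
            if (x[i] != y[i]) {
                offsets.add((long) i);
            }
        }
        if (x.length != y.length) {
            offsets.add((long) common);
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java IndexFileDiff <fileFromBuildA> <fileFromBuildB>");
            return;
        }
        System.out.println(differingOffsets(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```

Running this on the same file from two builds of identical sources would show whether the differences are confined to a few fixed positions (as one would expect for a timestamp or a counter) or are spread throughout the file.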

Regarding some other comments,

- Our main reason for shipping a pre-built index is "initial search" performance. In a large
Eclipse-based product, generating the docs index on initial search can take approximately 4 minutes
on a decent computer. With everything pre-indexed, initial search can proceed after 10 seconds.
That's an important usability issue for our help system. Another reason is the desire to find
any index building errors at build time (where we can investigate them) rather than at runtime.

- We do have both the build environment and the deployment environment under full control
(same Lucene version, same JVM version, same ICU version, all our content is en_US).

- Regarding heuristics: sure, the search is heuristic at runtime, but that's a very different
thing from having a heuristic build environment. Having identical input produce identical
output is still desirable.

- The issue of different analyzers being used at index generation time vs. runtime has indeed bitten
us in the past (see [[https://bugs.eclipse.org/bugs/show_bug.cgi?id=219928#c16]]). In my personal
opinion, the choice of analyzer should be bound to the content, and not to the search environment,
since in many cases the language of the search string will not be known, but the language
of the documents / index is known. Right now, the best workaround for this at Eclipse is launching
Eclipse with a "-nl en_US" argument to force the US locale when I know all the docs are US.
But that won't work at all in an environment where some docs are English and others are German,
a very common scenario with software products on Eclipse (the main product may be localized but
some plugins are not).
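
The "identical input produces identical output" requirement above could be checked at build time by fingerprinting the whole index directory and comparing the result across two builds. This is a minimal sketch, assuming Java 8 or later; the class name IndexDigest is hypothetical, nothing here is a Lucene API, and the digest covers file names and raw bytes only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not a Lucene API): fingerprints an index directory. */
class IndexDigest {

    /** Returns a hex SHA-256 over the relative names and contents of all files under dir. */
    static String digest(Path dir) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Sort so the result does not depend on directory listing order.
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        for (Path f : files) {
            // Include the relative name so that renamed files change the digest.
            sha.update(dir.relativize(f).toString().getBytes(StandardCharsets.UTF_8));
            sha.update((byte) 0);
            sha.update(Files.readAllBytes(f));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sha.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java IndexDigest <indexDirA> <indexDirB>");
            return;
        }
        boolean same = digest(Paths.get(args[0])).equals(digest(Paths.get(args[1])));
        System.out.println(same ? "indexes are byte-identical" : "indexes differ");
    }
}
```

With the index files as they are today, two builds from the same sources would be expected to report "indexes differ", since timestamps are embedded in the files. Because relative paths are hashed as strings, the comparison is only meaningful when both builds run on the same platform.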

Is that "analyzer binding to content vs. binding to search" issue known and discussed at Lucene
already? I.e. is it possible to have parts of the index (the US one) searched with a US
analyzer but other parts (the German one) with a German analyzer? And why does the German
analyzer truncate words at "." while the US one does not (see [[https://bugs.eclipse.org/bugs/show_bug.cgi?id=219928#c18]])?
                
> Lucene Index files can not be reproduced faithfully (due to timestamps embedded)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3607
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 2.9.1
>         Environment: Eclipse 3.7
>            Reporter: Martin Oberhuber
>            Assignee: Michael McCandless
>
> Eclipse 3.7 uses Lucene 2.9.1 for indexing online help content. A pre-generated help
> index can be shipped together with online content. As per
>    [[https://bugs.eclipse.org/bugs/show_bug.cgi?id=364979 ]]
> it turns out that the help index can not be faithfully reproduced during a build, because
> there are timestamps embedded in the index files, and the "NameCounter" field in segments_2
> contains different contents on every build.
> Not being able to faithfully reproduce the index from identical source bits undermines
> trust in the index (and software delivery) being correct.
> I'm wondering whether this is a known issue and/or has been addressed in a newer Lucene
> version already ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

