Date: Thu, 17 Nov 2011 20:07:51 +0000 (UTC)
From: "Hoss Man (Commented) (JIRA)"
To: dev@lucene.apache.org
Message-ID: <1131155308.39930.1321560471040.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1251950656.31298.1319808152214.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (SOLR-2864) DataImportHandler has non-deterministic sort order for XML files

[ 
https://issues.apache.org/jira/browse/SOLR-2864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152287#comment-13152287 ]

Hoss Man commented on SOLR-2864:
--------------------------------

Gabriel: one thing that's not clear to me from your patch is the interplay between the deterministic sorting and the recursion: if i'm reading this right, directories are sorted deterministically by modification date, then they are walked recursively -- as opposed to doing a recursive walk, then sorting all the matching files by date. i have no opinion on whether that's good or bad, but it seems worthy of consideration, testing, and documentation.

minor nits:

1) if the goal is to be deterministic, then *only* sorting on last mod date doesn't seem like enough -- two files can share the same modification time, so there should also be a secondary sort on something guaranteed to be unique (like fullpath or name), correct?

2) the CREATED_FIRST and CREATED_SECOND filename constants in your tests should be used consistently; there's no reason for multiple "a.xml" and "b.xml" strings in the test. using the constants everywhere will help make it clear that those files are being created in a specific order for a specific reason.

> DataImportHandler has non-deterministic sort order for XML files
> ----------------------------------------------------------------
>
>                 Key: SOLR-2864
>                 URL: https://issues.apache.org/jira/browse/SOLR-2864
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>    Affects Versions: 3.4
>            Reporter: Gabriel Cooper
>            Priority: Minor
>              Labels: dataimport, patch, xml
>             Fix For: 3.5
>
>         Attachments: lucene-2864.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> DataImportHandler's FileListEntityProcessor relies on Java's File.list() method to retrieve a list of files from the configured dataimport directory, but list() does not guarantee a sort order ^(1)^.
> This means that if you have two files that update the same record, the results are non-deterministic. Typically, list() does in fact return them lexicographically sorted, but this is not guaranteed ^(2)^.
>
> An example of how you can get into trouble is to imagine the following:
>
> xyz.xml -- Created one hour ago. Contains updates to records "Foo" and "Bar".
> abc.xml -- Created one minute ago. Contains updates to records "Bar" and "Baz".
>
> In this case, the newest file, abc.xml, would (likely, but not guaranteed) be run first, updating the "Bar" and "Baz" records. Next, the older file, xyz.xml, would update "Foo" and overwrite "Bar" with outdated changes.
>
> (1) Per http://download.oracle.com/javase/1,5,0/docs/api/java/io/File.html#list%28%29
> "There is no guarantee that the name strings in the resulting array will appear in any specific order; they are not, in particular, guaranteed to appear in alphabetical order."
>
> (2) Even if it were guaranteed, lexicographical sorting would give you the following sort order:
> 1.xml
> 10.xml
> 2.xml
> ...
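
For reference, the secondary sort asked for in nit 1 of the comment could look roughly like the sketch below. It is not taken from lucene-2864.patch: the class name DeterministicFileOrder, the constant BY_MTIME_THEN_PATH and the helper listSorted are made up for illustration, and it assumes a Java version that has Comparator.comparingLong (Java 8 or later, which is newer than what Solr 3.x targeted). It sorts a single directory only and does not address the recursion question raised above.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of DataImportHandler or the attached patch.
final class DeterministicFileOrder {

    // Primary key: last-modified time, oldest first.
    // Secondary key: absolute path, which is unique within one listing,
    // so two files with the same timestamp still get a fixed order.
    static final Comparator<File> BY_MTIME_THEN_PATH =
            Comparator.comparingLong(File::lastModified)
                      .thenComparing(File::getAbsolutePath);

    // Lists the direct children of dir in the order defined above.
    // listFiles() returns null if dir is not a readable directory,
    // in which case an empty array is returned.
    static File[] listSorted(File dir) {
        File[] files = dir.listFiles();
        if (files == null) {
            return new File[0];
        }
        Arrays.sort(files, BY_MTIME_THEN_PATH);
        return files;
    }
}
```

With the example from the issue description, xyz.xml (older) would sort ahead of abc.xml (newer), so the newer update to "Bar" is applied last. One caveat of this sketch: lastModified() is re-read on every comparison, so files touched while the sort is running could still produce an inconsistent order; capturing the timestamps once before sorting would avoid that.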