From: "Finan, Sean"
To: "dev@ctakes.apache.org"
Subject: RE: files vs strings in collection reader
Date: Tue, 7 May 2013 19:44:48 +0000

I don't think that File instantiation is slower than the AE process, and Tim is talking about tens of thousands of files in the directory tree.

The only filesystem call that should exist in any new File(..) is a normalize(..) or resolve(..) on the passed parameter(s), which should just be string manipulation and no actual I/O calls, native or otherwise. In other words, new File(..) should be fast.
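A minimal sketch of that point, assuming a plain JVM with no cTAKES or UIMA classes involved: it builds File objects for paths under a directory that is assumed not to exist, which works because the constructor only manipulates the path string. The class name FileCtorDemo, the helper buildFiles, and the directory name are hypothetical, and the timing printed by main is only a rough indication, not a benchmark.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class FileCtorDemo {
    // Hypothetical helper: builds `count` File objects under a directory
    // name that is assumed not to exist. Construction succeeds regardless,
    // because the constructor does not check the filesystem.
    static List<File> buildFiles(int count) {
        List<File> files = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            files.add(new File("no-such-dir-ctakes-demo", "note" + i + ".txt"));
        }
        return files;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        List<File> files = buildFiles(50_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(files.size() + " File objects built in " + elapsedMs + " ms");
        // exists() is a separate call, and it does consult the filesystem.
        System.out.println("first exists? " + files.get(0).exists());
    }
}
```

If the reader's initialization is slow for ~50k files, this suggests looking at the calls made on each File (exists(), isDirectory(), listing) rather than at the constructor itself.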
-----Original Message-----
From: ksarma@gmail.com [mailto:ksarma@gmail.com] On Behalf Of Karthik Sarma
Sent: Tuesday, May 07, 2013 3:26 PM
To: dev@ctakes.apache.org
Subject: Re: files vs strings in collection reader

Hmm, without having actually reviewed the code in cTAKES (I'm not on my work computer), my understanding of the "correct" way of doing this is to use the listFiles method on the directory File to get an array of Files; this should be implemented natively by the JVM and could be faster than individual initialization.

--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical Association
ksarma@ksarma.com
gchat: ksarma@gmail.com
linkedin: www.linkedin.com/in/ksarma

On Tue, May 7, 2013 at 12:17 PM, Tim Miller <timothy.miller@childrens.harvard.edu> wrote:
> The FilesInDirectoryCollectionReader creates an arraylist of
> java.io.File objects when it is initialized. For large datasets (~50k
> files) this is substantial time overhead and probably memory as well.
> Seems like it would be more efficient to use Strings instead of Files
> there and just open the File object when getNext() is called. It is
> pretty easy to implement, any downside to making this switch?
> Tim
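The proposal quoted above (hold Strings, create the File only in getNext()) could look roughly like the sketch below. It is not the actual FilesInDirectoryCollectionReader code: LazyPathReader is a hypothetical stand-alone class with no UIMA dependency, and it assumes the list of path strings has already been collected elsewhere.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a collection reader: keeps only path strings
// and creates each java.io.File lazily, when it is actually requested.
public class LazyPathReader {
    private final List<String> paths;
    private int index = 0;

    public LazyPathReader(List<String> paths) {
        // Defensive copy so later changes to the caller's list have no effect.
        this.paths = new ArrayList<>(paths);
    }

    public boolean hasNext() {
        return index < paths.size();
    }

    // Creates the File object for the next path and advances the cursor.
    public File getNext() {
        if (!hasNext()) {
            throw new NoSuchElementException("no more paths");
        }
        return new File(paths.get(index++));
    }

    public static void main(String[] args) {
        List<String> demo = new ArrayList<>();
        demo.add("notes/a.txt");
        demo.add("notes/b.txt");
        LazyPathReader reader = new LazyPathReader(demo);
        while (reader.hasNext()) {
            System.out.println(reader.getNext().getPath());
        }
    }
}
```

Whether this saves noticeable time or memory compared with holding File objects would need measuring on a real ~50k-file dataset; the sketch only shows that the switch itself is small.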