From: Grant Ingersoll
To: dev@mahout.apache.org
Subject: Re: What about a universal input data handling mechanism for Mahout?
Date: Tue, 26 Jul 2011 05:50:32 -0400

We do have:

SequenceFilesFromCsvFilter, although it is somewhat basic
CSVVectorIterator, which takes a CSV file and produces a dense vector

On Jul 26, 2011, at 3:58 AM, Ted Dunning wrote:

> The critical design step here is to decide how to express the schema of the
> CSV file. There is a beginning of this in the CsvRecordFactory, but I was
> never happy with the (lack of) speed.
>
> On Tue, Jul 26, 2011 at 12:10 AM, Sebastian Schelter wrote:
>
>>> 2. SequenceFile is not file format that command line users can
>>> prepare, is there tool for converting CSV files into SequenceFiles
>>>
>>
>> I don't think we have that yet, but it would be very useful imho.
>>

--------------------------
Grant Ingersoll