Subject: Re: How to load lines into Hive while breaking them by words?
From: Adriaan Tijsseling
Date: Tue, 27 Sep 2011 09:20:53 +0200
Message-Id: <75CDA121-B28B-4DB3-811F-067C3A2137C7@tijsseling.com>
To: user@hive.apache.org

Use a RegexSerDe to split the text into words. There's documentation on the Hive wiki.

But it might be better to use a script. See the post by Shouguo Li earlier on this mailing list.

After all, when you use a Python script, for example, you could use the Natural Language Toolkit (NLTK) to get a much better function for splitting text into a list of words.

Adriaan

On 2011/09/27, at 05:40, Mark Kerzner wrote:

> Hi,
>
> a simple question - if I have a book as a text, and I want to load it into a
> Hive table, with one word forming one entry, how should I do it?
>
> Thank you,
> Mark
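
A minimal sketch of the script approach suggested in the reply above, assuming the book has first been loaded into a one-column staging table with one line of text per row. This is not the script from Shouguo Li's post, which is not reproduced here. The file name split_words.py, the table book_lines, the columns line and word, and the --stream flag are all hypothetical names chosen for illustration. The tokenizer is a plain standard-library regular expression; an NLTK tokenizer could be swapped into split_words() if NLTK is installed on the cluster nodes.

```python
#!/usr/bin/env python
"""Emit one word per output line, for use with Hive's TRANSFORM clause.

Hypothetical usage from the Hive CLI (table and column names are assumptions):

    ADD FILE split_words.py;
    SELECT TRANSFORM(line) USING 'python split_words.py --stream' AS word
    FROM book_lines;
"""
import re
import sys

# A word is a run of letters or digits, optionally followed by
# apostrophe-joined parts, so "Mark's" stays one token.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")


def split_words(line):
    """Return the lowercased words found in one line of text."""
    return [w.lower() for w in WORD_RE.findall(line)]


def main(stream_in, stream_out):
    """Read lines from stream_in and write one word per line to stream_out."""
    for line in stream_in:
        for word in split_words(line):
            stream_out.write(word + "\n")


# The --stream flag is an illustrative choice: the script only reads stdin
# when asked to, so importing it or running it bare does nothing.
if __name__ == "__main__" and sys.argv[1:] == ["--stream"]:
    main(sys.stdin, sys.stdout)
```

Hive passes each input row to the script on standard input and reads each output line back as a row, so every word becomes one entry in the result. Lowercasing and the exact definition of a word are design choices that may need adjusting for the text at hand.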