From: Harsh J
Date: Wed, 9 Apr 2014 00:00:57 +0530
Subject: Re: MapReduce for complex key/value pairs?
Reply-To: user@hadoop.apache.org

Yes, you can write custom Writable classes that describe and serialise
your required data structure. If you have Hadoop: The Definitive Guide,
check out the "Serialization" section in the "Hadoop I/O" chapter.

On Tue, Apr 8, 2014 at 9:16 PM, Natalia Connolly wrote:
> Dear All,
>
> I was wondering if the following is possible using MapReduce.
>
> I would like to create a job that loops over a bunch of documents,
> tokenizes them into ngrams, and stores the ngrams and not only the counts of
> ngrams but also _which_ document(s) had this particular ngram. In other
> words, the key would be the ngram but the value would be an integer (the
> count) _and_ an array of document id's.
>
> Is this something that can be done? Any pointers would be appreciated.
>
> I am using Java, btw.
>
> Thank you,
>
> Natalia Connolly
>

--
Harsh J
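
For illustration, here is a minimal sketch of the kind of custom value class the reply describes: a count plus the IDs of the documents that contained the ngram. The class name NgramPostings, the choice of String document IDs, and the add/merge/toBytes/fromBytes helpers are all assumptions made for this example; they do not come from the thread or from Hadoop. To keep the sketch compilable with only the JDK, the class does not declare the Hadoop interface. In a real job it would presumably declare "implements org.apache.hadoop.io.Writable", whose write(DataOutput) and readFields(DataInput) methods have the signatures used here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical value type: total occurrences of an ngram plus the IDs of
// the documents it appeared in. In a Hadoop job this class would declare
// "implements org.apache.hadoop.io.Writable" (assumption: IDs are Strings).
final class NgramPostings {
    private int count;
    private final Set<String> docIds = new LinkedHashSet<>();

    // Records one occurrence of the ngram in the given document.
    void add(String docId) {
        count++;
        docIds.add(docId);
    }

    // Folds another partial result into this one, as a reducer might.
    void merge(NgramPostings other) {
        count += other.count;
        docIds.addAll(other.docIds);
    }

    int getCount() {
        return count;
    }

    List<String> getDocIds() {
        return new ArrayList<>(docIds);
    }

    // Serialises the count, then the number of IDs, then each ID.
    void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeInt(docIds.size());
        for (String id : docIds) {
            out.writeUTF(id);
        }
    }

    // Reads the fields back in the same order. Old state is cleared first
    // because Hadoop may reuse a single instance across records.
    void readFields(DataInput in) throws IOException {
        count = in.readInt();
        int n = in.readInt();
        docIds.clear();
        for (int i = 0; i < n; i++) {
            docIds.add(in.readUTF());
        }
    }

    // Convenience helper for trying the round trip outside Hadoop.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Convenience helper: rebuilds an instance from toBytes() output.
    static NgramPostings fromBytes(byte[] bytes) {
        NgramPostings p = new NgramPostings();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            p.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }
}
```

With a class like this, a mapper could emit the ngram as a Text key and one NgramPostings per occurrence as the value, and the reducer could merge the values for each key. Whether a set of IDs or a per-document count map is the better value layout depends on what the downstream consumer needs.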