Date: Tue, 8 Apr 2014 11:46:42 -0400
Subject: MapReduce for complex key/value pairs?
From: Natalia Connolly
To: user@hadoop.apache.org

Dear All,

I was wondering if the following is possible using MapReduce.

I would like to create a job that loops over a bunch of documents, tokenizes them into ngrams, and stores not only the count of each ngram but also _which_ document(s) contained it. In other words, the key would be the ngram, but the value would be an integer (the count) _and_ an array of document IDs.

Is this something that can be done? Any pointers would be appreciated.

I am using Java, btw.

Thank you,

Natalia Connolly
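P.S. To make the question concrete, a minimal sketch of the kind of value described above might look like the following. It is plain Java with no Hadoop dependency, and it only simulates the map and reduce steps locally. All names in it (NgramPostings, NgramIndexSketch, ngrams, buildIndex, roundTrip) are hypothetical and chosen for illustration. The write/readFields pair has the same shape as the two methods of Hadoop's org.apache.hadoop.io.Writable interface, on the assumption that a real job would have the value class implement that interface; whitespace tokenization and string document IDs are also assumptions.

```java
import java.io.*;
import java.util.*;

/** Hypothetical value for one ngram: total count plus the IDs of the documents containing it. */
final class NgramPostings {
    private int count;
    private final SortedSet<String> docIds = new TreeSet<>();

    /** Records one occurrence of the ngram in the given document. */
    void add(String docId) { count++; docIds.add(docId); }

    /** Folds another partial result into this one, as a reducer would. */
    void merge(NgramPostings other) { count += other.count; docIds.addAll(other.docIds); }

    int getCount() { return count; }
    List<String> getDocIds() { return new ArrayList<>(docIds); }

    // Same shape as Writable.write(DataOutput): count, number of IDs, then each ID.
    void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeInt(docIds.size());
        for (String id : docIds) out.writeUTF(id);
    }

    // Same shape as Writable.readFields(DataInput): resets state, then reads it back.
    void readFields(DataInput in) throws IOException {
        count = in.readInt();
        docIds.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) docIds.add(in.readUTF());
    }
}

final class NgramIndexSketch {
    /** Map-side step: word-level ngrams of size n (assumes whitespace tokenization). */
    static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return result;
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    /** Local stand-in for map + shuffle + reduce: ngram -> (count, document IDs). */
    static Map<String, NgramPostings> buildIndex(Map<String, String> docs, int n) {
        Map<String, NgramPostings> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String gram : ngrams(doc.getValue(), n)) {
                index.computeIfAbsent(gram, k -> new NgramPostings()).add(doc.getKey());
            }
        }
        return index;
    }

    /** Serializes and deserializes a value to check that the wire format is symmetric. */
    static NgramPostings roundTrip(NgramPostings p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            NgramPostings copy = new NgramPostings();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In an actual job, the mapper would presumably emit (ngram, value with one document ID) pairs and the reducer would call something like merge on all values for the same key; the sketch keeps both steps in one in-memory loop only to stay runnable without a cluster.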