From: Dmitriy Lyubimov
Date: Mon, 6 Jun 2016 10:02:27 -0700
Subject: Re: mahout tf-idf vs lucene tf-idf
To: "user@mahout.apache.org"
Mailing-List: contact user-help@mahout.apache.org; run by ezmlm

To add to Ted's reply: Mahout has traditionally offered bigram/trigram
analysis as part of its tf-idf conversion. This is a step away from the
bag-of-words model: directional, statistically stable combinations of 2 or 3
words are reduced to their own term. However, this has not been ported to the
Spark/H2O/Flink engines and is available only as a legacy MapReduce algorithm.

On Sat, Jun 4, 2016 at 2:14 AM, forme book wrote:

> Hi,
>
> I'm start to study text processing and I see that for evaluating two text
> is possible to obtaing vector model through TF-IDF technique.
>
> With Mahout is possible to create vectors from text with the use of
> lucene.vector, if I have not misheard takes a lucene index and then map as
> a tf-idf,
>
> On the (Lucene side) has already by default this implementations, what I do
> struggle to understand what is the advantage of having lucene.vector in
> mahout when Lucene offer that feature out of the box ?
>
> Maybe I'm missing something big but what's the Connection Between then ?
> could you please explain a possible user case ?
>
> Thanks for help
>
> Richard
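
P.S. To make the bigram idea concrete, here is a minimal Python sketch of
folding "statistically stable" word pairs into a tf-idf vector as their own
terms. It is an illustration only, not Mahout's code or API: the function
names (llr, tfidf_with_bigrams), the min_llr threshold, the whitespace
tokenizer, the "a_b" term naming and the use of a log-likelihood ratio as the
stability test are all assumptions made for the example.

```python
import math
from collections import Counter


def _xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    # Unnormalised entropy of a list of counts.
    return _xlogx(sum(counts)) - sum(_xlogx(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of pair counts.

    k11: pairs (a, b); k12: pairs (a, not b);
    k21: pairs (not a, b); k22: pairs (not a, not b).
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def tfidf_with_bigrams(docs, min_llr=1.0):
    """Return one {term: weight} dict per document (hypothetical helper).

    Ordered word pairs whose score reaches min_llr are added as extra
    terms named "first_second", alongside the plain unigrams.
    """
    tokenized = [d.lower().split() for d in docs]

    # Count ordered (directional) adjacent pairs over the whole corpus.
    pairs, heads, tails = Counter(), Counter(), Counter()
    for toks in tokenized:
        for a, b in zip(toks, toks[1:]):
            pairs[(a, b)] += 1
            heads[a] += 1
            tails[b] += 1
    total = sum(pairs.values())

    # Keep only the pairs that pass the association threshold.
    kept = set()
    for (a, b), k11 in pairs.items():
        k12 = heads[a] - k11
        k21 = tails[b] - k11
        k22 = total - k11 - k12 - k21
        if llr(k11, k12, k21, k22) >= min_llr:
            kept.add((a, b))

    # Build the term list per document: unigrams plus kept bigrams.
    term_lists = []
    for toks in tokenized:
        terms = list(toks)
        terms += [a + "_" + b for a, b in zip(toks, toks[1:]) if (a, b) in kept]
        term_lists.append(terms)

    # Plain tf * log(N / df) weighting, with no smoothing or normalisation.
    df = Counter(t for terms in term_lists for t in set(terms))
    n_docs = len(docs)
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in Counter(terms).items()}
        for terms in term_lists
    ]
```

Calling tfidf_with_bigrams(["new york is big", "new york is busy"]) returns
one dict per document; whether "new_york" shows up as a term depends on the
threshold. Because pairs are counted in order, ("new", "york") and
("york", "new") are scored separately, which is the "directional" part.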