Subject: Re: Which database should I use with Mahout
From: Manuel Blechschmidt
To: user@mahout.apache.org
Date: Sun, 19 May 2013 20:01:37 +0200

Hi Tevfik,

one request to the recommender can turn into more than 1000 queries to
the database, depending on which recommender you use and the number of
preferences for the given user.

The problem is not whether you use SQL, NoSQL, or any other query
language. The problem is the latency of the answers.

An average TCP packet round trip within the same data center takes
500 µs. A main memory reference takes 0.1 µs. This means that the main
memory of your Java process can be accessed about 5000 times faster
than any other process, such as a database, that is reached via TCP/IP.

http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Here you can see a screenshot showing that database communication is by
far the slowest component of a recommender request (99%):

https://source.apaxo.de/MahoutDatabaseLowPerformance.png

If you do not want to cache your data in your Java process, you can use
a fully in-memory database technology such as SAP HANA
(http://www.saphana.com/welcome) or EXASOL (http://www.exasol.com/).

However, if you use one of these, you do not need Mahout anymore.

An architecture of a Mahout system can be seen here:
https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png

Hope that helps
    Manuel

On 19.05.2013 at 19:20, Sean Owen wrote:

> I'm first saying that you really don't want to use the database as a
> data model directly. It is far too slow.
> Instead you want to use a data model implementation that reads all of
> the data, once, serially, into memory. And in that case, it makes no
> difference where the data is being read from, because it is read just
> once, serially.
> A file is just as fine as a fancy database. In fact
> it's probably easier and faster.
>
> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin wrote:
>> Thanks Sean, but I could not get your answer. Can you please explain
>> it again?
>>
>>
>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen wrote:
>>> It doesn't matter, in the sense that it is never going to be fast
>>> enough for real-time at any reasonable scale if actually run off a
>>> database directly. One operation results in thousands of queries.
>>> It's going to read data into memory anyway and cache it there. So,
>>> whatever is easiest for you. The simplest solution is a file.
>>>
>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz wrote:
>>>> Hi,
>>>> I would like to use Mahout to make recommendations on my web site.
>>>> Since the data is going to be big, hopefully, I plan to use hadoop
>>>> implementations of the recommender algorithms.
>>>>
>>>> I'm currently storing the data in mysql. Should I continue with it
>>>> or should I switch to a nosql database such as mongodb or something
>>>> else?
>>>>
>>>> Thanks
>>>> Ahmet

-- 
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobile: 0173/6322621
Twitter: http://twitter.com/Manuel_B
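
P.S. A back-of-the-envelope version of the latency argument above, as a small Java sketch. It takes the two figures quoted in the message (500 µs per TCP round trip, 0.1 µs per main memory reference) and the "more than 1000 queries" estimate at face value. The class and method names are made up for illustration, and the model assumes the lookups are issued strictly one after another, with no batching, pipelining, or query execution time on the database side.

```java
import java.time.Duration;

public class LatencyEstimate {
    // Figures quoted in the message above: rough orders of magnitude, not measurements.
    public static final Duration TCP_ROUND_TRIP = Duration.ofNanos(500_000); // 500 microseconds
    public static final Duration MEMORY_REFERENCE = Duration.ofNanos(100);   // 0.1 microseconds

    /** Total time for the given number of lookups issued one after another. */
    public static Duration sequentialCost(long lookups, Duration perLookup) {
        return perLookup.multipliedBy(lookups);
    }

    /** How many times longer the slow lookup takes than the fast one. */
    public static long ratio(Duration slow, Duration fast) {
        return slow.toNanos() / fast.toNanos();
    }

    public static void main(String[] args) {
        long lookups = 1000; // assumed number of lookups for one recommender request

        Duration viaDatabase = sequentialCost(lookups, TCP_ROUND_TRIP);
        Duration inMemory = sequentialCost(lookups, MEMORY_REFERENCE);

        System.out.println("per-lookup ratio: " + ratio(TCP_ROUND_TRIP, MEMORY_REFERENCE));
        System.out.println("via database:     " + viaDatabase.toMillis() + " ms");
        System.out.println("in memory:        " + inMemory.toNanos() / 1000 + " microseconds");
    }
}
```

With these inputs the per-lookup ratio is 5000, and 1000 sequential lookups cost about 500 ms over TCP versus about 100 microseconds in memory. Under these assumptions, network round trips alone would already take half a second per request, which is why the choice of database engine matters much less than keeping the data in the Java process.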