mahout-user mailing list archives

From Jeff Hansen <dsche...@gmail.com>
Subject Re: building a (weighted) movie similarity measure
Date Wed, 14 Sep 2011 19:14:30 GMT
Freebase.com was actually pretty nice for getting data like this --
the problem was that it took a good day just to figure out the basics
of their query language, which they call MQL.  The basic concept is
that you pass them a request with a JSON object and they send you back
the same object with the blanks filled in.  So imagine in SQL you
would say

select x, y from mytable where x = 'somevalue';
somevalue  y1
somevalue  y2
somevalue  y3

Instead, you send over a request with a JSON object that looks like
query={"x":"somevalue","y":[]} and you get back a JSON object that
looks like
{"x":"somevalue","y":["y1","y2","y3"]}
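To make the "blanks filled in" idea concrete, here is a toy sketch in Python that emulates it against a small in-memory list of rows. It is not Freebase's implementation and does not talk to their service; the names fill_blanks, is_blank and rows are made up for illustration, and only the null and empty-list blanks are handled.

```python
def is_blank(value):
    """A 'blank' is a placeholder the caller wants filled in (null or empty list)."""
    return value is None or value == []


def fill_blanks(query, rows):
    """Toy emulation of the query-by-example idea (hypothetical helper).

    Non-blank entries in `query` act as constraints (like a WHERE clause);
    blank entries are filled from the rows that match.
    """
    constraints = {k: v for k, v in query.items() if not is_blank(v)}
    matches = [
        row for row in rows
        if all(row.get(k) == v for k, v in constraints.items())
    ]
    result = dict(constraints)
    for key, blank in query.items():
        if not is_blank(blank):
            continue
        values = [row[key] for row in matches if key in row]
        if blank == []:
            result[key] = values  # empty list: return every match
        else:
            result[key] = values[0] if values else None  # null: a single value
    return result


# Made-up data standing in for the table in the SQL example.
rows = [
    {"x": "somevalue", "y": "y1"},
    {"x": "somevalue", "y": "y2"},
    {"x": "somevalue", "y": "y3"},
    {"x": "othervalue", "y": "y9"},
]

print(fill_blanks({"x": "somevalue", "y": []}, rows))
```

The point of the sketch is only that the shape of the request is also the shape of the answer, which is what makes the approach feel so different from SQL.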

The trick is knowing what kind of "blanks" to pass in your request,
because sometimes you pass null (only one possible match), sometimes
you pass an empty list, and sometimes you pass in an empty hash.  They
also limit the results they send back to something like 100 by default
-- it took me the better part of a day to figure out you just had to
pass in limit=10000 (or whatever you want to set it to) to exceed the
default.  The other frustrating part was figuring out that some things
came back as links to other documents, and I couldn't figure out how to
get the text of those documents.  It turned out that was a different
web service API request altogether
(http://www.freebase.com/api/trans/raw instead of
http://www.freebase.com/api/service/mqlread).
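As a rough illustration of the three kinds of blanks and the limit setting, the Python sketch below builds a request URL for the mqlread endpoint mentioned above. It only constructs the URL and does not contact the service. Several things here are assumptions rather than documented behaviour: that the JSON travels in a "query" parameter as written earlier in this message, that "limit" goes inside the query object itself, and the field names "single", "many" and "nested", which are placeholders.

```python
import json
from urllib.parse import urlencode, urlparse, parse_qs

# Endpoint URL as given in the message above.
MQLREAD_URL = "http://www.freebase.com/api/service/mqlread"


def build_mqlread_url(query, limit=None):
    """Return a GET URL carrying `query` as JSON (hypothetical helper)."""
    q = dict(query)  # shallow copy so the caller's dict is not modified
    if limit is not None:
        # ASSUMPTION: the limit is a key inside the query object.
        q["limit"] = limit
    return MQLREAD_URL + "?" + urlencode({"query": json.dumps(q)})


# Placeholder field names showing the three kinds of blanks:
query = {
    "x": "somevalue",
    "single": None,  # serialized as null: expect one value
    "many": [],      # empty list: expect a list of values
    "nested": {},    # empty hash: expect an object
}

url = build_mqlread_url(query, limit=10000)

# Decode the URL again to check what would actually be sent.
decoded = json.loads(parse_qs(urlparse(url).query)["query"][0])
print(decoded)
```

Decoding the URL again is just a sanity check that Python's None really goes over the wire as a JSON null and that the limit ended up where intended.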

Now that I've figured out how to use it, though, I'll definitely use
Freebase as a source of public data in the future.

On Wed, Sep 14, 2011 at 1:29 PM, web service <wbsrvc@gmail.com> wrote:
> is that catalog shareable ?
>
> On Wed, Sep 14, 2011 at 9:29 AM, eric konsirald <eric.konsirald@gmail.com> wrote:
>
>> Hi,
>>
>> i'm working on an experiment where i have a catalog of movies from IMDB
>>  containing all the metadata for each movie
>> (title/description/year/director/actors/etc...) and i would like to solve
>> the following problem:
>>
>> INPUT: a movie title (or id in imdb)
>> OUTPUT: the most "similar" movies
>>
>> but i have no user base or user activity, just the pure movie items.
>> so by "similar" i mean the movies having the most similar title and/or
>> description and/or director etc...
>>
>> i'm not sure how to build the appropriate global similarity measure. for
>> the description i could e.g. build term vectors containing the most
>> frequent words (using e.g. tf/idf) or use lda, but then i have no clue
>> other than intuition how to attribute e.g. more weight to the similarity
>> between descriptions, or the similarity between actors, or e.g. the same
>> year (approximately) etc.
>>
>> has anyone had to deal with a similar problem, or does anyone have any
>> insights on how to approach it?
>> also, does mahout contain any tools that would help me build such a
>> (weighted) similarity measure and, most importantly, allow me to test
>> whether one similarity is better than another?
>>
>> thanks a lot in advance for any insights
>>
>> Eric
>>
>
