Sebastian,
There needs to be a join of the two row similarity matrices to form
documents.
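A minimal sketch of such a join, assuming each matrix can be read as (row id, column ids) pairs. The function name join_rows and the toy data are made up for illustration; the field names bblinks and balinks follow the earlier message in this thread. No row ordering is assumed, and a row missing from one matrix simply gets an empty field.

```python
from collections import defaultdict


def join_rows(bb_rows, ba_rows):
    """Full outer join of two row-keyed matrices into one document per row id.

    bb_rows, ba_rows: iterables of (row_id, [column ids with a nonzero value]).
    No ordering is assumed; rows missing from one side get an empty list.
    """
    docs = defaultdict(lambda: {"bblinks": [], "balinks": []})
    for row_id, cols in bb_rows:
        docs[row_id]["bblinks"] = list(cols)
    for row_id, cols in ba_rows:
        docs[row_id]["balinks"] = list(cols)
    # Emit one document per row id, sorted by id for reproducible output.
    return [{"id": row_id, **fields} for row_id, fields in sorted(docs.items())]


# Hypothetical toy rows: B'B links point at B ids, B'A links point at A ids.
bb = [(3, [1, 2]), (1, [3])]
ba = [(1, [10, 11]), (2, [12])]
docs = join_rows(bb, ba)
```

This holds everything in memory, so it only illustrates the shape of the output; at scale the same grouping by row id would be done by the m/r job.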
Pat,
What about just updating the document with the fields? Have three passes.
Pass 1 puts the normal metadata for the item in place. Pass 2 updates
with data from B'B. Pass 3 updates with data from B'A.
This will cause the entire index to be rewritten more than necessary, but
it should be fast enough to be a nonissue.
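Roughly, the three passes could look like the sketch below. A plain dict stands in for the index, and update_index is a hypothetical helper, not a Solr call; the sample ids and descriptions are invented. The point is only that each pass touches documents by id, so the order of rows in either matrix does not matter.

```python
def update_index(index, doc_id, fields):
    """Merge fields into the stored document, creating it if it is missing."""
    index.setdefault(doc_id, {"id": doc_id}).update(fields)


index = {}  # stand-in for the search index, keyed by document id

# Hypothetical inputs: item metadata plus rows of B'B and B'A.
metadata = {"b1": {"description": "first item"},
            "b2": {"description": "second item"}}
bb_rows = {"b1": ["b2"], "b2": ["b1"]}
ba_rows = {"b2": ["a7", "a9"]}

# Pass 1: put the normal metadata for each item in place.
for doc_id, meta in metadata.items():
    update_index(index, doc_id, meta)

# Pass 2: update with data from B'B.
for doc_id, cols in bb_rows.items():
    update_index(index, doc_id, {"bblinks": cols})

# Pass 3: update with data from B'A.
for doc_id, cols in ba_rows.items():
    update_index(index, doc_id, {"balinks": cols})
```

Item b1 has no row in B'A here, so it ends up without a balinks field, which is the missing-row case that makes simple simultaneous iteration awkward.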
On other fronts, I got musicbrainz downloaded over the weekend and have
figured out the schema enough so that I think I can produce recording,
artist and tag information. From that, I can simulate user behavior and
produce logs to push into the demo system. That will allow realistic scale
and will allow users to explore the system in terms that they understand.
There is still a question of whether we can redistribute the musicbrainz
data, but I think I can arrange it so that anybody who wants to run the
demo will just download the necessary data themselves. I may host a
derived data product myself to simplify that process.
On Mon, Aug 5, 2013 at 10:59 AM, Sebastian Schelter <ssc@apache.org> wrote:
> I still don't understand why we need to rely on docids. If we simply index
> that row A is similar to rows B, C and D that should be fine, or am I
> wrong?
>
> 2013/8/5 Pat Ferrel <pat@occamsmachete.com>
>
> > I think m/r join is the best solution, too many assumptions otherwise. I
> > thought Ted wanted a non-m/r implementation, but oh, well, mostly
> > non-m/r.
> > Is there a good example to start from in Mahout?
> >
> > Yes, one id field per doc. The problem is not storing, it is joining rows
> > from two DRMs by simple iteration.
> >
> > On Aug 5, 2013, at 10:27 AM, Sebastian Schelter <ssc@apache.org> wrote:
> >
> > If you use the same partitioning and number of reducers for creating the
> > outputs, the output should have the same number of sequence files and each
> > sequence file should have the same keys in descending order. I don't
> > understand why the ordering is a problem, can we not store the row index as
> > a field in solr?
> >
> > 2013/8/5 Ted Dunning <ted.dunning@gmail.com>
> >
> > > A quick mapreduce program should be able to join these matrices and
> > > produce documents ready to index.
> > >
> > >
> > > On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
> > >
> > >> In writing the similarity matrices to Solr there is a bit of a problem.
> > >> The matrices exist in two DRMs. The rows correspond to the doc IDs. As far
> > >> as I know there is no guarantee that the ids of both matrices are in the
> > >> same descending order.
> > >>
> > >> The easiest solution is to have an index for [B'B] and one for [B'A]. That
> > >> means two or perhaps three queries for cross-recommendations, which is not
> > >> ideal.
> > >>
> > >> First I'm going to create two collections of docs with different field
> > >> ids. This should work and we can merge them later.
> > >>
> > >> Next we can do some m/r to group the docs by id so there is one collection
> > >> (csv) with one line per doc.
> > >>
> > >> Alternatively it is possible that the DRMs can be iterated
> > >> simultaneously, which would also solve the problem. It assumes the order in
> > >> both DRMs is the same, descending by Key = item ID. Even if a row is
> > >> missing in one or the other this would work.
> > >>
> > >> Does anyone know if the DRMs are guaranteed to have row ordering by Key?
> > >> RSJ creates [B'B] and matrix multiply creates [B'A].
> > >>
> > >>
> > >> On Aug 2, 2013, at 11:14 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > >>
> > >> Yes. We need two different sets of documents if the row spaces of the
> > >> cross/cooccurrence matrices are different, as is the case with A'B and B'B.
> > >>
> > >> This could mean two indexes.
> > >>
> > >> Or a single index with a special field to indicate what type of record you
> > >> have.
> > >>
> > >>
> > >> On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
> > >>
> > >>> Thanks, well put.
> > >>>
> > >>> In order to have the ultimate impl with two id spaces for A and B would we
> > >>> have to create different docs for A'B and B'B? Since the doc IDs must come
> > >>> from A or B? The fields can contain different sets of IDs but the Doc ID
> > >>> must be one or the other, right? Doesn't this imply separate indexes for
> > >>> the separate A, B item ID spaces? This is not a question for this first
> > >>> cut impl but is a generalization question.
> > >>>
> > >>> On Aug 2, 2013, at 2:06 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > >>>
> > >>> So there is a lot of good discussion here and there were some key ideas.
> > >>>
> > >>> The first idea is that the *input* to a recommender is on the right in the
> > >>> matrix notation. This refers inherently to the id's on the columns of the
> > >>> recommender product (either B'B or B'A). The columns are defined by the
> > >>> right hand element of the product (either B or A in the B'B and B'A
> > >>> respectively).
> > >>>
> > >>> The results are in the row space and are defined by the left hand operand
> > >>> of the product. In the case of B'A and B'B, the left hand operand is B in
> > >>> both cases so the row space is consistent.
> > >>>
> > >>> In order to implement this in a search engine, we need documents that
> > >>> correspond to rows of B'A or B'B. These are the same as the columns of B.
> > >>> The fields of the documents will necessarily include the following:
> > >>>
> > >>> id: the column id from B corresponding to this item
> > >>> description: presentation info ... yada yada
> > >>> balinks: contents of this row of B'A expressed as id's from the column
> > >>>     space of A where this row of llrfilter(B'A) contains a
> > >>>     nonzero value.
> > >>> bblinks: contents of this row of B'B expressed as id's from the column
> > >>>     space of B ...
> > >>>
> > >>>
> > >>> The following operations are now single queries:
> > >>>
> > >>> get an item where id = x
> > >>> query is [id:x]
> > >>>
> > >>> recommend based on behavior with regard to A items and actions h_a
> > >>> query is [balinks: h_a]
> > >>>
> > >>> recommend based on behavior with regard to B items and actions h_b
> > >>> query is [bblinks: h_b]
> > >>>
> > >>> recommend based on a single item with id = x
> > >>> query is [bblinks: x]
> > >>>
> > >>> recommend based on composite behavior composed of h_a and h_b
> > >>> query is [balinks: h_a bblinks: h_b]
> > >>>
> > >>> Does this make sense by being more explicit?
> > >>>
> > >>> Now, it is pretty clear that we could have an index of A objects as well
> > >>> but the link fields would have to be aalinks and ablinks, of course.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
> > >>>
> > >>>> Assuming Ted needs to call it, not sure if an invite has gone out, I
> > >>>> haven't seen one.
> > >>>>
> > >>>> On Aug 2, 2013, at 12:49 PM, B Lyon <bradflyon@gmail.com> wrote:
> > >>>>
> > >>>> I am planning on sitting in as flaky connection allows.
> > >>>> On Aug 2, 2013 3:21 PM, "Pat Ferrel" <pat.ferrel@gmail.com> wrote:
> > >>>>
> > >>>>> We doing a hangout at 2 on the Solr recommender?
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >>
> > >
> >
> >
>
