lucene-solr-user mailing list archives

From Szűcs Roland <szucs.rol...@bookandwalk.hu>
Subject Re: MoreLikeThisHandler with multiple input documents
Date Wed, 30 Sep 2015 08:21:22 GMT
Hi Alessandro,

Exactly. The response time varies, but here is another concrete example.
This is my call: http://localhost:8983/solr/bandwpl/mlt?q=id:10812&fl=id
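
For anyone who wants to reproduce this call from a script, here is a minimal sketch in Python. The helper name build_mlt_url is hypothetical; the host, the core name bandwpl, and the q and fl parameters are taken from the call above.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url: str, core: str, doc_id: str, fields: str = "id") -> str:
    """Build a MoreLikeThisHandler URL for a single source document."""
    params = {"q": f"id:{doc_id}", "fl": fields}
    # safe=":" leaves the colon in q=id:10812 unescaped, matching the call above.
    return f"{base_url}/{core}/mlt?{urlencode(params, safe=':')}"


url = build_mlt_url("http://localhost:8983/solr", "bandwpl", "10812")
print(url)
```

The URL is built from a parameter dictionary, so further mlt.* parameters could be added in one place if the handler defaults ever need overriding.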

This is my result:

{
  "responseHeader":{
    "status":0,
    "QTime":6232},
  "response":{"numFound":4564,"start":0,"docs":[
      {
        "id":"11335"},
      {
        "id":"14984"},
      {
        "id":"13948"},
      {
        "id":"11105"},
      {
        "id":"12122"},
      {
        "id":"12315"},
      {
        "id":"19145"},
      {
        "id":"11843"},
      {
        "id":"11640"},
      {
        "id":"19053"}]
  },
  "interestingTerms":[
    "content:hinduski",1.0,
    "content:hindus",1.0174515,
    "content:głowa",1.0453196,
    "content:życie",1.0666888,
    "content:czas",1.0824177,
    "content:kobieta",1.0927386,
    "content:indie",1.119314,
    "content:quentin",1.1349105,
    "content:madras",1.239089,
    "content:musieć",1.2626213,
    "content:matka",1.2966589,
    "content:chcieć",1.299024,
    "content:domu",1.3370595,
    "content:stać",1.4053295,
    "content:sari",1.4284334,
    "content:ojciec",1.4596463,
    "content:lindsay",1.5857035,
    "content:wiedzieć",1.6952671,
    "content:powiedzieć",1.8430523,
    "content:baba",1.8915937,
    "content:mieć",2.1113522,
    "content:Nata",2.4373012,
    "content:Gopal",2.518996,
    "content:david",3.0211911,
    "content:Trixie",7.082156]}
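
The interestingTerms list above is flat, alternating term and score, so a client has to pair the entries up before using them (for a tag cloud, for instance). Below is a minimal sketch of that step in Python, assuming the JSON layout shown above. The function name parse_mlt_response is hypothetical, and the sample string is an abbreviated copy of the response, not a new result.

```python
import json


def parse_mlt_response(raw: str):
    """Return (qtime_ms, doc_ids, terms) from a MoreLikeThisHandler JSON response."""
    data = json.loads(raw)
    qtime_ms = data["responseHeader"]["QTime"]  # Solr reports QTime in milliseconds
    doc_ids = [doc["id"] for doc in data["response"]["docs"]]
    flat = data.get("interestingTerms", [])
    # The list alternates term, score, term, score, ...
    pairs = list(zip(flat[0::2], flat[1::2]))
    # Sort with the highest score first.
    terms = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return qtime_ms, doc_ids, terms


# Abbreviated copy of the response above, for illustration only.
sample = """{
  "responseHeader": {"status": 0, "QTime": 6232},
  "response": {"numFound": 4564, "start": 0,
               "docs": [{"id": "11335"}, {"id": "14984"}]},
  "interestingTerms": ["content:hinduski", 1.0,
                       "content:Gopal", 2.518996,
                       "content:Trixie", 7.082156]
}"""

qtime_ms, doc_ids, terms = parse_mlt_response(sample)
print(qtime_ms, doc_ids, terms[0])
```

Since QTime is in milliseconds, the value 6232 above is about 6.2 seconds for this single request.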


Cheers,

Roland


2015-09-30 10:16 GMT+02:00 Alessandro Benedetti <benedetti.alex85@gmail.com>:

> I still don't see why you mention the number of documents.
> If you have 5600 Polish books but only run MLT when a user lands on the
> page of a specific book, then I think I am still missing the point.
> Does MLT on a single Polish book take 7 seconds?
>
>
> 2015-09-30 9:10 GMT+01:00 Szűcs Roland <szucs.roland@bookandwalk.hu>:
>
> > Hi Alessandro,
> >
> > You are right. I forgot to mention one important factor. For 3000
> > Hungarian e-books the approach you mentioned is absolutely fine, as the
> > response time is about 0.7 sec. But when I use the same MLT for 5600
> > Polish e-books the response time is 7 sec, which is definitely not
> > acceptable for the users.
> >
> > Regards,
> > Roland
> >
> > 2015-09-29 17:19 GMT+02:00 Alessandro Benedetti <benedetti.alex85@gmail.com>:
> >
> > > Hi Roland,
> > > you said "The main goal is that when a customer is on the product page".
> > > But if you are on a product page, I guess you have the product ID.
> > > If you have the product ID, you can simply execute the MLT request
> > > with the single doc ID as input.
> > >
> > > Why do you need to calculate beforehand?
> > >
> > > Cheers
> > >
> > > 2015-09-29 15:44 GMT+01:00 Szűcs Roland <szucs.roland@bookandwalk.hu>:
> > >
> > > > Hello Upayavira,
> > > >
> > > > The main goal is that when a customer is on the product page of an
> > > > e-book and does not like it for some reason, I want to immediately
> > > > offer her/him alternative e-books on the same topic. If I expect the
> > > > customer to click on a button like "similar e-books", I lose half of
> > > > them, as they are too lazy to click anywhere. So I would like to
> > > > present the alternative e-books on the product pages without any
> > > > clicking.
> > > >
> > > > I assumed the best idea was to calculate the similar e-books for every
> > > > e-book against all the others (n*(n-1) similarity calculations) and
> > > > present only the top 5. I planned to do it when our server is not
> > > > busy. At this point I found the description of MLT as a search
> > > > component, which seemed to be a good candidate, as it calculates the
> > > > similar documents for the whole result set of the query. So if I say
> > > > q=*:* and the MLT component is enabled, I get similar documents for my
> > > > entire document set. The only problem with this approach was that the
> > > > MLT search component does not give back the interesting terms for my
> > > > tag cloud calculation.
> > > >
> > > > That's why I tried to mix the flexibility of the MLT component
> > > > (multiple docs accepted as input) with the robustness of the
> > > > MoreLikeThisHandler (which returns interesting terms).
> > > >
> > > > If there is no solution, I will use the MLT component and solve the
> > > > tag cloud calculation some other way. By the way, if I am not
> > > > mistaken, version 5.3.1 takes the union of the feature sets of the
> > > > MLT component and the handler.
> > > >
> > > > Best Regards,
> > > > Roland
> > > >
> > > >
> > > >
> > > > 2015-09-29 14:38 GMT+02:00 Upayavira <uv@odoko.co.uk>:
> > > >
> > > > > Let's take a step back. So, you have 3000 or so docs, and you want
> > > > > to know which documents are similar to these.
> > > > >
> > > > > Why do you want to know this? What feature do you need to build
> > > > > that will use that information? Knowing this may help us to arrive
> > > > > at the right technology for you.
> > > > >
> > > > > For example, you might want to investigate offline clustering
> > > > > algorithms (e.g. [1], which might be a bit dense to follow). A good
> > > > > book on machine learning, if you are okay with Python, is
> > > > > "Programming Collective Intelligence", as it explains the usual
> > > > > algorithms with simple for loops, making it very clear.
> > > > >
> > > > > Or, you could do searches, and then cluster the results at search
> > > > > time (so if you search for 100 docs, it will identify clusters
> > > > > within those 100 matching documents). That might get you there.
> > > > > See [2].
> > > > >
> > > > > So, if you let us know what the end-goal is, perhaps we can suggest
> > > > > an alternative approach, rather than burying ourselves neck-deep in
> > > > > MLT problems.
> > > > >
> > > > > Upayavira
> > > > >
> > > > > [1] http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> > > > > [2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
> > > > >
> > > > > On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > > > > > Hello Upayavira,
> > > > > >
> > > > > > Thanks for dealing with my issue. I have already applied
> > > > > > termVectors=true to all fields involved in the more like this
> > > > > > calculation. I have just 3 000 documents, each of them represented
> > > > > > by a relatively big term vector with more than 20 000 unique
> > > > > > terms. If I run the more like this handler for a Solr doc, it
> > > > > > takes close to 1 sec to get back the first 10 similar documents.
> > > > > > After this I have to pass the doc IDs to my other application,
> > > > > > which finds the cover of the e-book and other metadata and puts
> > > > > > them on the web. The end-to-end process takes too much time from
> > > > > > the customer's perspective, which is why I tried to find a
> > > > > > solution for offline more like this calculation. But if my app
> > > > > > has to call the MoreLikeThisHandler for each doc, it adds
> > > > > > overhead to the offline calculation.
> > > > > >
> > > > > > Best Regards,
> > > > > > Roland
> > > > > >
> > > > > > 2015-09-29 13:01 GMT+02:00 Upayavira <uv@odoko.co.uk>:
> > > > > >
> > > > > > > If MoreLikeThis is slow for large documents that are indexed,
> > > > > > > have you enabled term vectors on the similarity fields?
> > > > > > >
> > > > > > > Basically, what more like this does is this:
> > > > > > >
> > > > > > > * decide which terms in the source doc are "interesting", and
> > > > > > >   pick the 25 most interesting ones
> > > > > > > * build and execute a boolean query using these interesting
> > > > > > >   terms.
> > > > > > >
> > > > > > > Looking at the first phase of this in more detail:
> > > > > > >
> > > > > > > If you pass in a document using stream.body, it will analyse
> > > > > > > this document into terms, and then calculate the most
> > > > > > > interesting terms from that.
> > > > > > >
> > > > > > > If you reference a document in your index with a field that is
> > > > > > > stored, it will take the stored version, analyse it, and
> > > > > > > identify the interesting terms from there.
> > > > > > >
> > > > > > > If, however, you have stored term vectors against that field,
> > > > > > > this work is not needed. You have already done much of the
> > > > > > > work, and the identification of your "interesting terms" will
> > > > > > > be much faster.
> > > > > > >
> > > > > > > Thus, on the content field of your documents, add
> > > > > > > termVectors="true" in your schema, and re-index. Then you could
> > > > > > > well find MLT becoming a lot more efficient.
> > > > > > >
> > > > > > > Upayavira
> > > > > > >
> > > > > > > On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > > > > > > > Hi Alessandro,
> > > > > > > >
> > > > > > > > My original goal was to get offline suggestions based on
> > > > > > > > content similarity for every e-book we have. We wanted to run
> > > > > > > > a bulk more like this calculation in the evening, when the
> > > > > > > > usage of our site is low and we submit a new e-book.
> > > > > > > > Real-time more like this can take a while, as we typically
> > > > > > > > have long documents (2-5 MB of text) with all the content
> > > > > > > > indexed.
> > > > > > > >
> > > > > > > > When we upload a new document we wanted to recalculate the
> > > > > > > > more like this suggestions and the tf-idf based tag clouds.
> > > > > > > > Both of them are delivered by the MoreLikeThisHandler, but
> > > > > > > > only for one document, as you wrote.
> > > > > > > >
> > > > > > > > The text input is not good for us because we need the similar
> > > > > > > > doc list for each of the matched documents. If I put together
> > > > > > > > the text of 10 documents, I cannot separate which suggestion
> > > > > > > > relates to which matched document, and the tag cloud will
> > > > > > > > also belong to the mixed text.
> > > > > > > >
> > > > > > > > Most likely we will use the MoreLikeThisHandler for each of
> > > > > > > > the documents, parse the JSON response, and store the result
> > > > > > > > in a SQL database.
> > > > > > > >
> > > > > > > > Thanks for your help.
> > > > > > > >
> > > > > > > > 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti <benedetti.alex85@gmail.com>:
> > > > > > > >
> > > > > > > > > Hi Roland,
> > > > > > > > > what is your exact requirement?
> > > > > > > > > Do you basically want to build a "description" for a set of
> > > > > > > > > documents and then find documents in the index that are
> > > > > > > > > similar to this description?
> > > > > > > > >
> > > > > > > > > By default, based on my experience (and on the code), this is
> > > > > > > > > the entry point for the Lucene More Like This:
> > > > > > > > >
> > > > > > > > > > org.apache.lucene.queries.mlt.MoreLikeThis
> > > > > > > > > >
> > > > > > > > > > /**
> > > > > > > > > >  * Return a query that will return docs like the passed lucene document ID.
> > > > > > > > > >  *
> > > > > > > > > >  * @param docNum the documentID of the lucene doc to generate the 'More Like This" query for.
> > > > > > > > > >  * @return a query that will return docs like the passed lucene document ID.
> > > > > > > > > >  */
> > > > > > > > > > public Query like(int docNum) throws IOException {
> > > > > > > > > >   if (fieldNames == null) {
> > > > > > > > > >     // gather list of valid fields from lucene
> > > > > > > > > >     Collection<String> fields = MultiFields.getIndexedFields(ir);
> > > > > > > > > >     fieldNames = fields.toArray(new String[fields.size()]);
> > > > > > > > > >   }
> > > > > > > > > >   return createQuery(retrieveTerms(docNum));
> > > > > > > > > > }
> > > > > > > > >
> > > > > > > > > It means that, talking about "documents", you can feed in
> > > > > > > > > only one Solr doc.
> > > > > > > > >
> > > > > > > > > But you can also feed the MLT with simple text.
> > > > > > > > >
> > > > > > > > > So you should study your use case more closely and
> > > > > > > > > understand which option fits better:
> > > > > > > > >
> > > > > > > > > 1) customising the MLT component, starting from Lucene
> > > > > > > > >
> > > > > > > > > 2) doing some processing on the client side and using the
> > > > > > > > >    "text" similarity feature.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2015-09-29 10:05 GMT+01:00 Roland Szűcs <roland.szucs@bookandwalk.com>:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Is it possible to feed multiple Solr IDs to a MoreLikeThisHandler?
> > > > > > > > > >
> > > > > > > > > > <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
> > > > > > > > > >   <lst name="defaults">
> > > > > > > > > >     <str name="mlt.match.include">false</str>
> > > > > > > > > >     <str name="mlt.interestingTerms">details</str>
> > > > > > > > > >     <str name="mlt.fl">title,content</str>
> > > > > > > > > >     <str name="mlt.minwl">4</str>
> > > > > > > > > >     <str name="mlt.qf">title^12 content^1</str>
> > > > > > > > > >     <str name="mlt.mintf">2</str>
> > > > > > > > > >     <int name="mlt.count">10</int>
> > > > > > > > > >     <str name="mlt.boost">true</str>
> > > > > > > > > >     <str name="wt">json</str>
> > > > > > > > > >     <str name="indent">true</str>
> > > > > > > > > >   </lst>
> > > > > > > > > > </requestHandler>
> > > > > > > > > >
> > > > > > > > > > When I call this:
> > > > > > > > > > http://localhost:8983/solr/bandwhu/mlt?q=id:8&fl=id
> > > > > > > > > > it works fine. Is there any way to make a kind of "bulk"
> > > > > > > > > > call to the more like this handler? I need the interesting
> > > > > > > > > > terms as well, and as far as I know, if I use more like
> > > > > > > > > > this as a search component it does not return them, so it
> > > > > > > > > > is not an alternative.
> > > > > > > > > >
> > > > > > > > > > Thanks in advance,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Roland Szűcs
> > > > > > > > > > Connect with me on Linkedin <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > > > > > CEO
> > > > > > > > > > Phone: +36 1 210 81 13
> > > > > > > > > > Bookandwalk.hu <https://bookandwalk.hu/>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > --------------------------
> > > > > > > > >
> > > > > > > > > Benedetti Alessandro
> > > > > > > > > Visiting card - http://about.me/alessandro_benedetti
> > > > > > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > > > > > > >
> > > > > > > > > "Tyger, tyger burning bright
> > > > > > > > > In the forests of the night,
> > > > > > > > > What immortal hand or eye
> > > > > > > > > Could frame thy fearful symmetry?"
> > > > > > > > >
> > > > > > > > > William Blake - Songs of Experience -1794 England
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
>
>
>
>



-- 
Szűcs Roland
Connect with me on LinkedIn <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
CEO
Phone: +36 1 210 81 13
Bookandwalk.hu <https://bookandwalk.hu/>
