lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: Get distinct results in Solr
Date Tue, 01 Sep 2015 13:57:50 GMT
Hi Upayavira,

Yes, I tried with a completely new index. I found that once I add the
line below to my /update handler in solrconfig.xml, indexing stops
working:
<str name="update.chain">dedupe</str>

Deletions from the index also stop working when this line is added.

Regards,
Edwin
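
A minimal sketch of the dedupe chain from this thread, for comparison. It
writes the signature to the separate signature field, as Upayavira
suggested. The two trailing processors (LogUpdateProcessorFactory and
RunUpdateProcessorFactory) are not in the configuration quoted below; they
appear in the example on the De-Duplication page, and adding them here is
an assumption about why nothing was indexed, not something this thread
confirms. A custom chain that omits RunUpdateProcessorFactory never applies
updates to the index, which would also be consistent with deletions failing.

```xml
<!-- solrconfig.xml: sketch only; names follow the thread -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- write the hash to its own field, not to id -->
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <!-- assumption: these two are missing from the quoted configuration -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With the signature in its own single-valued string field, results could
then be collapsed with a filter query such as fq={!collapse field=signature},
along the lines Jan and Alexandre suggest.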




On 1 September 2015 at 21:15, Upayavira <uv@odoko.co.uk> wrote:

> Have you tried with a completely clean index? Are you deduping, or just
> calculating the signature? Is it possible dedup is preventing your
> documents from indexing (because it thinks they are dups)?
>
> On Tue, Sep 1, 2015, at 09:46 AM, Zheng Lin Edwin Yeo wrote:
> > Hi Upayavira,
> >
> > I've tried changing <str name="signatureField">id</str> to
> > <str name="signatureField">signature</str>, but still nothing is
> > indexed into Solr. Is that what you meant?
> >
> > Besides that, I've also included a copyField to copy the content field
> > into the signature field. In both versions (with and without copyField)
> > nothing is indexed into Solr.
> >
> > Regards,
> > Edwin
> >
> >
> > On 1 September 2015 at 15:48, Upayavira <uv@odoko.co.uk> wrote:
> >
> > > You are attempting to write your signature to your ID field. That's not
> > > a good idea. You are generating your signature from the content field,
> > > which seems okay. Change your <str name="signatureField">id</str> to be
> > > your 'signature' field instead of id, and something different will
> > > happen :-)
> > >
> > > Upayavira
> > >
> > > On Tue, Sep 1, 2015, at 04:34 AM, Zheng Lin Edwin Yeo wrote:
> > > > I tried to follow the de-duplication guide, but after I configured it in
> > > > solrconfig.xml and schema.xml, nothing is indexed into Solr, and there is
> > > > no error message. I'm using SimplePostTool to index rich-text documents.
> > > >
> > > > Below are my configurations:
> > > >
> > > > In solrconfig.xml
> > > >
> > > >   <requestHandler name="/update" class="solr.UpdateRequestHandler">
> > > >     <lst name="defaults">
> > > >       <str name="update.chain">dedupe</str>
> > > >     </lst>
> > > >   </requestHandler>
> > > >
> > > >   <updateRequestProcessorChain name="dedupe">
> > > >     <processor class="solr.processor.SignatureUpdateProcessorFactory">
> > > >       <bool name="enabled">true</bool>
> > > >       <str name="signatureField">id</str>
> > > >       <bool name="overwriteDupes">false</bool>
> > > >       <str name="fields">content</str>
> > > >       <str name="signatureClass">solr.processor.Lookup3Signature</str>
> > > >     </processor>
> > > >   </updateRequestProcessorChain>
> > > >
> > > >
> > > > In schema.xml
> > > >
> > > >   <field name="signature" type="string" stored="true" indexed="true" multiValued="false" />
> > > >
> > > >
> > > > Is there anything I might have missed or done wrong?
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > > On 1 September 2015 at 10:46, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > > >
> > > > > Thank you for your advice Alexandre.
> > > > >
> > > > > Will try out the de-duplication from the link you gave.
> > > > >
> > > > > Regards,
> > > > > Edwin
> > > > >
> > > > >
> > > > > On 1 September 2015 at 10:34, Alexandre Rafalovitch <arafalov@gmail.com> wrote:
> > > > >
> > > > >> Re-read the question. You want to de-dupe on the full text-content.
> > > > >>
> > > > >> I would actually try to use the dedupe chain as per the link I gave
> > > > >> but put results into a separate string field. Then, you group on that
> > > > >> field. You cannot actually group on the long text field, that would
> > > > >> kill any performance. So a signature is your proxy.
> > > > >>
> > > > >> Regards,
> > > > >>    Alex
> > > > >> ----
> > > > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > > >> http://www.solr-start.com/
> > > > >>
> > > > >>
> > > > >> On 31 August 2015 at 22:26, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > > > >> > Hi Alexandre,
> > > > >> >
> > > > >> > Will treating it as String affect the search or other functions like
> > > > >> > highlighting?
> > > > >> >
> > > > >> > Yes, the content must be in my index, unless I do a copyField to do
> > > > >> > de-duplication on that field. Will that help?
> > > > >> >
> > > > >> > Regards,
> > > > >> > Edwin
> > > > >> >
> > > > >> >
> > > > >> > On 1 September 2015 at 10:04, Alexandre Rafalovitch <arafalov@gmail.com> wrote:
> > > > >> >
> > > > >> >> Can't you just treat it as String?
> > > > >> >>
> > > > >> >> Also, do you actually want those documents in your index in the first
> > > > >> >> place? If not, have you looked at De-duplication:
> > > > >> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication
> > > > >> >>
> > > > >> >> Regards,
> > > > >> >>    Alex.
> > > > >> >> ----
> > > > >> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > > >> >> http://www.solr-start.com/
> > > > >> >>
> > > > >> >>
> > > > >> >> On 31 August 2015 at 22:00, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > > > >> >> > Thanks Jan.
> > > > >> >> >
> > > > >> >> > But I read that the field being collapsed on must be a single-valued
> > > > >> >> > String, Int or Float. As I'm required to get the distinct results
> > > > >> >> > from the "content" field that was indexed from a rich text document,
> > > > >> >> > I got the following error:
> > > > >> >> >
> > > > >> >> >   "error":{
> > > > >> >> >     "msg":"java.io.IOException: 64 bit numeric collapse fields are not supported",
> > > > >> >> >     "trace":"java.lang.RuntimeException: java.io.IOException: 64 bit numeric collapse fields are not supported\r\n\tat
> > > > >> >> >
> > > > >> >> >
> > > > >> >> > Is it possible to collapse on fields which hold a large amount of
> > > > >> >> > data, like content from a rich text document?
> > > > >> >> >
> > > > >> >> > Regards,
> > > > >> >> > Edwin
> > > > >> >> >
> > > > >> >> >
> > > > >> >> > On 31 August 2015 at 18:59, Jan Høydahl <jan.asf@cominvent.com> wrote:
> > > > >> >> >
> > > > >> >> >> Hi
> > > > >> >> >>
> > > > >> >> >> Check out the CollapsingQParser
> > > > >> >> >> (https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results).
> > > > >> >> >> As long as you have a field that will be the same for all duplicates,
> > > > >> >> >> you can “collapse” on that field. If you do not have a “group id”, you
> > > > >> >> >> can create one using e.g. an MD5 signature of the identical body text
> > > > >> >> >> (https://cwiki.apache.org/confluence/display/solr/De-Duplication).
> > > > >> >> >>
> > > > >> >> >> --
> > > > >> >> >> Jan Høydahl, search solution architect
> > > > >> >> >> Cominvent AS - www.cominvent.com
> > > > >> >> >>
> > > > >> >> >> > On 31 Aug 2015, at 12:03, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > > > >> >> >> >
> > > > >> >> >> > Hi,
> > > > >> >> >> >
> > > > >> >> >> > I'm using Solr 5.2.1, and I would like to find out the best way to
> > > > >> >> >> > get Solr to return only distinct results.
> > > > >> >> >> >
> > > > >> >> >> > Currently, I've indexed several identical documents into Solr, with
> > > > >> >> >> > just a different id and title, but the content is exactly the same.
> > > > >> >> >> > When I do a search, Solr returns all of these documents, so the same
> > > > >> >> >> > content appears several times in the list.
> > > > >> >> >> >
> > > > >> >> >> > What is the most suitable way to get Solr to return only one of the
> > > > >> >> >> > documents during the search?
> > > > >> >> >> > I understand that there is result grouping and faceting, but I'm not
> > > > >> >> >> > sure if that is the best way.
> > > > >> >> >> >
> > > > >> >> >> > Regards,
> > > > >> >> >> > Edwin
