From: Jimmy Sélamy
Date: Mon, 12 Nov 2012 09:14:01 -0500
Subject: Re: Optimize facets when actually single valued?
To: Erick Erickson
Cc: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org

Hi,

The Solr version is 3.6.1.

Here is my query. It is rather large, but I absolutely need all of this in my response.

====================================
q=*:*
fq=language_code:("fr_CA") AND acl_name:(cch_CP_AP_Archives OR cch_archive_content OR cch_browse_official_feed_folder OR cch_folder_acl OR cch_official_feed_content OR cch_official_press_release_acl OR cch_published_story OR cch_pubpage_folder_acl OR cch_raw_content OR cch_restricted_rights_content OR cch_sched_acl OR cch_schedule_acl OR cch_source_acl OR cch_wire_feeds_acl) AND feed_type:("WF" OR "OF" OR "RW")&
fq=((type:("cch_published_story" OR "cch_story") AND language_code:("fr_CA") AND acl_name:(cch_CP_AP_Archives OR cch_archive_content OR cch_browse_official_feed_folder OR cch_folder_acl OR cch_official_feed_content OR cch_official_press_release_acl OR cch_published_story OR cch_pubpage_folder_acl OR cch_raw_content OR cch_restricted_rights_content OR cch_sched_acl OR cch_schedule_acl OR cch_source_acl OR cch_wire_feeds_acl) AND feed_type:("WF" OR "OF" OR "RW")) OR (type:("cch_photo") AND mfile_url:([* TO *]) AND acl_name:(cch_CP_AP_Archives OR cch_archive_content OR cch_browse_official_feed_folder OR cch_folder_acl OR cch_official_feed_content OR cch_official_press_release_acl OR cch_published_story OR cch_pubpage_folder_acl OR cch_raw_content OR cch_restricted_rights_content OR cch_sched_acl OR cch_schedule_acl OR cch_source_acl OR cch_wire_feeds_acl) AND feed_type:("WF" OR "OF" OR "RW")))&
rows=0&start=0&

facet.sort=count&
facet.field=source_id&
*facet.field=facet_tme_person_name_french&*
*facet.field=facet_tme_geographic_location_french&*
*facet.field=facet_tme_iptc_category&*
*facet.field=facet_tme_organization_name_french&*
facet.field=feed_type&

f.source_id.facet.limit=-1&
f.source_id.facet.mincount=1&
f.facet_tme_person_name_french.facet.limit=25&
f.facet_tme_person_name_french.facet.mincount=1&
f.facet_tme_geographic_location_french.facet.limit=25&
f.facet_tme_geographic_location_french.facet.mincount=1&
f.facet_tme_iptc_category.facet.limit=25&
f.facet_tme_iptc_category.facet.mincount=1&
f.facet_tme_organization_name_french.facet.limit=25&
f.facet_tme_organization_name_french.facet.mincount=1&
f.feed_type.facet.limit=25&
f.feed_type.facet.mincount=1&
facet.range=r_creation_date1&
facet.range=r_creation_date2&
facet.range=r_creation_date3&
facet.range=r_creation_date4&
f.r_creation_date1.facet.range.start=NOW-1HOUR&
f.r_creation_date1.facet.range.end=NOW&
f.r_creation_date1.facet.range.gap=+1HOUR&
f.r_creation_date2.facet.range.start=NOW-24HOUR&
f.r_creation_date2.facet.range.end=NOW&
f.r_creation_date2.facet.range.gap=+24HOUR&
f.r_creation_date3.facet.range.start=NOW-48HOUR&
f.r_creation_date3.facet.range.end=NOW&
f.r_creation_date3.facet.range.gap=+48HOUR&
f.r_creation_date4.facet.range.start=NOW-7DAY&
f.r_creation_date4.facet.range.end=NOW&
f.r_creation_date4.facet.range.gap=+7DAY

facet=true
====================================

The fields in bold (marked with asterisks) are the ones I'm having performance issues with.

I've set facet.method=enum, which improves performance, but it is still not acceptable for my application. Below are the timings I logged using the same fq, but with each facet field on its own. Note that only the fields whose names start with "facet" are multivalued.

o Date range facet (681.25 ms)
o Feed type (586.5 ms)
o Categories (898 ms)
o facet_tme_geographic_location_french (1249 ms)
o facet_tme_person_name_french (1940.75 ms)
o facet_tme_organization_name_french (1240.75 ms)

All combined, the query takes 6000 ms.

As for your other questions, such as "How many unique values are there in the field?", I don't know how to get this information.

Jimmy M. Sélamy


2012/11/11 Erick Erickson

> You have to provide more details. How many unique values are there in the
> field in question? What's the query you're using? Are you sure other parts
> of the query aren't the culprit? What Solr version are you using?
>
> Please review:
> http://wiki.apache.org/solr/UsingMailingLists
>
> Best
> Erick
>
>
> On Sat, Nov 10, 2012 at 9:41 PM, Jimmy Sélamy wrote:
>
>> I'm having performance issues with faceting on a multivalued field, with an
>> index of over 20 million documents.
>>
>> When faceting on a multivalued field, the QTime is unacceptable for my
>> application because it can take up to 6000 ms.
>>
>> I've set facet.method to enum, which improved performance to the time I
>> just mentioned. It's still not acceptable.
>>
>> Are there any suggestions?
>>
>> Sent from BlackBerry on the Vidéotron mobile network
>> ------------------------------
>> *From: * Robert Muir
>> *Date: *Sat, 10 Nov 2012 21:33:47 -0500
>> *To: * dev@lucene.apache.org
>> *ReplyTo: * dev@lucene.apache.org
>> *Subject: *Optimize facets when actually single valued?
>>
>> I am guessing at times people are lazy about schema definition. But, I
>> think with Lucene 4 stats we can detect if a field is actually single
>> valued... Something like terms.size == terms.doccount == terms.sumdocfreq.
>> I have to think about it a bit, maybe it's even simpler than this? Anyway,
>> this could be used instead of the actual schema def to just build a
>> fieldcache instead of an uninverted field, I think... Should be a simple
>> opto but maybe potent...
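
The check Robert Muir proposes in the quoted message can be sketched in a few lines. This is only an illustration, not Lucene or Solr code. The function names are hypothetical, and the three arguments are assumed to stand in for the per-field statistics named in the thread (terms.size, terms.doccount, terms.sumdocfreq). Treating a negative value as "statistic not recorded" is also an assumption. The second function is one possible reading of "maybe it's even simpler than this": if the number of (term, document) pairs equals the number of documents that have the field, each of those documents carries exactly one distinct term.

```python
def matches_thread_condition(num_terms, doc_count, sum_doc_freq):
    """Condition as written in the thread: size == doccount == sumdocfreq.

    num_terms    -- number of distinct terms in the field (hypothetical input)
    doc_count    -- number of documents with at least one term in the field
    sum_doc_freq -- total number of (term, document) pairs for the field
    """
    stats = (num_terms, doc_count, sum_doc_freq)
    if any(s < 0 for s in stats):
        # Assumption: a negative value means the statistic is unavailable,
        # so fall back to treating the field as multivalued.
        return False
    return num_terms == doc_count == sum_doc_freq


def one_term_per_document(doc_count, sum_doc_freq):
    """Relaxed variant (an assumption, not stated in the thread).

    Every document counted in doc_count contributes at least one pair to
    sum_doc_freq, so equality means exactly one distinct term per document.
    """
    if doc_count < 0 or sum_doc_freq < 0:
        return False
    return doc_count == sum_doc_freq


if __name__ == "__main__":
    # Made-up statistics: 3 distinct values spread over 10 documents,
    # each document holding exactly one value.
    print(matches_thread_condition(3, 10, 10))
    print(one_term_per_document(10, 10))
```

With the made-up numbers in the usage lines, the condition as written in the thread rejects the field because it also requires every term to occur in exactly one document, while the relaxed variant accepts it. Which of the two is appropriate for choosing between a fieldcache and an uninverted field is left open in the thread.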