lucene-dev mailing list archives

From "Jonathan Gonzalez (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-7867) implicit sharded, facet grouping problem with multivalued string field starting with digits
Date Thu, 06 Aug 2015 00:03:04 GMT

    [ https://issues.apache.org/jira/browse/SOLR-7867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659224#comment-14659224 ]

Jonathan Gonzalez edited comment on SOLR-7867 at 8/6/15 12:02 AM:
------------------------------------------------------------------

The problem lies in the docValues attribute: for some reason the .dvd file becomes corrupted
after several incremental feeds (at least in my case). I can reproduce the problem on both
SolrCloud and a standalone instance. To trigger it, the query must include &group.facet=true
and the facet field must be defined with docValues=true.

A short-term fix: disable the docValues attribute (docValues=false).

Field definitions:
{code}
<field name="fieldForGrouping" type="int" indexed="true" stored="false" multiValued="false"
omitNorms="true" termVectors="false" termPositions="false" docValues="false"/>
<field name="fieldForFacet" type="string" indexed="true" stored="true" multiValued="true"
omitNorms="true" termVectors="false" termPositions="false" docValues="true"/>
{code}
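
As a rough illustration of the short-term fix, the sketch below flips docValues to "false" on the facet field definition shown above. The helper name disable_doc_values is hypothetical, and the snippet only edits the XML text; it does not touch a running Solr instance.

```python
import xml.etree.ElementTree as ET

# Facet field definition copied from the comment above.
FIELD_XML = (
    '<field name="fieldForFacet" type="string" indexed="true" stored="true" '
    'multiValued="true" omitNorms="true" termVectors="false" '
    'termPositions="false" docValues="true"/>'
)


def disable_doc_values(field_xml: str) -> str:
    """Return the field definition with docValues forced to "false".

    Hypothetical helper: it rewrites one <field> element and nothing else.
    """
    field = ET.fromstring(field_xml)
    field.set("docValues", "false")
    return ET.tostring(field, encoding="unicode")


patched = disable_doc_values(FIELD_XML)
print(patched)
```

Note that the Solr DocValues documentation says existing content has to be re-indexed after docValues is changed on a field, so editing schema.xml is only the first step.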

Query:
The query uses &group.field=<fieldForGrouping>&group.facet=true and a simple
facet such as:
{code}
&facet.field={!key=FacetKey_12345678%20facet.prefix=12345678}fieldForFacet
{code}
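
For anyone who wants to reproduce this, here is a minimal sketch that assembles the request parameters described above into a query string. The q=*:* clause and the explicit group=true and facet=true flags are assumptions (the comment does not show them), and the function name is hypothetical. No request is sent; the snippet only builds and round-trips the string.

```python
from urllib.parse import urlencode, parse_qs


def build_group_facet_params(group_field, facet_field, prefix):
    """Build the grouped-facet parameters described in the comment."""
    # Local params select the output key and the facet prefix for this field.
    local_params = "{!key=FacetKey_%s facet.prefix=%s}" % (prefix, prefix)
    return [
        ("q", "*:*"),          # assumption: the original q is not shown
        ("group", "true"),     # assumption: grouping must be switched on
        ("group.field", group_field),
        ("group.facet", "true"),
        ("facet", "true"),     # assumption: faceting must be switched on
        ("facet.field", local_params + facet_field),
    ]


params = build_group_facet_params("fieldForGrouping", "fieldForFacet", "12345678")
query_string = urlencode(params)   # percent-encodes braces, spaces, etc.
decoded = parse_qs(query_string)   # decode again to check the round trip
print(query_string)
```

The resulting string can be appended to the /select URL of whichever core or collection shows the problem.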

The following image shows Solr reading the .dvd index file (Per-Document Values .dvd,
.dvm: "Encodes additional scoring factors or other per-document information", https://lucene.apache.org/core/5_2_0/core/org/apache/lucene/codecs/lucene50/Lucene50DocValuesFormat.html),
which is enabled by docValues=true (https://cwiki.apache.org/confluence/display/solr/DocValues):
!ErrorReadingDocValues.PNG!

Then, when trying to read the facet.prefix value from this .dvd file, there is an attempt to
read past the current buffer size, which causes this issue:
!DocValuesException.PNG!
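
To make the "read past the buffer" failure concrete, here is a toy model in Python. It is not Lucene's on-disk format and I have not checked it against Lucene50DocValuesProducer; it only assumes that terms are decoded as (shared prefix length, suffix) pairs into a fixed-size scratch buffer. With that assumption, one bad length is enough to run off the end of the buffer, which in Java would surface as the ArrayIndexOutOfBoundsException seen in readTerm.

```python
def decode_prefix_compressed(entries, buffer_size):
    """Decode (shared_prefix_len, suffix) pairs into full terms.

    Toy model only: each term keeps the first shared_prefix_len bytes of the
    previous term and appends suffix. The scratch buffer has a fixed size,
    as if it had been sized from a maximum term length recorded at write time.
    """
    buffer = bytearray(buffer_size)
    terms = []
    for prefix_len, suffix in entries:
        end = prefix_len + len(suffix)
        if end > buffer_size:
            # A Java array write would fail here with an out-of-bounds error.
            raise IndexError(
                "term needs %d bytes, buffer holds %d" % (end, buffer_size)
            )
        buffer[prefix_len:end] = suffix  # same length, so no resize happens
        terms.append(bytes(buffer[:end]))
    return terms


# Well-formed data: every term fits in the 11-byte buffer.
good = [(0, b"12345678_a"), (8, b"_b"), (8, b"_cd")]
terms = decode_prefix_compressed(good, buffer_size=11)
print(terms)

# "Corrupted" data: the last suffix is longer than the buffer allows.
bad = good + [(8, b"x" * 20)]
try:
    decode_prefix_compressed(bad, buffer_size=11)
except IndexError as exc:
    print("decode failed:", exc)
```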

I hope it helps!



was (Author: jonathan gv):
The problem lies in the docValues attribute: for some reason the .dvd file becomes corrupted
after several incremental feeds. I'm able to reproduce this problem and fix it by disabling
the docValues attribute (docValues=false).



> implicit sharded, facet grouping problem with multivalued string field starting with digits
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7867
>                 URL: https://issues.apache.org/jira/browse/SOLR-7867
>             Project: Solr
>          Issue Type: Bug
>          Components: faceting, SolrCloud
>    Affects Versions: 5.2
>         Environment: 3.13.0-48-generic #80-Ubuntu SMP x86_64 GNU/Linux
> java version "1.7.0_80"
> Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
>            Reporter: Umut Erogul
>              Labels: docValues, facet, group, sharding
>         Attachments: DocValuesException.PNG, ErrorReadingDocValues.PNG
>
>
> related parts @ schema.xml:
> {code}<field name="keyword_ss" type="string" indexed="true" stored="true" docValues="true" multiValued="true"/>
> <field name="author_s" type="string" indexed="true" stored="true" docValues="true"/>{code}
> every document has valid author_s and keyword_ss fields;
> we can make successful facet group queries on single node, single collection, solr-4.9.0 server
> {code}
> q: *:* fq: keyword_ss:3m
> facet=true&facet.field=keyword_ss&group=true&group.field=author_s&group.facet=true
> {code}
> when querying on solr-5.2.0 server with implicit sharded environment with:
> {code}<!-- router.field -->
> <field name="shard_name" type="string" indexed="true" stored="true" required="true"/>{code}
> with example shard names; affinity1 affinity2 affinity3 affinity4
> the same query with same documents gets:
> {code}
> ERROR - 2015-08-04 08:15:15.222; [document affinity3 core_node32 document_affinity3_replica2] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Exception during facet.field: keyword_ss
>         at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:632)
>         at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:617)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at org.apache.solr.request.SimpleFacets$2.execute(SimpleFacets.java:571)
>         at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:642)
> ...
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>         at org.apache.lucene.codecs.lucene50.Lucene50DocValuesProducer$CompressedBinaryDocValues$CompressedBinaryTermsEnum.readTerm(Lucene50DocValuesProducer.java:1008)
>         at org.apache.lucene.codecs.lucene50.Lucene50DocValuesProducer$CompressedBinaryDocValues$CompressedBinaryTermsEnum.next(Lucene50DocValuesProducer.java:1026)
>         at org.apache.lucene.search.grouping.term.TermGroupFacetCollector$MV$SegmentResult.nextTerm(TermGroupFacetCollector.java:373)
>         at org.apache.lucene.search.grouping.AbstractGroupFacetCollector.mergeSegmentResults(AbstractGroupFacetCollector.java:91)
>         at org.apache.solr.request.SimpleFacets.getGroupedCounts(SimpleFacets.java:541)
>         at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:463)
>         at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:386)
>         at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:626)
>         ... 33 more
> {code}
> all the problematic queries are caused by strings starting with digits; ("3m", "8 saniye", "2 broke girls", "1v1y")
> there are some strings for which the query works, like ("24", "90+", "45 dakika")
> we do not observe the problem when querying with
> -keyword_ss:(0-9)*
> updating the problematic documents (a small subset of keyword_ss:(0-9)*) fixes the query,
> but we cannot find an easy way to identify the problematic documents
> there are around 400m docs, separated across 28 shards;
> -keyword_ss:(0-9)* matches 97% of documents
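
One possible way to narrow the search, sketched under assumptions: run the grouped facet query once per candidate value and record which ones raise. The run_query callable is hypothetical and stands in for whatever client actually sends the request to a shard; the fake below just mimics the failing and working values quoted in the report. This narrows things down to terms rather than to individual documents.

```python
def find_failing_values(values, run_query):
    """Run the grouped facet query per value and collect the ones that fail.

    run_query is a hypothetical callable: it takes one keyword value, sends
    the query (for example with fq=keyword_ss:<value>), and raises on error.
    """
    failing = []
    for value in values:
        try:
            run_query(value)
        except Exception as exc:  # any server-side failure counts as a hit
            failing.append((value, str(exc)))
    return failing


# Stand-in for a real client, using the values quoted in the report.
REPORTED_FAILURES = {"3m", "8 saniye", "2 broke girls", "1v1y"}


def fake_run_query(value):
    if value in REPORTED_FAILURES:
        raise RuntimeError("Exception during facet.field: keyword_ss")
    return {"status": 0}


candidates = ["3m", "24", "8 saniye", "90+", "45 dakika", "1v1y"]
failing = find_failing_values(candidates, fake_run_query)
print([value for value, _ in failing])
```

Against a real cluster, run_query could also pin the request to a single shard, so that each failure points at one shard and one term.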


