lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4769) Add a CountingFacetsAggregator which reads ordinals from a cache
Date Tue, 12 Feb 2013 04:37:12 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576367#comment-13576367 ]

Shai Erera commented on LUCENE-4769:
------------------------------------

I didn't propose that we add a DV format, I was saying that if there was one, then a DirectFacets
format would make sense, b/c the app wouldn't need to write special code to work with it ...
it would just return the ints more efficiently.

And we're abusing DV now, just like we abused payloads before, so nothing has changed :).

I did propose on another issue (forgot where, maybe the migration layer issue?) to develop
a FacetsCodec, but you were against it. Perhaps after you worked on DV 2.0 you now think it
makes more sense? It will solve a slew of problems I think.

This FacetsCodec today is mimicked by CategoryListIterator, which exposes that getInts API.
But Mike and I saw that the DV abstraction (getBytes) + CLI (getInts) hurts performance, therefore
the *fast* aggregators / collectors sidestep the CLI abstraction and use only DV. On LUCENE-4764,
Mike sidesteps the DV abstraction too, which results in more duplicated code. I'm all for
those specializations, but it becomes harder to maintain. I just think of all the places we'd
need to change if someone finds a better encoding than gap+vint :).
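
To make the gap+vint point concrete, here is a minimal, self-contained sketch of that
kind of encoding: sorted ordinals are stored as gaps from the previous ordinal, and each
gap is written as a variable-length int. The class and method names are hypothetical, and
the byte layout (low-order 7 bits first) is just the common VInt convention; it is not
claimed to match what the facet module writes. Every specialized aggregator that decodes
inline would carry its own copy of the decode loop.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration class, not part of Lucene.
public class GapVIntSketch {

    // Encode sorted, non-negative ordinals as gaps, each gap written as a VInt.
    static byte[] encode(int[] sortedOrdinals) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrdinals) {
            int gap = ord - prev;
            prev = ord;
            // Emit 7 bits per byte; the high bit marks "more bytes follow".
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    // Decode the bytes back into absolute ordinals.
    static int[] decode(byte[] bytes) {
        int[] ords = new int[bytes.length]; // at most one ordinal per byte
        int count = 0, prev = 0, i = 0;
        while (i < bytes.length) {
            int gap = 0, shift = 0;
            byte b;
            do {
                b = bytes[i++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += gap; // undo the gap encoding
            ords[count++] = prev;
        }
        return Arrays.copyOf(ords, count);
    }
}
```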

Plus, the specialization doesn't serve the different facet features. I.e. if I'm interested
in fast sum-score, I need to write a specialized one. If I'm interested in fast sum-association,
I need to write one. Just to be clear, I'm not complaining and I think it makes sense for
expert apps to write some specialized code. What I am saying is that if we could make the
abstractions FAST, then we'd lower the bar of when apps would need to do that ...

So far, our latest optimizations only pertain to the counting case. It is the common case
and I think it's important that we did that. Perhaps the rest of the API changes improved
the other cases too, but it's clear that if we want to really speed them up, we should specialize
them.

Maybe if we had a FacetsCodec, with CategoryListFormat (an extension to Codec, private to
Facets), then all facet features would benefit from LUCENE-4764 and this issue out of the box.
Because that format will expose what facets need - a getInts API. And if we make this one
a Codec and FastDV a Codec, then we force the app to declare a special facets Codec anyway,
so at least from that aspect, we won't require more ...
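
As a rough sketch of what "expose a getInts API" could buy us, assume a reader interface
that hands back a document's ordinals as ints. Everything below (OrdinalsReader, Ints,
fromArrays) is a hypothetical name made up for illustration, not an existing Lucene class
and not a proposed signature. The point is only that counting and sum-score can both be
written against the same getInts call, so a faster format behind it would help both
without a specialized copy of each.

```java
// Hypothetical illustration class, not part of Lucene.
public class GetIntsSketch {

    // Reusable buffer for one document's ordinals (hypothetical).
    static final class Ints {
        int[] values = new int[8];
        int length;
    }

    // The only thing aggregators need from the format (hypothetical).
    interface OrdinalsReader {
        void getInts(int docID, Ints reuse);
    }

    // Stand-in "format" backed by plain arrays; a real one would decode from the index.
    static OrdinalsReader fromArrays(int[][] perDoc) {
        return (docID, reuse) -> {
            int[] src = perDoc[docID];
            if (reuse.values.length < src.length) {
                reuse.values = new int[src.length];
            }
            System.arraycopy(src, 0, reuse.values, 0, src.length);
            reuse.length = src.length;
        };
    }

    // Counting: add 1 per category occurrence in each matching doc.
    static void count(OrdinalsReader reader, int[] docs, int[] counts) {
        Ints buf = new Ints();
        for (int doc : docs) {
            reader.getInts(doc, buf);
            for (int i = 0; i < buf.length; i++) {
                counts[buf.values[i]]++;
            }
        }
    }

    // Sum-score: add the doc's score per category; scores[i] belongs to docs[i].
    static void sumScores(OrdinalsReader reader, int[] docs, float[] scores, float[] sums) {
        Ints buf = new Ints();
        for (int d = 0; d < docs.length; d++) {
            reader.getInts(docs[d], buf);
            for (int i = 0; i < buf.length; i++) {
                sums[buf.values[i]] += scores[d];
            }
        }
    }
}
```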

And if we do a FacetsCodec w/ CategoryListFormat, then at first it can continue to abuse DV,
but then in the future we can explore a different format to manage the per-document categories
(and support category associations). Maybe even a way to manage the taxonomy in the main index,
in its own data structure ...

Perhaps these two issues show the usefulness of having such a Codec?
                
> Add a CountingFacetsAggregator which reads ordinals from a cache
> ----------------------------------------------------------------
>
>                 Key: LUCENE-4769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4769
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: LUCENE-4769.patch
>
>
> Mike wrote a prototype of a FacetsCollector which reads ordinals from a CachedInts structure
> on LUCENE-4609. I ported it to the new facets API, as a FacetsAggregator. I think we should
> offer users the means to use such a cache, even if it consumes more RAM. Mike's tests show that
> this cache consumed 2x more RAM than if the DocValues were loaded into memory in their raw
> form. Also, a PackedInts version of such a cache took almost the same amount of RAM as a straight
> int[], but the gains were minor.
> I will post the patch shortly.
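
For readers following along, the idea of counting from cached ordinals can be sketched
roughly as below. This is not the attached patch and not the CachedInts structure from
LUCENE-4609; the class name and the offsets/ordinals layout are assumptions chosen for
illustration. The ordinals of all documents are decoded once into a flat int[], so the
counting loop does no per-document decoding, which is the RAM-for-speed trade-off the
description refers to.

```java
// Hypothetical illustration class, not the patch attached to this issue.
public class CachedOrdsCountingSketch {

    // Ordinals of doc d live in ordinals[offsets[d] .. offsets[d + 1]).
    final int[] offsets;
    final int[] ordinals;

    // Build the cache once from already-decoded per-document ordinals.
    CachedOrdsCountingSketch(int[][] perDocOrdinals) {
        offsets = new int[perDocOrdinals.length + 1];
        int total = 0;
        for (int d = 0; d < perDocOrdinals.length; d++) {
            offsets[d] = total;
            total += perDocOrdinals[d].length;
        }
        offsets[perDocOrdinals.length] = total;
        ordinals = new int[total];
        for (int d = 0; d < perDocOrdinals.length; d++) {
            System.arraycopy(perDocOrdinals[d], 0, ordinals, offsets[d], perDocOrdinals[d].length);
        }
    }

    // Increment one counter per category occurrence in each matching doc.
    void count(int[] matchingDocs, int[] counts) {
        for (int doc : matchingDocs) {
            for (int i = offsets[doc], end = offsets[doc + 1]; i < end; i++) {
                counts[ordinals[i]]++;
            }
        }
    }
}
```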

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

