lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
Date Fri, 17 Apr 2009 13:44:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700177#action_12700177 ]

Michael McCandless commented on LUCENE-831:
-------------------------------------------

I've been struggling with the "right" way forward here... despite
following all comments and aggressive ongoing mulling, I still don't
have much clarity.

It feels like one of those features that just hasn't quite "clicked"
yet (to me at least).  In fact, the more I try to think about it, the
less clarity I get!

I think there are some concrete reasons to create a new API (some
overlap w/ Mark's list above):

  * Make caching "external"/public so you can control when things are
    evicted

  * Cleaner API -- it's just awkward that you now must call a separate
    place (ExtendedFieldCache.EXT_DEFAULT) to getInts.  FieldCache &
    ExtendedFieldCache are awkward, and they are interfaces.  It makes
    more sense to ask the reader directly for ints (or a future
    component of the reader).

  * Better extensibility on uninversion (either via "you make your own
    ValueSource entirely", or "you can subclass Uninverted and tweak
    it").  Trie needs this (though, we have a viable approach in field
    cache).  Fields with more than one value want custom control to
    pick one.

  * Making it not-so-easy to get all field values at the reader level
    (don't set dangerous API traps)
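
To make the first two bullets a bit more concrete, here is a minimal
sketch of a caller-owned, per-reader cache with explicit eviction.
Everything in it is hypothetical: Reader, uninvertInts and
ReaderValueCache are illustration-only names, not existing Lucene
classes, and this is not a proposal for the actual API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ExternalFieldCacheSketch {

    /** Hypothetical stand-in for a reader that can uninvert a field into ints. */
    interface Reader {
        int[] uninvertInts(String field);
    }

    /**
     * Hypothetical caller-owned cache: one instance per reader, instead of a
     * global static map keyed on the reader. The caller decides when to evict.
     */
    static final class ReaderValueCache {
        private final Reader reader;
        private final Map<String, int[]> ints = new ConcurrentHashMap<>();

        ReaderValueCache(Reader reader) {
            this.reader = reader;
        }

        /** Returns the cached values, uninverting only on the first request. */
        int[] getInts(String field) {
            return ints.computeIfAbsent(field, reader::uninvertInts);
        }

        /** Explicit eviction; returns true if an entry was actually dropped. */
        boolean evict(String field) {
            return ints.remove(field) != null;
        }
    }

    public static void main(String[] args) {
        Reader reader = field -> new int[] {10, 25, 7}; // fake uninversion
        ReaderValueCache cache = new ReaderValueCache(reader);
        System.out.println(cache.getInts("price").length);
        System.out.println(cache.evict("price"));
    }
}
```

Since the cache is an ordinary object held by whoever owns the reader,
independent readers share no lock, and dropping the cache is just
dropping a reference.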

Honestly, are these reasons, net/net, compelling enough to warrant a
whole new API?  They are fairly minor.  And I agree: LUCENE-1483 has
already achieved the biggest step forward here.

Furthermore, there are other innovations happening that may affect how
we do this. EG LUCENE-1597 introduces type information for fields (at
least at indexing time), and Earwin is working on "componentizing"
SegmentReader.  Normally I don't like letting "big distant future
feature X" prevent progess on "today's feature Y", but since we lack
clarity on Y...

I can imagine a future when the FieldType would be the central place
that records all details for a field:

  * The analyzer to use (so we don't need PerFieldAnalyzerWrapper)

  * The ValueSource

  * Its "native" type (now "switched" in many places, like
    FieldComparator, SortField, FieldCache, etc.)

  * All the index-time configuration

And then instead of having ValueSource dispatch per field, we'd simply
ask the FieldType what its source is.
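
A rough sketch of what that lookup might look like, assuming a
hypothetical FieldType that carries the native type and the source
(FieldType, FieldTypes, NativeType and IntValueSource are all made-up
names here, and the analyzer is reduced to a plain string to keep the
example self-contained):

```java
import java.util.HashMap;
import java.util.Map;

public class FieldTypeSketch {

    /** Hypothetical per-field source of values: one int per document. */
    interface IntValueSource {
        int intValue(int docId);
    }

    /** Hypothetical native type tag, recorded once instead of switched on in many places. */
    enum NativeType { INT, LONG, FLOAT, DOUBLE, STRING }

    /** Hypothetical central record of the details for one field. */
    static final class FieldType {
        final String analyzerName;   // stand-in for a real analyzer instance
        final NativeType nativeType;
        final IntValueSource source;

        FieldType(String analyzerName, NativeType nativeType, IntValueSource source) {
            this.analyzerName = analyzerName;
            this.nativeType = nativeType;
            this.source = source;
        }
    }

    /** Hypothetical registry mapping field name to its FieldType. */
    static final class FieldTypes {
        private final Map<String, FieldType> byField = new HashMap<>();

        void register(String field, FieldType type) {
            byField.put(field, type);
        }

        /** No per-field dispatch inside the source: just ask the FieldType. */
        IntValueSource sourceFor(String field) {
            FieldType type = byField.get(field);
            if (type == null) {
                throw new IllegalArgumentException("unknown field: " + field);
            }
            return type.source;
        }
    }

    public static void main(String[] args) {
        int[] prices = {10, 25, 7};
        FieldTypes types = new FieldTypes();
        types.register("price", new FieldType("keyword", NativeType.INT, doc -> prices[doc]));
        System.out.println(types.sourceFor("price").intValue(1));
    }
}
```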

Finally, there are a number of future improvements we should take into
account.  We wouldn't try to accomplish these right now, but we ought
to think about them (eg, not preclude them) in whatever approach we
settle on:

  * We need source pluggability for when CSF arrives (but, admittedly,
    we could wait until CSF actually does arrive)

  * Allowing values to change, just like we can call
    IndexReader.setNorm/deleteDoc to change norms/deletes. We'd need a
    copy-on-write approach, like norms & deleted docs.

  * How would norms be folded into this?  Ideally, each field could
    choose to pull its norms from any source.  Document level norms
    was discussed somewhere, and should easily "fit" as another norms
    source.  We'd need to relax how per-doc-field boosting is computed
    at runtime to pull from such "arbitrary" sources.

  * Deleted docs could also be represented as a ValueSource?  Just one
    bit per doc.  This way one could swap in whatever source for
    "deleted docs" one wanted.

  * Allowing for docs that have more than one value.  (We'd also need
    to extend sorting to be able to compare multiple values).

  * An mmap implementation (like Lucy/KS) -- should feel just like CSF
    or uninversion (ie, "just another impl").

  * Impls of getStrings and getStringIndex that are based on offsets
    into char[] (not actual individual String objects).

  * Good impls for the enum case (all strings could be considered
    enums), eg if there are only 100 unique strings in that field, you
    only need 7 bits per ord derefing into the char[] values.

  * Possible future when Lucene computes sort cache (for text fields)
    and stores in the index

  * Allowing field sort to use an entirely external source of values
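
To illustrate the two string bullets above (offsets into a char[], and
the enum case): with 100 unique values the ords run 0..99, and
ceil(log2(100)) = 7, hence 7 bits per doc.  Below is a minimal sketch,
assuming the unique values arrive already sorted and each doc has
exactly one ord.  PackedStringOrds and its methods are hypothetical
names, not an existing Lucene class.

```java
public class PackedStringOrds {
    private final int bits;        // bits used per ord
    private final long[] blocks;   // ords packed back to back, low bits first
    private final char[] chars;    // all unique values concatenated
    private final int[] offsets;   // value for ord o is chars[offsets[o] .. offsets[o + 1])

    /** Bits needed to represent ords 0 .. uniqueCount - 1. */
    static int bitsRequired(int uniqueCount) {
        if (uniqueCount <= 1) {
            return 1;
        }
        return 32 - Integer.numberOfLeadingZeros(uniqueCount - 1);
    }

    PackedStringOrds(String[] sortedUnique, int[] docOrds) {
        bits = bitsRequired(sortedUnique.length);
        blocks = new long[(int) (((long) docOrds.length * bits + 63) / 64)];
        offsets = new int[sortedUnique.length + 1];
        StringBuilder all = new StringBuilder();
        for (int i = 0; i < sortedUnique.length; i++) {
            offsets[i] = all.length();
            all.append(sortedUnique[i]);
        }
        offsets[sortedUnique.length] = all.length();
        chars = all.toString().toCharArray();
        for (int doc = 0; doc < docOrds.length; doc++) {
            set(doc, docOrds[doc]);
        }
    }

    /** Writes one ord; assumes the slot is still zero (each doc is set once). */
    private void set(int doc, int ord) {
        long bitPos = (long) doc * bits;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] |= ((long) ord) << shift;
        if (shift + bits > 64) {
            // The ord straddles two longs: spill the high bits into the next block.
            blocks[block + 1] |= ((long) ord) >>> (64 - shift);
        }
    }

    /** Reads the ord for a doc back out of the packed blocks. */
    int ord(int doc) {
        long bitPos = (long) doc * bits;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        if (shift + bits > 64) {
            value |= blocks[block + 1] << (64 - shift);
        }
        return (int) (value & ((1L << bits) - 1));
    }

    /** Materializes a String only on demand; sorting can compare ords alone. */
    String value(int doc) {
        int o = ord(doc);
        return new String(chars, offsets[o], offsets[o + 1] - offsets[o]);
    }
}
```

Because the unique values are sorted, comparing two docs for sorting
only needs ord(docA) versus ord(docB); no String object is created
unless value() is actually called.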

There's a lot to think about :)


> Complete overhaul of FieldCache API/Implementation
> --------------------------------------------------
>
>                 Key: LUCENE-831
>                 URL: https://issues.apache.org/jira/browse/LUCENE-831
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Hoss Man
>            Assignee: Mark Miller
>             Fix For: 3.0
>
>         Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff,
fieldcache-overhaul.diff, LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff,
LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch,
LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch,
LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
>
> Motivation:
> 1) Complete overhaul the API/implementation of "FieldCache" type things...
>     a) eliminate global static map keyed on IndexReader (thus
>         eliminating synch block between completely independent IndexReaders)
>     b) allow more customization of cache management (ie: use 
>         expiration/replacement strategies, disk backed caches, etc)
>     c) allow people to define custom cache data logic (ie: custom
>         parsers, complex datatypes, etc... anything tied to a reader)
>     d) allow people to inspect what's in a cache (list of CacheKeys) for
>         an IndexReader so a new IndexReader can be likewise warmed. 
>     e) Lend support for smarter cache management if/when
>         IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support existing FieldCache API with
>     the new implementation, so there is no redundant caching as client code
>     migrates to new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

