lucene-dev mailing list archives

From "Michael Busch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
Date Tue, 16 Mar 2010 04:33:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845703#action_12845703
] 

Michael Busch commented on LUCENE-2312:
---------------------------------------

{quote}
Sounds like awesome progress!! Want some details over here :)
{quote}

Sorry for not being very specific.  The prototype I'm experimenting with uses a fixed-length
postings format for the in-memory representation (in TermsHash).  Every posting occupies
exactly 4 bytes, so I can use int[] arrays (instead of the byte[] pools).  The first 3 bytes
hold an absolute docID (not delta-encoded), which limits the max in-memory segment size
to 2^24 docs.  The 1 remaining byte holds the position.  With a max doc length of 140
characters you can fit every possible position in a byte - what a luxury! :)  If a term occurs
multiple times in the same doc, the TermDocs just skips the additional postings with the
same docID and increments the freq.  Then again, the same term rarely occurs more than once
in super short docs.
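
To make the 4-byte layout concrete, here is a minimal sketch in Java. It assumes the docID sits in the upper 24 bits and the position in the lowest 8 bits of each int; the comment above only says "first 3 bytes", so the exact bit order is a guess. PackedPosting and its methods are hypothetical names for illustration, not classes from Lucene or from the prototype.

```java
/** Hypothetical helper: packs one posting into a single int (24-bit docID, 8-bit position). */
final class PackedPosting {
    /** Largest docID that fits in 3 bytes; this is where the 2^24 docs-per-segment limit comes from. */
    static final int MAX_DOC_ID = (1 << 24) - 1;
    /** Largest position that fits in the remaining byte. */
    static final int MAX_POSITION = 0xFF;

    private PackedPosting() {}

    /** Assumed layout: docID in the upper 24 bits, position in the lowest 8 bits. */
    static int encode(int docId, int position) {
        if (docId < 0 || docId > MAX_DOC_ID) {
            throw new IllegalArgumentException("docId out of range: " + docId);
        }
        if (position < 0 || position > MAX_POSITION) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        return (docId << 8) | position;
    }

    /** Unsigned shift, so docIDs with the top bit set still decode as non-negative values. */
    static int docId(int posting) {
        return posting >>> 8;
    }

    static int position(int posting) {
        return posting & 0xFF;
    }

    /**
     * Counts how many consecutive postings in [start, end) share the docID found at 'start'.
     * This mirrors the "skip same docID and increment the freq" behaviour described above.
     */
    static int freq(int[] postings, int start, int end) {
        int doc = docId(postings[start]);
        int freq = 1;
        for (int i = start + 1; i < end && docId(postings[i]) == doc; i++) {
            freq++;
        }
        return freq;
    }
}
```

Because the docID is stored as an absolute value, a reader can decode any single int without having seen the postings before it, which delta encoding would not allow.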

Unlike the slices in Lucene's TermsHash, the int[] slices don't have forward pointers but
backward pointers.  In real-time search you often want a strongly time-biased ranking, so it
helps to visit the newest postings first.  A PostingList object has a pointer to the last
posting (this statement is not 100% correct for visibility reasons across threads, but we can
imagine it this way for now).  A TermDocs can then traverse the posting lists in reverse
order.  Skipping can be done by following pointers to previous slices directly, or by binary
search within a slice.
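
The reverse traversal can be sketched roughly as follows. This is a simplified single-threaded model, not the prototype's code: it ignores the cross-thread visibility issue mentioned above, uses fixed-capacity slices linked by object references instead of offsets into a shared int[] pool, and assumes the same docID/position bit layout as a 24-bit docID followed by an 8-bit position. BackwardPostingList, Slice and docsNewestFirst are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one term's postings stored in backward-linked int[] slices. */
final class BackwardPostingList {
    /** One slice; 'previous' is the backward pointer (null for the oldest slice). */
    private static final class Slice {
        final int[] postings;
        final Slice previous;
        int size;

        Slice(int capacity, Slice previous) {
            this.postings = new int[capacity];
            this.previous = previous;
        }
    }

    private final int sliceCapacity;
    private Slice last; // the slice that holds the most recent posting

    BackwardPostingList(int sliceCapacity) {
        this.sliceCapacity = sliceCapacity;
    }

    /** Appends a posting; callers must add docIDs in non-decreasing order. */
    void add(int docId, int position) {
        if (last == null || last.size == last.postings.length) {
            last = new Slice(sliceCapacity, last);
        }
        last.postings[last.size++] = (docId << 8) | (position & 0xFF);
    }

    /** Returns the distinct docIDs that are <= maxDocId, newest first. */
    List<Integer> docsNewestFirst(int maxDocId) {
        List<Integer> result = new ArrayList<>();
        Slice slice = last;
        // Skip whole slices by following backward pointers while even the
        // oldest posting in the slice is newer than maxDocId.
        while (slice != null && (slice.postings[0] >>> 8) > maxDocId) {
            slice = slice.previous;
        }
        if (slice == null) {
            return result;
        }
        // Inside the first relevant slice, binary search for the starting point.
        int index = lastIndexAtOrBelow(slice, maxDocId);
        int previousDoc = -1;
        while (slice != null) {
            for (; index >= 0; index--) {
                int docId = slice.postings[index] >>> 8;
                if (docId != previousDoc) { // repeated docIDs only contribute to freq
                    result.add(docId);
                    previousDoc = docId;
                }
            }
            slice = slice.previous;
            if (slice != null) {
                index = slice.size - 1;
            }
        }
        return result;
    }

    /** Binary search for the highest index whose docID is <= maxDocId, or -1 if none. */
    private static int lastIndexAtOrBelow(Slice slice, int maxDocId) {
        int low = 0;
        int high = slice.size - 1;
        int answer = -1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if ((slice.postings[mid] >>> 8) <= maxDocId) {
                answer = mid;
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return answer;
    }
}
```

Binary search inside a slice works here only because every posting has the same width and stores an absolute docID; with delta-encoded variable-length postings a slice would have to be decoded sequentially.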

> Search on IndexWriter's RAM Buffer
> ----------------------------------
>
>                 Key: LUCENE-2312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2312
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 3.0.1
>            Reporter: Jason Rutherglen
>            Assignee: Michael Busch
>             Fix For: 3.1
>
>
> In order to offer users near-realtime search without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable.
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing.
> Michael Busch has good suggestions regarding how to handle deletes using max doc ids:
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

