lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system
Date Mon, 13 Nov 2006 19:00:40 GMT
Michael McCandless commented on LUCENE-532:

Thank you for the patch & unit test!

This is actually the same approach that I started with.  But I ruled
it out because I don't think it's safe to do arithmetic (i.e., adding
lengths to compute positions) on file positions.

Meaning, one can imagine a Directory implementation that's doing some
kind of compression where on writing N bytes the file position does
not in fact advance by N bytes.  Or maybe an implementation that must
escape certain bytes, or it's writing to XML or using some kind of
alternate coding system, or something along these lines.  I don't know
if such Directory implementations exist today, but, I don't want to
break them if they do nor preclude them in the future.
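The concern above can be made concrete with a toy sketch. This is a hypothetical escaping output (not a real Lucene Directory implementation): it escapes the byte 0x00 by writing two bytes, so after writing N logical bytes the on-disk position can be greater than N, and arithmetic on logical lengths no longer yields valid seek targets.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only: an output that escapes the byte
// 0x00 by writing the two bytes 0xFF 0x00. The logical position
// (bytes written by the caller) and the physical on-disk position
// diverge, so computed positions are not safe to pass to seek().
public class EscapingOutput {
    private final ByteArrayOutputStream disk = new ByteArrayOutputStream();
    private long logicalBytes = 0;

    public void writeByte(int b) {
        if (b == 0x00) {       // escape: 0x00 -> 0xFF 0x00
            disk.write(0xFF);
            disk.write(0x00);
        } else {
            disk.write(b);
        }
        logicalBytes++;        // caller sees one byte written
    }

    public long logicalPosition() { return logicalBytes; }
    public long diskPosition()    { return disk.size(); }

    public static void main(String[] args) {
        EscapingOutput out = new EscapingOutput();
        for (int b : new int[] {1, 0, 2, 0}) out.writeByte(b);
        // 4 logical bytes written, but 6 bytes actually on disk:
        System.out.println(out.logicalPosition() + " " + out.diskPosition());
    }
}
```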

And so the only value you should ever pass to "seek()" is a value you
previously obtained by calling "getFilePosition()".  The current
javadocs for these methods seem to imply this.

However, on looking into this question further ... I do see that there
are places now where Lucene already does arithmetic on file positions.
For example, in accessing a *.fdx file or *.tdx file we assume we can
find a given entry at file position FORMAT_SIZE + 8 * index.
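The fixed-width-entry assumption above can be sketched as follows, using plain java.io.RandomAccessFile and a hypothetical layout (a 4-byte header followed by 8-byte entries) rather than the actual *.fdx format:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class IndexArithmeticSketch {
    // Hypothetical layout mirroring the *.fdx-style assumption:
    // a fixed-size header followed by fixed-width 8-byte entries,
    // so entry i lives at byte position HEADER_SIZE + 8 * i.
    static final int HEADER_SIZE = 4;

    public static long readEntry(File f, int index) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            // This seek is only valid if "file position" really means
            // "number of bytes written" -- the contract in question.
            raf.seek(HEADER_SIZE + 8L * index);
            return raf.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("fdx-sketch", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.writeInt(42);                                  // header
            for (long v = 100; v < 105; v++) raf.writeLong(v); // entries
        }
        System.out.println(readEntry(f, 3)); // -> 103
    }
}
```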

Maybe it is OK to make the definition of getFilePosition() stricter,
by requiring that the position it returns is in fact always the same
as "the number of bytes written", thereby allowing us to do arithmetic
based on bytes/lengths and call seek() with such values?  I'm nervous
about making this API change.

I think this is the open question.  Does anyone have any input to help
answer this question?

Lucene currently makes this assumption, albeit in a fairly contained
way I think (most other calls to seek() seem to use values previously
obtained from getFilePosition()).

> [PATCH] Indexing on Hadoop distributed file system
> --------------------------------------------------
>                 Key: LUCENE-532
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 1.9
>            Reporter: Igor Bolotin
>            Priority: Minor
>         Attachments: cfs-patch.txt, indexOnDFS.patch, SegmentTermEnum.patch, TermInfosWriter.patch
> In my current project we needed a way to create very large Lucene indexes on
> the Hadoop distributed file system. When we tried to do it directly on DFS
> using the Nutch FsDirectory class, we immediately found that indexing fails
> because the seek() method throws UnsupportedOperationException. The reason for
> this behavior is clear: DFS does not support random updates, and so the seek()
> method can't be supported (at least not easily).
> Well, if we can't support random updates, the question is: do we really need
> them? A search in the Lucene code revealed 2 places which call the seek()
> method: one is in TermInfosWriter and the other in CompoundFileWriter. As we
> weren't planning to use CompoundFileWriter, the only place that concerned us
> was in TermInfosWriter.
> TermInfosWriter uses seek() in its close() method to write the total number of
> terms in the file back into the beginning of the file. It was very simple to
> change the file format a little bit and write the number of terms into the
> last 8 bytes of the file instead of into the beginning of the file. The only
> other place that should be fixed in order for this to work is the
> SegmentTermEnum constructor - to read this piece of information at position =
> file length - 8.
> With this format hack we were able to use FsDirectory to write an index
> directly to DFS without any problems. Well, we still don't index directly to
> DFS for performance reasons, but at least we can build small local indexes and
> merge them into the main index on DFS without copying the big main index back
> and forth.
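The append-only trailer trick described in the quoted report can be sketched as follows. This is a hypothetical file layout, not the actual TermInfosWriter format: instead of seeking back to patch the term count into the header, the writer appends it as the last 8 bytes, and the reader fetches it at position = file length - 8.

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class TrailerCountSketch {
    // Write entries strictly sequentially (DFS-friendly: no backward
    // seeks), then append the entry count as an 8-byte trailer.
    static void writeTerms(File f, long[] terms) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream(f))) {
            for (long t : terms) out.writeLong(t); // forward-only writes
            out.writeLong(terms.length);           // count goes at the end
        }
    }

    // Reading may still seek freely; only writing had to be forward-only.
    static long readTermCount(File f) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(raf.length() - 8);  // trailer: last 8 bytes
            return raf.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("tis-sketch", ".bin");
        f.deleteOnExit();
        writeTerms(f, new long[] {7, 8, 9});
        System.out.println(readTermCount(f)); // -> 3
    }
}
```

The design point is that the writer never needs random updates, which is exactly the constraint an append-only file system such as DFS imposes.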

This message is automatically generated by JIRA.
