lucene-dev mailing list archives

From "Ning Li (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system
Date Wed, 03 Sep 2008 15:29:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628025#action_12628025 ]

Ning Li commented on LUCENE-532:
--------------------------------

Does the use of seek followed by write in ChecksumIndexOutput make it less likely that Lucene
can support purely sequential writes (i.e. no write after a seek)? ChecksumIndexOutput is
currently used by SegmentInfos.

> [PATCH] Indexing on Hadoop distributed file system
> --------------------------------------------------
>
>                 Key: LUCENE-532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-532
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 1.9
>            Reporter: Igor Bolotin
>            Priority: Minor
>         Attachments: cfs-patch.txt, indexOnDFS.patch, SegmentTermEnum.patch, TermInfosWriter.patch
>
>
> In my current project we needed a way to create very large Lucene indexes on the Hadoop
> distributed file system (DFS). When we tried to do it directly on DFS using the Nutch FsDirectory
> class, we immediately found that indexing fails because the DfsIndexOutput.seek() method throws
> UnsupportedOperationException. The reason for this behavior is clear: DFS does not support
> random updates, so the seek() method can't be supported (at least not easily).
>  
> Well, if we can't support random updates, the question is: do we really need them? A search
> of the Lucene code revealed two places that call the IndexOutput.seek() method: one in
> TermInfosWriter and another in CompoundFileWriter. As we weren't planning to use
> CompoundFileWriter, the only place that concerned us was TermInfosWriter.
>  
> TermInfosWriter uses IndexOutput.seek() in its close() method to write the total number of
> terms in the file back to the beginning of the file. It was very simple to change the file
> format slightly and write the number of terms into the last 8 bytes of the file instead of
> the beginning. The only other place that has to be fixed for this to work is the
> SegmentTermEnum constructor, which must read this value at position = file length - 8.
>  
> With this format hack we were able to use FsDirectory to write the index directly to DFS
> without any problems. We still don't index directly to DFS, for performance reasons,
> but at least we can build small local indexes and merge them into the main index on DFS
> without copying the big main index back and forth.
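
The format change described in the issue (term count in the last 8 bytes instead of at
the start of the file) can be illustrated with a minimal sketch. This is not the code from
the attached patches and it does not use Lucene's IndexOutput API: the class
TrailerCountSketch and its methods writeTerms and readTermCount are hypothetical names,
and plain java.io streams over a byte array stand in for the DFS file.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical illustration only; not the TermInfosWriter/SegmentTermEnum patch.
public class TrailerCountSketch {

    /** Writes all terms strictly in order, then appends the term count as the final 8 bytes. */
    public static byte[] writeTerms(List<String> terms) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            long count = 0;
            for (String term : terms) {
                out.writeUTF(term);   // term data is only ever appended
                count++;
            }
            // The count is known only now; append it instead of seeking back to offset 0.
            out.writeLong(count);
        }
        return bytes.toByteArray();
    }

    /** Reads the term count from position = file length - 8. */
    public static long readTermCount(byte[] file) {
        if (file.length < 8) {
            throw new IllegalArgumentException("file too short to contain the 8-byte trailer");
        }
        // DataOutputStream writes big-endian, which is also ByteBuffer's default byte order.
        return ByteBuffer.wrap(file, file.length - 8, 8).getLong();
    }

    public static void main(String[] args) throws IOException {
        byte[] file = writeTerms(List.of("apache", "hadoop", "lucene"));
        System.out.println(readTermCount(file)); // prints 3
    }
}
```

The writer never repositions its output, so the same pattern works on a store that only
supports appends. The cost is on the reader, which must know the file length before it
can locate the count.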

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

