hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tarandeep Singh <tarand...@gmail.com>
Subject Re: Indexing on top of Hadoop
Date Wed, 10 Jun 2009 17:02:12 GMT
We have built basic index support in CloudBase (a data warehouse on top of
Hadoop- http://cloudbase.sourceforge.net/) and can share our experience

The index we built is like a Hash Index- for a given column/field value, it
tries to process only those data blocks which contain that value while
ignoring rest of the blocks. As you would know, Hadoop stores data in the
form of data blocks (64MB default size). During index creation stage ( a Map
reduce job), we store the distinct column/field values along with their
block information (node, filename, offset etc) in Hadoop's MapFile. A Map
file is a SequenceFile (stores key,value pairs) which is sorted on keys. So
you can see, it is like building an inverted index, where keys are the
column/field values and posting lists are the lists containing blocks
information. A Map file is not a very efficient persistent map, so you can
store this inverted index in local file system using something like Lucene
Index, but as your index grows you are consuming local space of Master node.
Hence we decided to store it in HDFS using Map file.

When you query using the column/field that was indexed, first this inverted
index (MapFile) is consulted to find all the blocks that contain the desired
value and InputSplits are created. As you would know a Mapper works on Input
Split and in normal cases, Hadoop will schedule jobs that will work on all
input splits but in this case, we have written our own InputFormat that will
schedule jobs only on required input splits.

We have measured performance of this approach in some cases we have got 97%
improvement (we indexed on date field and ran queries on year's log files
but fetching only one or few particular day(s) of data).

Now some caveats-

1) If the column, field that you are indexing contain a large number of
distinct values, then your inverted index (the MapFile) is gonna bloat up
and this file is looked up before any Map Reduce job is started so this
means this code runs on master node. This can become very slow. For example,
if you have query logs and you index query column/field, then the size of
Map file can become really huge. Using Lucene Index on local file sytem can
speed up the lookup.

2) It does not work on range queries like column < 10 etc. To solve this, we
store the min and max value of the column/field found in a block also. That
is for each block, min and max value of the column/field is stored and when
we encounter range query, this list is scanned to eliminate some blocks.

Another indexing approaches we have thougth of-

To solve problem 1) where the size of inverted index (MapFile) becomes huge,
we can use an approach called BloomIndexing. This approach makes use of
BloomFilters (space-efficient probabilistic data structures that is used to
test whether an element is a member of a set or not). For each data block, a
bloom filter is constructed. For 64MB, 128MB, 256MB data block sizes, the
bloom filter will be extremely small. During index creation stage (Map job,
no reducer), a mapper reads the blok/input split and creates a bloom fitler
for the column/filed on which you want to create your index and stores it in

During query phase, before processing of the input split/data block first
the corresponding bloom filter is consulted to see if the data block
contains the desired value or not. As you would know a bloom filter will
never give false negative, so if bloom test fails, you can safely ignore
processing of that block. This will save you time that you would have spent
processing each row/line in the data block.

Advantages of this approach- As compared to HashIndexing, this apporach
scatters the block selection logic in Map Reduce job so master node is not
overloaded to scan huge inverted index.

Disadvantages of this approach- You still have to schedule as many mappers
as the number of input splits/data blocks and starting JVMs incur overheads,
however since hadoop-0.19 you can use "reuse jvm flag" to avoid some
overheads. Further you can increase your block size to 128 or 256MB that
will give you considerable performance improvent.

Hope this helps,

On Wed, Jun 10, 2009 at 5:49 AM, kartik saxena <kartik.sxn@gmail.com> wrote:

> Hi,
> I have a huge  LDIF file in order of GBs spanning some million user
> records.
> I am running the example "Grep" job on that file. The search results have
> not really been
> upto expectations because of it being a basic per line , brute force.
> I was thinking of building some indexes inside HDFS for that file , so that
> the search results could improve. What could I possibly try to achieve
> this?
> Secura

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message