accumulo-user mailing list archives

From "mohit.kaushik" <>
Subject Re: Document Partitioned Indexing
Date Thu, 01 Oct 2015 09:54:29 GMT

I am trying to find the best way to index document data by time in
Accumulo. The core objective is to make time-based queries fast and
efficient. I have two types of date (there may be more later), and I
want to index and query the data on both.

Since I have two timestamps (dates), I am considering two layouts. For
the first, I create an index that stores the time in the Rowid for one
timestamp. This way I can build partial start and end row IDs and pass
them to a scanner as a range. For the other, I group the document index
entries by time, say per hour or per minute (one minute of data, around
2500 docs, goes into a single row). The Rowid therefore contains the
"time", the CF contains the "field/value", and the CQ contains the
"DocId".
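
To make the two layouts concrete, here is a minimal sketch in plain
Python that models a table as a sorted list of (row, CF, CQ) keys. It
does not use the Accumulo client API. The timestamp format, the "_"
separator, the sample field/values and doc IDs, and the helper names
(ts_row, minute_row, scan) are all assumptions made for illustration. I
also assume the first index keeps the "field/value" in the CF with an
empty CQ.

```python
from bisect import bisect_left
from datetime import datetime

def ts_row(ts, doc_id):
    # Layout A (assumed format): fixed-width timestamp + "_" + doc ID,
    # so each document gets its own row and rows sort by time.
    return f"{ts.strftime('%Y%m%d%H%M%S')}_{doc_id}"

def minute_row(ts):
    # Layout B (assumed format): one row per minute bucket.
    return ts.strftime('%Y%m%d%H%M')

# Hypothetical sample documents: (timestamp, field/value, doc ID).
docs = [
    (datetime(2015, 10, 1, 9, 54, 29), "city/Pune", "doc1"),
    (datetime(2015, 10, 1, 9, 54, 45), "city/Delhi", "doc2"),
    (datetime(2015, 10, 1, 9, 55, 2), "city/Pune", "doc3"),
]

# Keys are (row, CF, CQ) tuples kept in sorted order, like a tablet.
index_a = sorted((ts_row(ts, d), fv, "") for ts, fv, d in docs)
index_b = sorted((minute_row(ts), fv, d) for ts, fv, d in docs)

def scan(index, start_row, end_row, cf):
    # Model of a range scan over [start_row, end_row) restricted to one
    # column family. A 1-tuple sorts before any 3-tuple sharing its row.
    lo = bisect_left(index, (start_row,))
    hi = bisect_left(index, (end_row,))
    return [key for key in index[lo:hi] if key[1] == cf]

# Partial row IDs bound the same one-minute window in both layouts.
hits_a = scan(index_a, "20151001095400", "20151001095500", "city/Pune")
hits_b = scan(index_b, "201510010954", "201510010955", "city/Pune")

ids_a = [row.split("_", 1)[1] for row, _, _ in hits_a]  # doc ID is in the row
ids_b = [cq for _, _, cq in hits_b]                      # doc ID is in the CQ
```

In this model both scans are bounded by the row range and then filtered
by CF. It says nothing about which layout is faster on a real cluster.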

(1) If I fetch a "field/value" as CF for the same time range from both
indexes, which one would be faster?
(2) If I create locality groups dynamically for every value in the CF
(field/value) and there are around 10000 distinct field/values in total
(say an index over location/city, with 100000 or more documents indexed
per city on average), that means 10000 locality groups. How will that
affect query performance?

Mohit Kaushik

On 09/30/2015 08:57 PM, Adam Fuchs wrote:
> Hi Tom,
> Sqrrl uses a document-distributed indexing strategy extensively. On 
> top of the reasons you mentioned, we also like the ability to 
> explicitly structure our index entries in both information content and 
> sort order. This gives us the ability to do interesting things like 
> build custom indexes and do joins between graph indexes and term indexes.
> Eventually, I'd like to see Accumulo build out explicit support for 
> this type of indexing in the core as an embedded secondary indexing 
> capability. That would solve several of the challenges around 
> compatibility with other Accumulo features and usage patterns.
> Cheers,
> Adam
> On Wed, Sep 30, 2015 at 3:48 AM, Tom D wrote:
>     Hi,
>     Have been doing a little reading about different distributed
>     (text) indexing techniques and picked up on the Document
>     Partitioned Index approach on Accumulo.
>     I am interested in the use-cases people would have for indexing
>     data in this way over using a distributed search service (Elastic
>     or SolrCloud).
>     I can think of a few reasons, but wondered if there's something
>     more obvious that I'm missing?
>     - cell (field level) access controls
>     - scale - I understand Accumulo will scale to thousands of nodes.
>     I believe there are some limitations in Elastic / Solr at about
>     100 nodes.
>     - integration with an existing schema or index in Accumulo (not
>     sure about this one and what benefits it would have over calling
>     out to a search service)
>     - you want to take advantage of other features in Accumulo, e.g.
>     Combining iterators to perform some aggregation alongside your
>     document partitioned index (again, can't imagine use cases here,
>     but maybe there are some)
>     - more control over 'messy data', e.g partial duplicates that need
>     merging at ingest
>     Are there others? Be interesting to hear if people use this
>     indexing strategy.
>     Many thanks.


*Mohit Kaushik*
Software Engineer
A Square,Plot No. 278, Udyog Vihar, Phase 2, Gurgaon 122016, India
*Tel:*+91 (124) 4969352 | *Fax:*+91 (124) 4033553