incubator-accumulo-user mailing list archives

From Jason Rutherglen <jason.rutherg...@gmail.com>
Subject Re: Scan for keyword
Date Wed, 23 Nov 2011 20:33:15 GMT
Aaron and Joe,

Another approach would be to use Lucene as a 'side' inverted index.
The postings format is highly compressed, so the index is smaller and
lookups are fast because the postings are [likely] resident in the
system IO cache.

I had a start at implementing this in HBase; however, if I did it over
again, I would avoid storing the index in HDFS, because it was too hard
to get random-access IO to the underlying file without a significant
branch of HDFS.  This is similar to how Accumulo stores its transaction
log on the local filesystem (avoiding HDFS, in contrast with HBase, which
uses HDFS sync).

Food for thought.

Jason

On Wed, Nov 23, 2011 at 3:08 PM, Aaron Cordova <aaron@cordovas.org> wrote:
> Joe,
> What you're talking about is pretty common. In fact, it's so common there
> should probably be an example included in the Accumulo-examples project for
> it. Doing it requires building another table as a secondary index, as Jason
> mentioned.  Accumulo doesn't have any special structures just for indexes;
> an index is just another table. Here's how you might go about it:
> Assuming you use some unique identifier for your row IDs, your table might
> look something like this:
> rowID  col fam      col qual  value
> 000    displayname            joey
> 000    login                  jd
> 000    name                   joe
> 001    displayname            jd
> 001    login                  joe
> 001    name                   joey
>
> I would just leave the col qual blank. Then you could build a second table
> as an index that looks like this:
> rowID  col fam      col qual  value
> jd     displayname  001
> jd     login        000
> joe    login        001
> joe    name         000
> joey   displayname  000
> joey   name         001
>
> To build this table, you can simply insert the inverted Mutations into the
> index table at the same time you're inserting records into your first table.
> To query for records in which "joe" appears in any field, you simply scan
> the entire row identified by "joe" in the index and get all the fields in
> all records where "joe" appears, thus:
> scanner.setRange(new Range("joe"));
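A minimal in-memory sketch of the two steps described above: writing the inverted entry alongside each record, then scanning the whole index row for a term. It does not use the Accumulo API. A `TreeMap` stands in for a sorted table, and the names `IndexBuildSketch`, `K`, `put`, `load`, and `lookup` are hypothetical, introduced only for illustration. Visibility and timestamps are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class IndexBuildSketch {
    // Simplified key: row, column family, column qualifier (assumption: no visibility or timestamp).
    static final class K implements Comparable<K> {
        final String row, fam, qual;
        K(String row, String fam, String qual) { this.row = row; this.fam = fam; this.qual = qual; }
        // Sort by row, then family, then qualifier, as a sorted table would.
        public int compareTo(K o) {
            int c = row.compareTo(o.row);
            if (c == 0) c = fam.compareTo(o.fam);
            if (c == 0) c = qual.compareTo(o.qual);
            return c;
        }
    }

    // In-memory stand-ins for the record table and the index table.
    static final TreeMap<K, String> records = new TreeMap<>();
    static final TreeMap<K, String> index = new TreeMap<>();

    // Write one field to the record table and the inverted entry to the index.
    static void put(String rowId, String field, String value) {
        records.put(new K(rowId, field, ""), value); // col qual left blank
        index.put(new K(value, field, rowId), "");   // value becomes the row, rowID the qualifier
    }

    // Load the two example records from the message.
    static void load() {
        put("000", "displayname", "joey");
        put("000", "login", "jd");
        put("000", "name", "joe");
        put("001", "displayname", "jd");
        put("001", "login", "joe");
        put("001", "name", "joey");
    }

    // Return "field:rowID" for every index entry in the row named by the term.
    static List<String> lookup(String term) {
        List<String> hits = new ArrayList<>();
        K from = new K(term, "", "");        // first possible key in the row
        K to = new K(term + "\0", "", "");   // first possible key of the next row (exclusive)
        for (K k : index.subMap(from, to).keySet()) {
            hits.add(k.fam + ":" + k.qual);
        }
        return hits;
    }

    public static void main(String[] args) {
        load();
        System.out.println(lookup("joe")); // prints [login:001, name:000]
    }
}
```

The row for "joey" is not returned for the term "joe", because the scan stops at the first key of the row that sorts immediately after "joe".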
> To get records where "joe" appears in a specific field, say the name field,
> alter your scan to include a more specific range:
> scanner.setRange(new Range(
>     new Key(new Text("joe"), new Text("name"), new Text("")),
>     new Key(new Text("joe"), new Text("name\0"), new Text(""))));
>
> That range spans joe name to joe name\0, which includes all column
> qualifiers up to the next column family.
> You can then pull out the column qualifiers from the index to get the
> rowIDs.
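The effect of the narrower range can be sketched in memory as follows. This is not Accumulo code. A `TreeSet` stands in for the sorted index table, the scan uses a half-open interval, and the names `FamilyRangeSketch`, `K`, and `rowIdsFor` are hypothetical. The end key appends `\0` to the column family, so every qualifier under that family falls inside the interval and nothing from the next family does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class FamilyRangeSketch {
    // Simplified key: row, column family, column qualifier (assumption: no visibility or timestamp).
    static final class K implements Comparable<K> {
        final String row, fam, qual;
        K(String row, String fam, String qual) { this.row = row; this.fam = fam; this.qual = qual; }
        // Sort by row, then family, then qualifier.
        public int compareTo(K o) {
            int c = row.compareTo(o.row);
            if (c == 0) c = fam.compareTo(o.fam);
            if (c == 0) c = qual.compareTo(o.qual);
            return c;
        }
    }

    // The index table from the message: term as row, field as family, rowID as qualifier.
    static final TreeSet<K> index = new TreeSet<>();
    static {
        index.add(new K("jd", "displayname", "001"));
        index.add(new K("jd", "login", "000"));
        index.add(new K("joe", "login", "001"));
        index.add(new K("joe", "name", "000"));
        index.add(new K("joey", "displayname", "000"));
        index.add(new K("joey", "name", "001"));
    }

    // Collect the rowIDs (column qualifiers) where the term appears in the given field.
    static List<String> rowIdsFor(String term, String family) {
        K start = new K(term, family, "");       // e.g. joe name ""
        K end = new K(term, family + "\0", "");  // e.g. joe name\0 "" (exclusive here)
        List<String> rowIds = new ArrayList<>();
        for (K k : index.subSet(start, end)) {
            rowIds.add(k.qual);
        }
        return rowIds;
    }

    public static void main(String[] args) {
        System.out.println(rowIdsFor("joe", "name"));        // prints [000]
        System.out.println(rowIdsFor("joe", "displayname")); // prints []
    }
}
```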
> If you want to lookup values from each of those rows, you could then put
> them in a List and pass them to a BatchScanner. There is code for this in
> the Indexing subsection of the Table Design section of the manual:
> Text term = new Text("mySearchTerm");
>
> HashSet<Range> matchingRows = new HashSet<Range>();
>
> Scanner indexScanner = conn.createScanner("index", auths);
> indexScanner.setRange(new Range(term, term));
>
> // we retrieve the matching rowIDs (stored in the index's column
> // qualifiers) and create a set of ranges
> for(Entry<Key,Value> entry : indexScanner)
>     matchingRows.add(new Range(entry.getKey().getColumnQualifier()));
>
> // now we pass the set of ranges to the batch scanner to retrieve the rows
> BatchScanner bscan = conn.createBatchScanner("table", auths, 10);
>
> bscan.setRanges(matchingRows);
> bscan.fetchColumnFamily(new Text("attributes"));
>
> for(Entry<Key,Value> entry : bscan)
>     System.out.println(entry.getValue());
>
>
> This whole process is more complicated than I'd like it to be, but it works
> pretty well and people have built huge tables and indexes this way. You can
> get very fancy with what and how you choose to index.
> Let us know how this goes for you.
> Aaron
>
> On Nov 23, 2011, at 2:35 PM, Joey Daughtery wrote:
>
> Aaron
> Thanks for the reply.  I was only able to get data into Accumulo after
> reviewing the page you provided.
>
> Let's say for example that I am storing Name, login, and DisplayName as
> the column families, and I have inserted Joe, jd, joey as one record and
> joey, joe, jd as the second record.
>
> mut.put(new Text("Name"), new Text("joe"), cv, new Value("joe".getBytes()));
> mut.put(new Text("login"), new Text("jd"), cv, new Value("jd".getBytes()));
> mut.put(new Text("DisplayName"), new Text("joey"), cv, new Value("joey".getBytes()));
> write(...)
>
> mut.put(new Text("Name"), new Text("joey"), cv, new Value("joey".getBytes()));
> mut.put(new Text("login"), new Text("joe"), cv, new Value("joe".getBytes()));
> mut.put(new Text("DisplayName"), new Text("jd"), cv, new Value("jd".getBytes()));
> write(...)
>
> How would I execute a keyword search for "joe" in an attempt to pull back
> both records where Joe is the value for Login for one record while "joe" is
> a value for Name in another?
>
> The example in the Table Design page shows the search based on the row id.
> From my understanding if I provide the rowId, it will limit the search to
> that row.  But the example on that page is essentially just loading a
> specific row based on a rowid, not a keyword search.
>
> Thanks for the reply.  I hope my explanation of what I am attempting to do
> is making sense.
>
> Joe
>
> On Wed, Nov 23, 2011 at 1:55 PM, Aaron Cordova <aaron@cordovas.org> wrote:
>>
>> Joe,
>>
>> If you haven't already, check out the Table Design section of the Manual
>>
>>
>>  http://incubator.apache.org/accumulo/user_manual_1.3-incubating/Table_Design.html
>>
>> specifically, the subsection titled 'Indexing'. If you have read this, let
>> us know and we can clarify.
>>
>> Aaron
>>
>>
>> On Nov 23, 2011, at 1:46 PM, Jason Rutherglen wrote:
>>
>> > The most efficient system would be to implement a secondary [inverted]
>> > index on the Accumulo data.
>> >
>> > Maybe there is a Coprocessor-like API that would allow this type of
>> > functionality to be implemented?
>> >
>> > On Wed, Nov 23, 2011 at 1:12 PM, Joey Daughtery
>> > <jdaughtery@t-sciences.com> wrote:
>> >> All
>> >> I am new to Accumulo.  I have figured out how to store the data, load all
>> >> based on scanning with new Range(), and load a specific row based on new
>> >> Range(id).  However, if I want to locate a row that has a specific value, I
>> >> am not sure how to approach this programmatically.  Can someone give me some
>> >> insight on how to do such a scan?
>> >>
>> >> Also, I have seen several examples of how to populate the Mutation object.
>> >> Specifically, I see:
>> >> mut.put(new Text("column"), new Text("NAME"), timestamp, new Value("John".getBytes()));
>> >>
>> >> OR
>> >> mut.put(new Text("NAME"), new Text("John"), timestamp, new Value("John".getBytes()));
>> >>
>> >> Could someone indicate which is the correct way to store the data or
>> >> indicate why one would use one approach over the other?
>> >>
>> >> Thanks
>> >>
>> >> Joe
>>
>
>
>
