incubator-accumulo-user mailing list archives

From Aaron Cordova <aa...@cordovas.org>
Subject Re: Scan for keyword
Date Wed, 23 Nov 2011 20:08:33 GMT
Joe,

	What you're talking about is pretty common. In fact, it's so common that there should probably
be an example of it in the Accumulo-examples project. Doing it requires building
another table as a secondary index, as Jason mentioned. Accumulo doesn't have any special
structures just for indexes; an index is just another table. Here's how you might go about it:

	Assuming you use some unique identifier for your row IDs, your table might look something like
this:

rowID	col fam		col qual	value
000	displayname			joey
000	login				jd
000	name				joe
001	displayname			jd
001	login				joe
001	name				joey


	I would just leave the col qual blank. Then you could build a second table as an index that
looks like this:

rowID	col fam		col qual	value
jd	displayname	001
jd	login		000
joe	login		001
joe	name		000
joey	displayname	000
joey	name		001


	To build this table, you can simply insert the inverted Mutations into the index table at
the same time you're inserting records into your first table.
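
	As a rough illustration of what "inverted" means here, the sketch below models one entry as a plain String array of {rowID, col fam, col qual, value} and swaps the pieces around. It uses no Accumulo classes; the class and method names (InvertedEntrySketch, invert, writesFor) are made up for this example, and a real client would build two Mutation objects instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory model only: each String[] stands in for one table entry as
// {rowID, column family, column qualifier, value}. No Accumulo API is used.
class InvertedEntrySketch {

    // Invert a primary-table entry: the value becomes the index rowID, the
    // field name stays the column family, and the original rowID becomes the
    // column qualifier. The index value is left empty.
    static String[] invert(String[] primary) {
        return new String[] { primary[3], primary[1], primary[0], "" };
    }

    // Produce both writes for one field: the first for the primary table,
    // the second for the index table.
    static List<String[]> writesFor(String rowId, String field, String value) {
        String[] primary = { rowId, field, "", value };
        List<String[]> writes = new ArrayList<String[]>();
        writes.add(primary);
        writes.add(invert(primary));
        return writes;
    }

    public static void main(String[] args) {
        for (String[] entry : writesFor("000", "name", "joe")) {
            System.out.println(Arrays.toString(entry));
        }
    }
}
```

	Emitting both writes from the same place keeps the two tables in step for inserts. Changing or deleting a value would also need the old index entry removed, which this sketch leaves out.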

	To query for records in which "joe" appears in any field, you simply scan the entire row
identified by "joe" in the index and get all the fields in all records where "joe" appears,
thus:

scanner.setRange(new Range("joe"));

	To get records where "joe" appears in a specific field, say the name field, alter your scan
to include a more specific range:

scanner.setRange(new Range(
	new Key(new Text("joe"), new Text("name"), new Text("")),
	new Key(new Text("joe"), new Text("name\0"), new Text(""))));


	That range spans (joe, name) to (joe, name\0), so it covers every column qualifier under the
name column family and stops before the next column family.
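
	To see why appending \0 works as an upper bound, here is a small sketch that keeps the index entries above in a sorted set and slices it the same way. It is an in-memory model, not Accumulo code: the names (FamilyRangeSketch, KEY_ORDER, rowIdsFor) are invented for the example, Java string order stands in for Accumulo's byte order, and the end key is treated as exclusive.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// In-memory model only: each String[] is an index key {term, field, rowID},
// sorted the way a table sorts (row, column family, column qualifier).
class FamilyRangeSketch {

    // Compare keys component by component: row first, then family, then qualifier.
    static final Comparator<String[]> KEY_ORDER = new Comparator<String[]>() {
        public int compare(String[] a, String[] b) {
            for (int i = 0; i < 3; i++) {
                int c = a[i].compareTo(b[i]);
                if (c != 0) {
                    return c;
                }
            }
            return 0;
        }
    };

    // The six index entries from the example table in this message.
    static TreeSet<String[]> sampleIndex() {
        TreeSet<String[]> index = new TreeSet<String[]>(KEY_ORDER);
        index.add(new String[] { "jd", "displayname", "001" });
        index.add(new String[] { "jd", "login", "000" });
        index.add(new String[] { "joe", "login", "001" });
        index.add(new String[] { "joe", "name", "000" });
        index.add(new String[] { "joey", "displayname", "000" });
        index.add(new String[] { "joey", "name", "001" });
        return index;
    }

    // Collect rowIDs where `term` appears in `field`. field + "\0" is the
    // smallest string that sorts after field, so the slice holds exactly
    // the qualifiers under that one family.
    static List<String> rowIdsFor(TreeSet<String[]> index, String term, String field) {
        String[] start = { term, field, "" };
        String[] end = { term, field + "\0", "" };
        List<String> rowIds = new ArrayList<String>();
        for (String[] key : index.subSet(start, true, end, false)) {
            rowIds.add(key[2]);
        }
        return rowIds;
    }

    public static void main(String[] args) {
        System.out.println(rowIdsFor(sampleIndex(), "joe", "name"));
    }
}
```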

	You can then pull out the column qualifiers from the index to get the rowIDs. 

	If you want to look up values from each of those rows, you could then put them in a List and
pass them to a BatchScanner. There is code for this in the Indexing subsection of the Table
Design section of the manual:

// assumes conn is an existing Connector and auths holds the caller's Authorizations
Text term = new Text("mySearchTerm");

HashSet<Range> matchingRows = new HashSet<Range>();

Scanner indexScanner = conn.createScanner("index", auths);
indexScanner.setRange(new Range(term, term));

// we retrieve the matching rowIDs and create a set of ranges;
// in the index layout above, the rowID is stored in the column qualifier
for(Entry<Key,Value> entry : indexScanner)
	matchingRows.add(new Range(entry.getKey().getColumnQualifier()));

// now we pass the set of rowIDs to the batch scanner to retrieve them
BatchScanner bscan = conn.createBatchScanner("table", auths, 10);

bscan.setRanges(matchingRows);
bscan.fetchColumnFamily(new Text("attributes"));

for(Entry<Key,Value> entry : bscan)
	System.out.println(entry.getValue());

bscan.close();


	This whole process is more complicated than I'd like it to be, but it works pretty well and
people have built huge tables and indexes this way. You can get very fancy with what and how
you choose to index.
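
	If it helps to try the idea without a cluster, the two-step lookup can be modelled with plain sorted maps. This is only a sketch under that assumption: the class and method names (KeywordLookupSketch, put, matchingRowIds, fetch, sample) are invented here, and nothing in it is Accumulo API. It loads the two example records, finds every rowID indexed under a term, then fetches those records.

```java
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory model only: sorted maps stand in for the two tables.
class KeywordLookupSketch {
    // Primary table: rowID -> (field -> value).
    private final TreeMap<String, TreeMap<String, String>> primary =
            new TreeMap<String, TreeMap<String, String>>();
    // Index table: term -> (field -> rowIDs).
    private final TreeMap<String, TreeMap<String, TreeSet<String>>> index =
            new TreeMap<String, TreeMap<String, TreeSet<String>>>();

    // Write one field to the primary table and its inverted entry to the index.
    void put(String rowId, String field, String value) {
        TreeMap<String, String> record = primary.get(rowId);
        if (record == null) {
            record = new TreeMap<String, String>();
            primary.put(rowId, record);
        }
        record.put(field, value);

        TreeMap<String, TreeSet<String>> fields = index.get(value);
        if (fields == null) {
            fields = new TreeMap<String, TreeSet<String>>();
            index.put(value, fields);
        }
        TreeSet<String> rowIds = fields.get(field);
        if (rowIds == null) {
            rowIds = new TreeSet<String>();
            fields.put(field, rowIds);
        }
        rowIds.add(rowId);
    }

    // Step 1: read the whole index row for the term, across all fields.
    TreeSet<String> matchingRowIds(String term) {
        TreeSet<String> result = new TreeSet<String>();
        TreeMap<String, TreeSet<String>> fields = index.get(term);
        if (fields != null) {
            for (TreeSet<String> ids : fields.values()) {
                result.addAll(ids);
            }
        }
        return result;
    }

    // Step 2: fetch the full records for those rowIDs from the primary table.
    Map<String, Map<String, String>> fetch(Collection<String> rowIds) {
        Map<String, Map<String, String>> records = new TreeMap<String, Map<String, String>>();
        for (String id : rowIds) {
            Map<String, String> record = primary.get(id);
            if (record != null) {
                records.put(id, record);
            }
        }
        return records;
    }

    // The two example records from this message.
    static KeywordLookupSketch sample() {
        KeywordLookupSketch db = new KeywordLookupSketch();
        db.put("000", "displayname", "joey");
        db.put("000", "login", "jd");
        db.put("000", "name", "joe");
        db.put("001", "displayname", "jd");
        db.put("001", "login", "joe");
        db.put("001", "name", "joey");
        return db;
    }

    public static void main(String[] args) {
        KeywordLookupSketch db = sample();
        System.out.println(db.fetch(db.matchingRowIds("joe")));
    }
}
```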

	Let us know how this goes for you.

Aaron


On Nov 23, 2011, at 2:35 PM, Joey Daughtery wrote:

> Aaron
> Thanks for the reply.  I was only able to get data into Accumulo after reviewing the page you provided.
> 
> Let's say, for example, that I am storing Name, login, and displayName as the column families, and that I have inserted Joe, jd, joey as one record and joey, joe, jd as the second record.
> 
> mut.put(new Text("Name"), new Text("joe"), cv, new Value("joe"));
> mut.put(new Text("login"), new Text("jd"), cv, new Value("jd"));
> mut.put(new Text("DisplayName"), new Text("joey"), cv, new Value("joey"));
> write(...)
> 
> mut.put(new Text("Name"), new Text("joey"), cv, new Value("joey"));
> mut.put(new Text("login"), new Text("joe"), cv, new Value("joe"));
> mut.put(new Text("DisplayName"), new Text("jd"), cv, new Value("jd"));
> write(...)
> 
> How would I execute a keyword search for "joe" that pulls back both records, where "joe" is the value for login in one record and the value for Name in the other?
> 
> The example in the Table Design page shows the search based on the row id.  From my understanding, if I provide the rowId, it will limit the search to that row.  But the example on that page is essentially just loading a specific row based on a rowid, not a keyword search.
> 
> Thanks for the reply.  I hope my explanation of what I am attempting to do is making sense.
> 
> Joe
> 
> On Wed, Nov 23, 2011 at 1:55 PM, Aaron Cordova <aaron@cordovas.org> wrote:
> Joe,
> 
> If you haven't already, check out the Table Design section of the Manual
> 
>        http://incubator.apache.org/accumulo/user_manual_1.3-incubating/Table_Design.html
> 
> specifically, the subsection titled 'Indexing'. If you have read this, let us know and we can clarify.
> 
> Aaron
> 
> 
> On Nov 23, 2011, at 1:46 PM, Jason Rutherglen wrote:
> 
> > The most efficient system would be to implement a secondary [inverted]
> > index on the Accumulo data.
> >
> > Maybe there is a Coprocessor-like API that would allow this type of
> > functionality to be implemented?
> >
> > On Wed, Nov 23, 2011 at 1:12 PM, Joey Daughtery
> > <jdaughtery@t-sciences.com> wrote:
> >> All
> >> I am new to Accumulo.  I have figured out how to store the data, load all
> >> based on scanning with new Range(), and loading a specific row based on new
> >> Range(id).  However, if I want to locate a row that has a specific value, I
> >> am not sure how to approach this programmatically.  Can someone give me some
> >> insight on how to do such a scan?
> >>
> >> Also, I have seen several examples of how to populate the Mutation object.
> >> Specifically, I see:
> >> mut.put(new Text("column"), new Text("NAME"), timestamp, new Value("John"));
> >>
> >> OR
> >> mut.put(new Text("NAME"), new Text("John"), timestamp, new Value("John"));
> >>
> >> Could someone indicate which is the correct way to store the data or
> >> indicate why one would use one approach over the other?
> >>
> >> Thanks
> >>
> >> Joe
> 
> 

