From: "THORMAN, ROBERT D"
To: dev@accumulo.apache.org
Subject: Re: Search
Date: Thu, 24 Jul 2014 21:06:00 +0000
Yes, you have missed my original request. I need a fast way (i.e., pre-indexed) to perform lexical searches on row values without using a regex-based iterator. I also do not want to duplicate data from the cluster into a document-based store, as is typically required by packages like Apache Lucene.

v/r
Bob Thorman
Principal Big Data Engineer
AT&T Big Data CoE
2900 W. Plano Parkway
Plano, TX 75075
972-658-1714

On 7/24/14, 11:37 AM, "Nehal Mehta" wrote:

> If we have two streams, we would just store data in Accumulo and use it as the backend. What we are/were trying to implement was secure search: if a user does not have rights to search a cell, that user can see the other listings but not the one that is inaccessible. By doing so we would add a lot more value.
>
> Am I missing something?
>
> On Thu, Jul 24, 2014 at 12:17 PM, THORMAN, ROBERT D wrote:
>
>> Search the terms (words, phrases, sub-strings, combinations) of the row values. Lucene is an Apache project that does document indexing on terms.
>>
>> v/r
>> Bob Thorman
>> Principal Big Data Engineer
>> AT&T Big Data CoE
>> 2900 W. Plano Parkway
>> Plano, TX 75075
>> 972-658-1714
>>
>> On 7/24/14, 9:52 AM, "Kepner, Jeremy - 0553 - MITLL" wrote:
>>
>>> What is meant by lexical search? Lucene style?
>>>
>>> http://www.lucenetutorial.com/lucene-query-syntax.html
>>>
>>> If so, these searches could be prioritized (not all are particularly useful), and it shouldn't be too hard to come up with recommended Accumulo approaches for the most important lexical searches.
>>>
>>> On Jul 24, 2014, at 10:44 AM, Donald Miner wrote:
>>>
>>>> One problem I ran into when thinking about this is throughput. In Accumulo, we talk about tens or hundreds of thousands or millions of records per second. A lot of these search solutions talk about hundreds or thousands of documents per second.
>>>>
>>>> The fact that Accumulo is able to outpace just about anything led me to think that some sort of microbatch solution might be the best choice. If you wait for your data to be indexed before moving on to the next Accumulo insert, you can start lagging behind. Basically, you are crippling your ingest throughput by making it the slower of the two systems.
>>>>
>>>> It seems like a more microbatch (or batch) approach might be worthwhile: what you are trading is your text index lagging behind, but you keep your ingest throughput in Accumulo. I think Apache Blur does batch parallel indexing, which is why I was looking at it for this.
>>>>
>>>> On Thu, Jul 24, 2014 at 10:27 AM, Roshan Punnoose wrote:
>>>>
>>>>> Yeah, I think David's solution is the best, though I like the idea of having a server-side Constraint or hook that puts the updates into the queue.
>>>>>
>>>>> The Cassandra work I had seen actually tightly couples a Cassandra node to a Solr shard, so all the data that exists on that specific node also exists on that specific Solr shard. It would be pretty cool to do the same thing with a tablet server => local Solr shard.
>>>>>
>>>>> On Wed, Jul 23, 2014 at 6:09 PM, David Medinets wrote:
>>>>>
>>>>>> Ingest to a queue. Have two processes subscribe to the queue, one pushing into Accumulo and the other pushing into SolrCloud. Why tightly couple the capabilities?
>>>>>>
>>>>>> On Wed, Jul 23, 2014 at 4:39 PM, Roshan Punnoose wrote:
>>>>>>
>>>>>>> Is there a way to tie into the write process in Accumulo? Maybe just use an Iterator that works at compaction time to send data to Blur/Solr? I have seen something similar in Cassandra, a data hook to save data in Solr.
>>>>>>>
>>>>>>> On Fri, Jul 18, 2014 at 6:46 PM, Nehal Mehta wrote:
>>>>>>>
>>>>>>>> We were trying to do so, but adding visibility while adding/searching documents needs a lot more thinking. Adding visibility to the core search engine requires changes to the algorithm, and that does not make it very scalable. Integration, aside from granular visibility, is very doable, and we had taken inspiration from Solandra.
>>>>>>>>
>>>>>>>> Obviously, if we can get it done it adds a lot of value. I believe the Sqrrl people have already done it; are they thinking of open-sourcing it anytime in the future?
>>>>>>>>
>>>>>>>> On Thu, Jul 17, 2014 at 3:09 PM, Donald Miner wrote:
>>>>>>>>
>>>>>>>>> We briefly toyed with Blur on Accumulo but didn't get too far, just because it was OBE. I think that would be cool.
>>>>>>>>>
>>>>>>>>> On Jul 17, 2014, at 3:06 PM, Josh Elser wrote:
>>>>>>>>>
>>>>>>>>>> It's definitely possible. I remember hearing about someone doing Lucene on top of Accumulo once, but I don't recall seeing a nice package with a bow on top.
>>>>>>>>>>
>>>>>>>>>> On 7/17/14, 2:53 PM, THORMAN, ROBERT D wrote:
>>>>>>>>>>
>>>>>>>>>>> What lexical search package (like Lucene/Solr) has anyone put on top of Accumulo? Is this possible, or does everyone just index log files and documents?
>>>>>>>>>>>
>>>>>>>>>>> v/r
>>>>>>>>>>> Bob Thorman
>>>>>>>>>>> Principal Big Data Engineer
>>>>>>>>>>> AT&T Big Data CoE
>>>>>>>>>>> 2900 W. Plano Parkway
>>>>>>>>>>> Plano, TX 75075
>>>>>>>>>>> 972-658-1714
>>>>
>>>> --
>>>> Donald Miner
>>>> Chief Technology Officer
>>>> ClearEdge IT Solutions, LLC
>>>> Cell: 443 799 7807
>>>> www.clearedgeit.com
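
[Editor's note] For readers landing on this thread: the kind of pre-indexed lexical search Bob asks for is usually built inside Accumulo as a sharded term-index table queried with the IntersectingIterator (the pattern in Accumulo's shard example). The sketch below is only an illustration of that pattern, not anything from this thread: the table name "termIndex", its layout (row = shard id, column family = term, column qualifier = record id), the instance, credentials, and query terms are all hypothetical.

import java.util.Collections;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.IntersectingIterator;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class TermIndexQuery {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection details; substitute your instance, zookeepers, and credentials.
    Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181")
        .getConnector("reader", new PasswordToken("secret"));

    // "termIndex" is a hypothetical sharded index table populated at ingest time:
    //   row = shard id, column family = term, column qualifier = record id,
    //   cell visibility = same expression as the source cell.
    BatchScanner bs = conn.createBatchScanner("termIndex", new Authorizations("public"), 8);
    bs.setRanges(Collections.singleton(new Range()));  // query every shard in parallel

    // IntersectingIterator returns only record ids that carry ALL of the given terms.
    IteratorSetting ii = new IteratorSetting(20, "ii", IntersectingIterator.class);
    IntersectingIterator.setColumnFamilies(ii, new Text[] {new Text("plano"), new Text("engineer")});
    bs.addScanIterator(ii);

    for (Map.Entry<Key, Value> entry : bs) {
      // Column qualifier holds a matching record id; fetch the full record from the primary table as needed.
      System.out.println(entry.getKey().getColumnQualifier());
    }
    bs.close();
  }
}

Because the index entries can carry the same ColumnVisibility as the cells they point to, scans only ever return record ids the caller is authorized to see, which is the cell-level "secure search" behaviour Nehal describes, and nothing is copied out of the cluster into a separate Lucene/Solr document store.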
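
[Editor's note] On the decoupling David describes (a queue in front, with Accumulo and the text indexer as independent subscribers), below is a minimal single-JVM sketch under stated assumptions: the BlockingQueues stand in for whatever broker is actually used (Kafka, JMS, etc.), the table name "records", column family, batch size, and the indexBatch() call are placeholders rather than a real SolrCloud/Blur API. The only point it illustrates is Donald's throughput argument: the Accumulo writer and the indexer consume the stream independently, so indexing lag never throttles Accumulo ingest.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;

public class DecoupledIngest {
  // Minimal record: id, payload, and the visibility expression applied in both sinks.
  static class Record {
    final String id, payload, visibility;
    Record(String id, String payload, String visibility) {
      this.id = id; this.payload = payload; this.visibility = visibility;
    }
  }

  // Stand-ins for a real broker; each sink gets its own copy of the stream.
  static final BlockingQueue<Record> accumuloQueue = new LinkedBlockingQueue<>();
  static final BlockingQueue<Record> indexQueue = new LinkedBlockingQueue<>();

  static void publish(Record r) {
    accumuloQueue.add(r);  // with a real broker this fan-out is just two subscriptions
    indexQueue.add(r);
  }

  // Consumer 1: push straight into Accumulo at full speed.
  static void accumuloConsumer(Connector conn) throws Exception {
    BatchWriter writer = conn.createBatchWriter("records", new BatchWriterConfig());
    while (true) {
      Record r = accumuloQueue.take();
      Mutation m = new Mutation(r.id);
      m.put("data", "payload", new ColumnVisibility(r.visibility), new Value(r.payload.getBytes()));
      writer.addMutation(m);
    }
  }

  // Consumer 2: drain in microbatches; if indexing falls behind, only the index lags.
  static void indexConsumer() throws Exception {
    List<Record> batch = new ArrayList<>();
    while (true) {
      Record r = indexQueue.poll(5, TimeUnit.SECONDS);
      if (r != null) batch.add(r);
      if (batch.size() >= 1000 || (r == null && !batch.isEmpty())) {
        indexBatch(batch);  // placeholder for a SolrCloud/Blur bulk add + commit
        batch.clear();
      }
    }
  }

  static void indexBatch(List<Record> batch) { /* send to the search engine of choice */ }
}

In a real deployment each sink would be its own consumer group or durable subscription against the broker rather than a hand-rolled fan-out, and the indexer could equally be the batch/parallel Blur job Donald mentions.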