From: "THORMAN, ROBERT D"
To: dev@accumulo.apache.org
Subject: Re: Search
Date: Thu, 24 Jul 2014 21:06:00 +0000
Yes, you have missed my original request. I need a fast way (i.e., pre-indexed) to perform lexical searches on row values without using a regex-based iterator. I also do not want to duplicate data from the cluster into a document-based store, as is typically required by packages like Apache Lucene.

v/r
Bob Thorman
Principal Big Data Engineer
AT&T Big Data CoE
2900 W. Plano Parkway
Plano, TX 75075
972-658-1714

On 7/24/14, 11:37 AM, "Nehal Mehta" wrote:

> If we have two streams, we would just store data in Accumulo and use it as the backend. What we are/were trying to implement was secure search: if a user does not have rights to search a cell, that user can see the other listings but not the one that is inaccessible. By doing so we would add a lot more value.
>
> Am I missing something?
>
> On Thu, Jul 24, 2014 at 12:17 PM, THORMAN, ROBERT D wrote:
>
>> Search the terms (words, phrases, sub-strings, combinations) of the row values. Lucene is an Apache project that does document indexing on terms.
>>
>> v/r
>> Bob Thorman
>> Principal Big Data Engineer
>> AT&T Big Data CoE
>> 2900 W. Plano Parkway
>> Plano, TX 75075
>> 972-658-1714
>>
>> On 7/24/14, 9:52 AM, "Kepner, Jeremy - 0553 - MITLL" wrote:
>>
>>> What is meant by lexical search? Lucene style?
>>>
>>> http://www.lucenetutorial.com/lucene-query-syntax.html
>>>
>>> If so, these searches could be prioritized (not all are particularly useful), and it shouldn't be too hard to come up with recommended Accumulo approaches for the most important lexical searches.
>>>
>>> On Jul 24, 2014, at 10:44 AM, Donald Miner wrote:
>>>
>>>> One problem I ran into when thinking about this is throughput. In Accumulo, we talk about tens or hundreds of thousands or millions of records per second. A lot of these search solutions talk about hundreds or thousands of documents per second.
>>>>
>>>> The fact that Accumulo is able to outpace just about anything led me to think that some sort of microbatch solution might be the best choice. If you wait for your data to be indexed before moving on to the next Accumulo insert, you can start lagging behind. Basically, you are crippling your ingest throughput by making it the slower of the two systems.
>>>>
>>>> It seems like a more microbatch (or batch) approach might be worthwhile: what you are trading is your text index lagging behind, but you keep your ingest throughput in Accumulo. I think Apache Blur does batch parallel indexing, which is why I was looking at it for this.
>>>>
>>>> On Thu, Jul 24, 2014 at 10:27 AM, Roshan Punnoose wrote:
>>>>
>>>>> Yeah, I think David's solution is the best, though I like the idea of having a server-side Constraint or hook that puts the updates into the queue.
>>>>>
>>>>> The Cassandra work I had seen actually tightly couples a Cassandra node to a Solr shard, so all the data that exists on that specific node also exists on that specific Solr shard. It would be pretty cool to do the same thing with a tablet server => local Solr shard.
>>>>>
>>>>> On Wed, Jul 23, 2014 at 6:09 PM, David Medinets wrote:
>>>>>
>>>>>> Ingest to a queue. Have two processes subscribe to the queue, one pushing into Accumulo and the other pushing into SolrCloud. Why tightly couple the capabilities?
>>>>>>
>>>>>> On Wed, Jul 23, 2014 at 4:39 PM, Roshan Punnoose wrote:
>>>>>>
>>>>>>> Is there a way to tie into the write process in Accumulo? Maybe just use an Iterator that works at compaction time to send data to Blur/Solr? I have seen something similar in Cassandra, a data hook to save data in Solr.
>>>>>>>
>>>>>>> On Fri, Jul 18, 2014 at 6:46 PM, Nehal Mehta wrote:
>>>>>>>
>>>>>>>> We were trying to do so, but adding visibility while adding/searching documents needs a lot more thinking. Adding visibility to the core search engine requires changes to the algorithm, and that does not make it very scalable. Integration, aside from granular visibility, is very doable, and we had taken inspiration from Solandra.
>>>>>>>>
>>>>>>>> Obviously, if we can get it done it adds a lot of value. I believe the Sqrrl people have already done it; are they thinking of open-sourcing it anytime in the future?
>>>>>>>>
>>>>>>>> On Thu, Jul 17, 2014 at 3:09 PM, Donald Miner wrote:
>>>>>>>>
>>>>>>>>> We briefly toyed with Blur on Accumulo but didn't get too far, just because it was OBE. I think that would be cool.
>>>>>>>>>
>>>>>>>>> On Jul 17, 2014, at 3:06 PM, Josh Elser wrote:
>>>>>>>>>
>>>>>>>>>> It's definitely possible. I remember hearing about someone doing Lucene on top of Accumulo once, but I don't recall seeing a nice package with a bow on top.
>>>>>>>>>>
>>>>>>>>>> On 7/17/14, 2:53 PM, THORMAN, ROBERT D wrote:
>>>>>>>>>>
>>>>>>>>>>> What lexical search package (like Lucene/Solr) has anyone put on top of Accumulo? Is this possible, or does everyone just index log files and documents?
>>>>>>>>>>>
>>>>>>>>>>> v/r
>>>>>>>>>>> Bob Thorman
>>>>>>>>>>> Principal Big Data Engineer
>>>>>>>>>>> AT&T Big Data CoE
>>>>>>>>>>> 2900 W. Plano Parkway
>>>>>>>>>>> Plano, TX 75075
>>>>>>>>>>> 972-658-1714
>>>>
>>>> --
>>>> Donald Miner
>>>> Chief Technology Officer
>>>> ClearEdge IT Solutions, LLC
>>>> Cell: 443 799 7807
>>>> www.clearedgeit.com
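
[Editor's note] For readers landing on this thread: the kind of pre-indexed lexical search Bob asks for is usually built inside Accumulo as a sharded term-index table queried with the IntersectingIterator (the pattern in Accumulo's shard example). The sketch below is only an illustration of that pattern, not anything from this thread: the table name "termIndex", its layout (row = shard id, column family = term, column qualifier = record id), the instance, credentials, and query terms are all hypothetical.

import java.util.Collections;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.IntersectingIterator;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class TermIndexQuery {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection details; substitute your instance, zookeepers, and credentials.
    Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181")
        .getConnector("reader", new PasswordToken("secret"));

    // "termIndex" is a hypothetical sharded index table populated at ingest time:
    //   row = shard id, column family = term, column qualifier = record id,
    //   cell visibility = same expression as the source cell.
    BatchScanner bs = conn.createBatchScanner("termIndex", new Authorizations("public"), 8);
    bs.setRanges(Collections.singleton(new Range()));  // query every shard in parallel

    // IntersectingIterator returns only record ids that carry ALL of the given terms.
    IteratorSetting ii = new IteratorSetting(20, "ii", IntersectingIterator.class);
    IntersectingIterator.setColumnFamilies(ii, new Text[] {new Text("plano"), new Text("engineer")});
    bs.addScanIterator(ii);

    for (Map.Entry<Key, Value> entry : bs) {
      // Column qualifier holds a matching record id; fetch the full record from the primary table as needed.
      System.out.println(entry.getKey().getColumnQualifier());
    }
    bs.close();
  }
}

Because the index entries can carry the same ColumnVisibility as the cells they point to, scans only ever return record ids the caller is authorized to see, which is the cell-level "secure search" behaviour Nehal describes, and nothing is copied out of the cluster into a separate Lucene/Solr document store.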
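
[Editor's note] On the decoupling David describes (a queue in front, with Accumulo and the text indexer as independent subscribers), below is a minimal single-JVM sketch under stated assumptions: the BlockingQueues stand in for whatever broker is actually used (Kafka, JMS, etc.), the table name "records", column family, batch size, and the indexBatch() call are placeholders rather than a real SolrCloud/Blur API. The only point it illustrates is Donald's throughput argument: the Accumulo writer and the indexer consume the stream independently, so indexing lag never throttles Accumulo ingest.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;

public class DecoupledIngest {
  // Minimal record: id, payload, and the visibility expression applied in both sinks.
  static class Record {
    final String id, payload, visibility;
    Record(String id, String payload, String visibility) {
      this.id = id; this.payload = payload; this.visibility = visibility;
    }
  }

  // Stand-ins for a real broker; each sink gets its own copy of the stream.
  static final BlockingQueue<Record> accumuloQueue = new LinkedBlockingQueue<>();
  static final BlockingQueue<Record> indexQueue = new LinkedBlockingQueue<>();

  static void publish(Record r) {
    accumuloQueue.add(r);  // with a real broker this fan-out is just two subscriptions
    indexQueue.add(r);
  }

  // Consumer 1: push straight into Accumulo at full speed.
  static void accumuloConsumer(Connector conn) throws Exception {
    BatchWriter writer = conn.createBatchWriter("records", new BatchWriterConfig());
    while (true) {
      Record r = accumuloQueue.take();
      Mutation m = new Mutation(r.id);
      m.put("data", "payload", new ColumnVisibility(r.visibility), new Value(r.payload.getBytes()));
      writer.addMutation(m);
    }
  }

  // Consumer 2: drain in microbatches; if indexing falls behind, only the index lags.
  static void indexConsumer() throws Exception {
    List<Record> batch = new ArrayList<>();
    while (true) {
      Record r = indexQueue.poll(5, TimeUnit.SECONDS);
      if (r != null) batch.add(r);
      if (batch.size() >= 1000 || (r == null && !batch.isEmpty())) {
        indexBatch(batch);  // placeholder for a SolrCloud/Blur bulk add + commit
        batch.clear();
      }
    }
  }

  static void indexBatch(List<Record> batch) { /* send to the search engine of choice */ }
}

In a real deployment each sink would be its own consumer group or durable subscription against the broker rather than a hand-rolled fan-out, and the indexer could equally be the batch/parallel Blur job Donald mentions.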