incubator-blur-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: Facets
Date Fri, 25 Oct 2013 18:30:55 GMT
On Wed, Oct 23, 2013 at 12:40 PM, Colton McInroy <colton@dosarrest.com> wrote:

> I got verification from our board to dedicate some time to working on
> this, so I will start seeing what I can do to figure it out. If you have
> any suggestions on what code I should be looking at for this, please let me
> know. I haven't been programming in java for all that long, or even lucene,
> but I've managed to figure out a lot, so hopefully I can get this done, or
> at the very least help get it done.
>
> I actually taught myself Java for the purposes of this log project. Well,
> this is now the continuation of what I already completed, but this project
> was why I opted to start learning java. I programmed in java many years ago
> when I was young, but at the time did not like it so I went onto other
> programming languages. Finally when I ran into needing a solution to deal
> with the massive amount of log data, I tried various things, and decided to go
> with learning java and lucene. After building a system which handled our
> needs, as data levels went up, I started to notice problems with scaling.
> After running some tests, sure enough, going with a distributed setup is
> the only way to go, which is what has brought me to this rebuild using blur
> in place of lucene.
>
>
> I may be wrong here, but one major thing I think may come into play with
> using facets is that in my experience with them, facets are essentially
> another index that accompanies the main index. Considering that for the
> main index it is being written to the fs in shards, the same thing would
> probably need to be done with facet data having a separate shard count, or
> perhaps the same shard count for the number of facet indexes. What I
> actually did in the past with my system was make the facet data a sub
> directory off of the index directory. This could be done for index shards.
> For each index shard, have another folder within that index that contains
> the facet index for that shard. This would keep each shard and its facets
> contained together in the same shard folder.
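
The per-shard layout described above could be sketched roughly as follows. This is only an illustration: ShardFacetLayout is a hypothetical helper, and the "shard-%08d" directory naming, the "facets" folder name, and the use of local java.nio paths (rather than whatever filesystem Blur actually writes to) are assumptions made for the example, not Blur's real layout.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: names and layout are assumptions, not Blur's own.
class ShardFacetLayout {

    // Directory holding the main index for one shard of a table.
    static Path shardDir(Path tableDir, int shard) {
        return tableDir.resolve(String.format("shard-%08d", shard));
    }

    // Facet index kept in a sub directory of the shard it belongs to,
    // so a shard and its facet data always move together.
    static Path facetDir(Path tableDir, int shard) {
        return shardDir(tableDir, shard).resolve("facets");
    }

    public static void main(String[] args) {
        Path table = Paths.get("blur", "tables", "logs");
        for (int shard = 0; shard < 3; shard++) {
            System.out.println(facetDir(table, shard));
        }
    }
}
```

With this arrangement the number of facet indexes always equals the shard count, which is one of the two options raised in the message.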


Very good.  Let's create an issue on what features we want the faceting to
have beyond what it currently does (which is very basic).  Then we can break
up the various features into sub tasks and proceed from there.

https://issues.apache.org/jira/browse/BLUR

Aaron


>
>
> Thanks,
> Colton McInroy
>
>  * Director of Security Engineering
>
>
> Phone
> (Toll Free)
> _US_    (888)-818-1344 Press 2
> _UK_    0-800-635-0551 Press 2
>
> My Extension    101
> 24/7 Support    support@dosarrest.com <mailto:support@dosarrest.com>
> Email   colton@dosarrest.com <mailto:colton@dosarrest.com>
> Website         http://www.dosarrest.com
>
> On 10/21/2013 10:56 PM, Colton McInroy wrote:
>
>>
>>
>> On 10/21/2013 5:40 PM, Aaron McCurry wrote:
>>
>>> On Mon, Oct 21, 2013 at 4:45 PM, Colton McInroy <colton@dosarrest.com> wrote:
>>>
>>>> Do you have any suggestions on how I should deal with needing this type of
>>>> information in the mean time?...
>>>>
>>>> Typically what I used facet data for was to generate graph data. Instead
>>>> of having to go through every match, group it by time, count them up
>>>> manually, etc, I would get facet data for timestamps. For instance, I
>>>> create a query which says "field1:value", then grab the
>>>> facets for the Date field and use the facet counts to plot a graph of
>>>> timestamp/matches.
>>>>
>>>> I was thinking of just going through all of the matches for now, which,
>>>> although probably not nearly as efficient as using lucene type facets,
>>>> would do the trick temporarily until proper facets are implemented.
>>>>
>>> Agreed, is the date field only a date?  Or does it contain timestamps as
>>> well?  What is the range of the dates?  Days?  Weeks?  Months?  Years?
>>> All of the above?
>>>
>> To the second... YYYYMMDDHHmmss
>>
>>>
>>> The reason I ask is basically, if you are looking at, let's say, a month's
>>> worth and you have a time scope on the date field of days, then that's
>>> only 30-31 facets that you will have to add manually to the query.
>>> Obviously as the time scope and range grow this will get a little too
>>> messy to want to deal with on the client side.  Also you can use the terms
>>> call to get the current terms in a field, so if you want to traverse the
>>> indexed values that can give you that info.
>>>
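
As a rough sketch of adding the 30-31 facets by hand, the snippet below builds one range query string per day of a month; each string could then be supplied as a facet query. DailyFacetQueries is a hypothetical helper, the field name "event.Date" is made up, and it assumes the Date terms are YYYYMMDDHHmmss strings that sort lexicographically so that a Lucene-style range query applies.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Blur or Lucene.
class DailyFacetQueries {

    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Returns one "field:[start TO end]" range query per day of the month,
    // covering 00:00:00 through 23:59:59 of that day.
    static List<String> forMonth(String field, YearMonth month) {
        List<String> queries = new ArrayList<>();
        for (int d = 1; d <= month.lengthOfMonth(); d++) {
            String day = month.atDay(d).format(DAY);
            queries.add(field + ":[" + day + "000000 TO " + day + "235959]");
        }
        return queries;
    }

    public static void main(String[] args) {
        // "event.Date" is an assumed column name used only for illustration.
        for (String q : forMonth("event.Date", YearMonth.of(2013, 10))) {
            System.out.println(q);
        }
    }
}
```

The list grows with the number of buckets, which is why this approach becomes unwieldy once the range is long or the buckets are minutes rather than days.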
>> Depends upon the timescale being queried. If the timescale is the past
>> hour, then it would be by minute, if it's over a month, then it would be by
>> hour. For lucene, I just get the facets, and post process them by shrinking
>> the timestamp value down to the level I want.... Such as if I wanted to
>> view hourly counts, I would loop through all of the facet results
>> condensing them down to hour values. Postprocessing the facet results
>> from lucene facets was by far a LOT quicker than going through all of the
>> actual results, which I am betting is probably the case with blur as well.
>> With lucene, facets were what I used the most when trying to present
>> information in GUIs because they make the most sense to people viewing
>> the data.
>>
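
The post-processing step described here can be sketched as a simple roll-up. This is a minimal illustration and not the code from the original system: it assumes the facet results are already available as a map from YYYYMMDDHHmmss terms to counts, and TimestampRollup is a hypothetical name.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene or Blur.
class TimestampRollup {

    // Prefix lengths of a YYYYMMDDHHmmss term for each bucket size.
    static final int BY_DAY = 8;
    static final int BY_HOUR = 10;
    static final int BY_MINUTE = 12;

    // Sums per-second counts into coarser buckets by truncating each
    // timestamp term to its first prefixLength characters.
    static Map<String, Long> rollUp(Map<String, Long> countsBySecond, int prefixLength) {
        Map<String, Long> buckets = new TreeMap<>();
        for (Map.Entry<String, Long> e : countsBySecond.entrySet()) {
            String bucket = e.getKey().substring(0, prefixLength);
            buckets.merge(bucket, e.getValue(), Long::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        Map<String, Long> perSecond = new TreeMap<>();
        perSecond.put("20131021174001", 3L);
        perSecond.put("20131021174059", 2L);
        perSecond.put("20131021180500", 7L);
        // Two hourly buckets: 2013102117 with 5 and 2013102118 with 7.
        System.out.println(rollUp(perSecond, BY_HOUR));
    }
}
```

The work is proportional to the number of distinct timestamp terms rather than the number of matching rows, which is where the speed-up over walking every result comes from.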
>>>
>>> Just trying to help get you what you need right now.
>>>
>>>
>>>> Currently the blur site lists facets as being something that works
>>>> here...
>>>>
>>>> http://incubator.apache.org/blur/how_it_works.html
>>>>
>>>>
>>>> But as this thread kinda pointed out, facets as faceted
>>>> classification describes them do not exist right now within apache blur. So
>>>> someone may want to change that page to say they are currently on the todo
>>>> list or something.
>>>>
>>>> http://en.wikipedia.org/wiki/Faceted_classification
>>>>
>>>>
>>>> A great example I use to show people what facets are is the following
>>>> site...
>>>>
>>>> http://www.fasttech.com/category/1499/consumer-electronics
>>>>
>>>>
>>>> On the left side, it is easy to see a breakdown of all the different
>>>> Fields/Values associated with the current search query. My intention is to
>>>> display facet data for all (or the important ones anyway) of the fields
>>>> associated with the current query along with a line graph showing the count
>>>> of all matching rows for each time interval. Then the query can be refined
>>>> more by querying a specific time range, or field.
>>>>
>>>> Is proper facet implementation something that has a somewhat high
>>>> priority and will hopefully be at least partially implemented within the
>>>> next couple of weeks/months? Or should I just work on processing all the
>>>> results myself for now? Also, I notice the default query matches is only
>>>> 10, and I see no way to specify unlimited. Can I specify -1 for unlimited or
>>>> something like that, or do I need to specify a really large number that
>>>> will always be higher than the number of actual results I am expecting...
>>>> like Long.MAX_VALUE or something?
>>>>
>>>
>>> I agree it is a priority, my top priority is getting 0.2.1 out the door.
>>> But if we can decide on the API changes that need to be made in the facet
>>> API we can begin on it in 0.3.0 at any point.  And once 0.2.1 is complete
>>> I will be turning my focus to 0.3.0, I hope to call for a vote for 0.2.1 in
>>> the next week.
>>>
>>> Ok, so for queries you can page through the results.  However the facet
>>> counts reflect the entire answer.  You can't ask for all the results back at
>>> once due to memory constraints within the system.  But you can set in
>>> the BlurQuery object the start and fetch (which is the number to fetch).
>>>
>>> http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_BlurQuery
>>>
>> Hmm... yea, when going through say 100,000,000+ rows to generate a graph,
>> it is no doubt going to take a long time re-querying in 1,000-result
>> intervals 100,000+ times. If that's for only 5 minutes of data,
>> it's a huge amount of processing to see general statistics of the data you
>> have in front of you.
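
To make the cost of paging concrete, here is a minimal sketch of a start/fetch loop. PagedScan and PageFetcher are hypothetical: PageFetcher only stands in for a real client call made with start and fetch set on the query, and the in-memory list in main is a stand-in for a table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; PageFetcher stands in for a real client call.
class PagedScan {

    // Assumed shape of "give me `fetch` rows starting at `start`".
    interface PageFetcher<T> {
        List<T> fetch(long start, int fetch);
    }

    // Visits every row one page at a time and returns how many requests
    // were needed. Stops at the first page shorter than pageSize.
    static <T> long scan(PageFetcher<T> fetcher, int pageSize, Consumer<T> visitor) {
        long start = 0;
        long requests = 0;
        while (true) {
            List<T> page = fetcher.fetch(start, pageSize);
            requests++;
            page.forEach(visitor);
            if (page.size() < pageSize) {
                return requests;
            }
            start += pageSize;
        }
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            rows.add(i);
        }
        long[] total = {0};
        long requests = PagedScan.<Integer>scan(
                (start, fetch) -> rows.subList(
                        (int) Math.min(start, rows.size()),
                        (int) Math.min(start + fetch, rows.size())),
                1000,
                row -> total[0]++);
        System.out.println(requests + " requests for " + total[0] + " rows");
    }
}
```

The request count grows linearly with the number of matches: 100,000,000 rows at 1,000 per page is the 100,000 round trips mentioned above, each of which also ships rows that are only going to be counted.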
>>
>> This is where facets became vital for me. I understand that right now
>> "facets" in blur are not really facets, they are instead additional queries
>> which get run. Not really sure why it was implemented that way, but
>> the lucene documentation (http://lucene.apache.org/core/4_3_0/)
>> links to wiki pages about faceted searches as well as a user guide
>> explaining what facets are, and the implementation in blur does not match what
>> everything else defines facets as.
>>
>> I'm not sure by whom or how facets came to be implemented in the current
>> manner, but it does not make sense at all or comply with any definition of
>> facets I have found. I find this to be a conflict, if blur advertises them
>> but does not really have them. Since there is no documentation about facets
>> really, other than it saying it's in the feature list, it took me a while
>> to discover this. For me in particular, this is vital. What use is indexing
>> massive amounts of information if you do not have very good visibility of
>> it?
>>
>> As I have mentioned, my use is for storing logged events. Let's say you
>> have events for sshd being stored in a table along with the fields Date,
>> LoginMethod, IP, User, Server, and Success. Say you have a LOT of servers being
>> monitored which have a lot of user login activity. In lucene I would do a
>> single query against any of those fields, or perhaps just start with
>> matching all records. Along with that query, I would get the facets for
>> those fields using Date to display a time graph of activity for the rows. I
>> would then display the top 5-10 facets for each field along with a subquery
>> that does just a facetquery to display another time graph of the Date
>> facets. With this you can instantly see 10 login failures within 100,000
>> successes, how many times each user has logged in and what methods were
>> used, etc. This is a simple example, but expand that out to all kinds of
>> other information and it's night and day visibility of data.
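
The "top 5-10 facets for each field" step might look roughly like this once the per-value counts for a field are in hand. It is a sketch under assumptions: TopFacetValues is a hypothetical helper, the counts arrive as a plain map, and the user names and numbers in main are invented sample data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or Blur.
class TopFacetValues {

    // Returns the n values with the highest counts, largest first.
    // Ties are broken alphabetically so the order is stable.
    static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.max(0, Math.min(n, entries.size()));
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        // Invented sample counts for the User field of the sshd example.
        Map<String, Long> users = new LinkedHashMap<>();
        users.put("guest", 10L);
        users.put("root", 60000L);
        users.put("deploy", 39990L);
        users.put("admin", 10L);
        for (Map.Entry<String, Long> e : top(users, 3)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```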
>>
>> When trying to view data of any kind in an effective manner, graphing
>> always helps, but to process every matching row is obviously inefficient. I
>> believe some of the other systems out there such as splunk do that, but
>> when I did my own work, I found that too slow and inefficient. Sure, it
>> works fine when viewing a small amount of data, but when we are talking
>> about big data, which is what Blur is designed for, and what I am working
>> with, it's just too much overhead. Using facets on date values to produce
>> time graphs of entries is almost instant, no matter how many rows/records
>> match.
>>
>> In splunk or other search systems, I would see events populated over time
>> in a graph along with the first page of data. The time graph continues to
>> fill over time showing a timeline of data. Depending on your data, this can
>> take a seriously long time. This is no doubt doing what you're suggesting
>> with the processing of data one page at a time, sending it to the browser
>> to parse into data stores that display graphs.
>> With facet results, I was able to display the historical timelines in the
>> same amount of time it took to do a single query along with the facet data.
>> There just is no match from what I have seen so far, for Lucene indexes
>> along with facet indexes, which is what got me so excited about blur. I
>> myself literally was in the design phase of writing my own implementation
>> of a distributed lucene index system when I decided to stop and check what
>> was out there before re-inventing the wheel. When I came across the blur
>> project, I found the feature list and looked at two things primarily which
>> got me into starting to work with the project. Those two things were "Fast
>> data ingestion" and "Facets". So far, data seems to be getting ingested pretty
>> quickly in my virtual box tests, which is good. I am going to be scaling up
>> soon once the new hardware requisition is finished. The lack of facets though is
>> currently stopping me from moving forward on some of the code development
>> which requires facets, which is why I am so interested in their
>> implementation. With looping through records, it could take minutes to get
>> proper visibility of data, whereas with Facets only a couple seconds if
>> that.
>>
>> While waiting, I am going to probably make that IP field type definition
>> I mentioned earlier, and possibly some additional ones. Most of the code for
>> that seems to make sense, but I'll need to load it up in something other
>> than a text editor to really get an appreciation for it. If some of what
>> needs to be done for facets can be explained, I'll perhaps see if I can
>> dedicate some company time to it.
>>
>>>
>>>>
>>>> On 10/18/2013 8:40 AM, Colton McInroy wrote:
>>>>
>>>>> Hello Aaron,
>>>>>
>>>>>      Yes, that's basically what I was thinking of for the facet results.
>>>>> The current implementation doesn't really make any sense if you're coming
>>>>> from lucene. For simplicity and uniformity, I think it should be somewhat
>>>>> like it is with lucene... with adaptation to the way blur is built... I
>>>>> could kinda see something like this...
>>>>>
>>>>>      public static void queryBlur(String queryString, String table) {
>>>>>          Iface client = BlurClient.getClient(mainConfig.getString("controllers"));
>>>>>          Query query = new Query();
>>>>>          query.setQuery(queryString);
>>>>>
>>>>>          Selector selector = new Selector();
>>>>>
>>>>>          // This will fetch all the columns in families "event" and "msg".
>>>>>          selector.addToColumnFamiliesToFetch("event");
>>>>>          selector.addToColumnFamiliesToFetch("msg");
>>>>>
>>>>>          BlurQuery blurQuery = new BlurQuery();
>>>>>          int matches = 10;
>>>>>          List<Facet> facets = Arrays.asList(new Facet("field1", matches),
>>>>>                  new Facet("field2", matches));
>>>>>          blurQuery.setFacets(facets);
>>>>>          blurQuery.setFetch(50);
>>>>>          blurQuery.setQuery(query);
>>>>>          blurQuery.setSelector(selector);
>>>>>
>>>>>          try {
>>>>>              BlurResults results = client.query(table, blurQuery);
>>>>>              // getFacetResults() is the proposed method, it does not exist yet.
>>>>>              for (Facet facet : results.getFacetResults()) {
>>>>>                  System.out.println(facet.name + " " + facet.value);
>>>>>              }
>>>>>          } catch (BlurException e) {
>>>>>              e.printStackTrace();
>>>>>          } catch (TException e) {
>>>>>              e.printStackTrace();
>>>>>          }
>>>>>      }
>>>>>
>>>>>      Just a brief modification from what I am doing now. Basically I just
>>>>> envision a method called getFacetResults which returns List<Facet> with
>>>>> each Facet object containing a "name" and a "value" which would be the
>>>>> column name and facet count respectively. I'm just throwing this out there
>>>>> for now. This is a different way of implementing the facets than lucene in
>>>>> terms of how the code is accessed, but it would provide the same results.
>>>>>
>>>>>      It could also be done something like this...
>>>>>
>>>>>      List<Facet> facets = Arrays.asList(new Facet("field1"), new
>>>>> Facet("field2"));
>>>>>      blurQuery.setFacets(facets, matches);
>>>>>
>>>>>      Depends if the number of matches should be per facet or per query,
>>>>> although I see the merits in being able to specify the matches for each
>>>>> field.
>>>>>
>>>>>
>>>>> On 10/18/2013 5:20 AM, Aaron McCurry wrote:
>>>>>
>>>>>> I have an issue in Jira to document facets in 0.2.1, it's not been worked
>>>>>> on yet but I hope I can get to it soon.  It looks like you figured out what
>>>>>> is there.
>>>>>>
>>>>>> We will likely improve facets in 0.3.0 so the API will have to change a
>>>>>> bit.  The biggest change we will need to make is for the scenario that you
>>>>>> bring up.  Facets in the current implementation are simply other
>>>>>> queries that can range from a single term to a complex query. I'm assuming
>>>>>> that you would like to specify a field name and get something like a map
>>>>>> of terms to counts for the given facet?
>>>>>>
>>>>>> The field facetCounts holds the counts for each of the facets in the input
>>>>>> list from the query.  So the count list corresponds one for one to the facet
>>>>>> list in the Query.  I realize this is less than ideal and we are going to
>>>>>> be improving it soon.
>>>>>>
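
Since the counts come back positionally, a client has to line them up with the facet queries it sent. A minimal sketch of that pairing follows. FacetCountPairs is a hypothetical helper, and plain lists are used in place of the Blur types so the example stays self-contained: the first list stands for the facet query strings placed on the query, the second for the list returned by results.getFacetCounts().

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Blur.
class FacetCountPairs {

    // Pairs the i-th facet query with the i-th count, keeping request order.
    // Note that a repeated query string would overwrite its earlier entry.
    static Map<String, Long> pair(List<String> facetQueries, List<Long> facetCounts) {
        if (facetQueries.size() != facetCounts.size()) {
            throw new IllegalArgumentException(
                    "expected one count per facet query but got "
                            + facetCounts.size() + " counts for "
                            + facetQueries.size() + " queries");
        }
        Map<String, Long> byQuery = new LinkedHashMap<>();
        for (int i = 0; i < facetQueries.size(); i++) {
            byQuery.put(facetQueries.get(i), facetCounts.get(i));
        }
        return byQuery;
    }

    public static void main(String[] args) {
        // Invented facet queries and counts, for illustration only.
        List<String> queries = Arrays.asList("event.Success:true", "event.Success:false");
        List<Long> counts = Arrays.asList(100000L, 10L);
        pair(queries, counts).forEach((q, c) -> System.out.println(q + " -> " + c));
    }
}
```

This also shows the limitation under discussion: the client must already know every value it wants counted, whereas field-based facets would discover the values for it.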
>>>>>> If you have some suggestions on how you would want the facet api to
>>>>>> operate, new features, or anything else for that matter just write up your
>>>>>> thoughts on this thread and we can incorporate them into the task.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Aaron
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 18, 2013 at 6:43 AM, Colton McInroy <colton@dosarrest.com> wrote:
>>>>>>
>>>>>>> Ok, so I created this method...
>>>>>>>
>>>>>>> public static BlurResults queryBlur(String queryString, String table) {
>>>>>>>           Iface client = BlurClient.getClient(mainConfig.getString("controllers"));
>>>>>>>           Query query = new Query();
>>>>>>>           query.setQuery(queryString);
>>>>>>>
>>>>>>>           Selector selector = new Selector();
>>>>>>>
>>>>>>>           // This will fetch all the columns in families "event" and "msg".
>>>>>>>           selector.addToColumnFamiliesToFetch("event");
>>>>>>>           selector.addToColumnFamiliesToFetch("msg");
>>>>>>>
>>>>>>>           BlurQuery blurQuery = new BlurQuery();
>>>>>>>           List<Facet> facets = Arrays.asList(new Facet(queryString,
>>>>>>>                   Long.MAX_VALUE));
>>>>>>>           blurQuery.setFacets(facets);
>>>>>>>           blurQuery.setFetch(50);
>>>>>>>           blurQuery.setQuery(query);
>>>>>>>           blurQuery.setSelector(selector);
>>>>>>>
>>>>>>>           try {
>>>>>>>               BlurResults results = client.query(table, blurQuery);
>>>>>>>               return results;
>>>>>>>           } catch (BlurException e) {
>>>>>>>               e.printStackTrace();
>>>>>>>           } catch (TException e) {
>>>>>>>               e.printStackTrace();
>>>>>>>           }
>>>>>>>           // Reached only when the query failed.
>>>>>>>           return null;
>>>>>>>       }
>>>>>>>
>>>>>>> From reading through source code, I was able to find out that you specify
>>>>>>> facets as a list, but this is fairly confusing to me coming from lucene.
>>>>>>>
>>>>>>> In lucene when getting facet data, I specify the facet fields I am
>>>>>>> interested in, and the facet results show me a top X list of values within
>>>>>>> that field. Whereas with blur, it appears that a facet is another query
>>>>>>> which gives only a number as a result. When I tried to obtain the facet
>>>>>>> data I am used to with Lucene, the only thing I could find was...
>>>>>>>
>>>>>>> System.out.println("Facet Results: " + results.getFacetCountsSize());
>>>>>>> System.out.println(JSONArray.toJSONString(results.getFacetCounts()));
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Could you please elaborate on this.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 10/18/2013 3:07 AM, Colton McInroy wrote:
>>>>>>>
>>>>>>>> I think I wrote this too soon, I believe I just found out how to do it.
>>>>>>>> I'll test it out and supply some example code if correct to help
>>>>>>>> others.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/18/2013 2:58 AM, Colton McInroy wrote:
>>>>>>>>
>>>>>>>>> Hey Aaron,
>>>>>>>>>
>>>>>>>>>       You mentioned a while ago that blur handles facets as well and that
>>>>>>>>> you would provide an example. Unless I have missed that email, I haven't
>>>>>>>>> seen an example yet, could you provide one? I just took a quick look myself
>>>>>>>>> and could not figure it out. I see there is an example FacetQueryTest.java
>>>>>>>>> in blur-query but that appears to be basically just a copy of the lucene
>>>>>>>>> file.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>
>>
>>
>
