incubator-blur-dev mailing list archives

From Aaron McCurry <>
Subject Re: Facets
Date Fri, 25 Oct 2013 19:31:56 GMT
I am willing to look at anything. My biggest concern is to get the API right first, which requires
knowing what features are really desired. That's why I suggested a Jira issue to discuss
the feature set.


Sent from my iPhone

On Oct 25, 2013, at 3:20 PM, Otis Gospodnetic <> wrote:

> I only skimmed this thread. Nobody seems to have mentioned Lucene's own
> faceting, which merits looking into.
> Otis
> Solr & ElasticSearch Support
> On Oct 22, 2013 1:56 AM, "Colton McInroy" <> wrote:
>> Thanks,
>> Colton McInroy
>> * Director of Security Engineering
>> Phone
>> (Toll Free)
>> _US_    (888)-818-1344 Press 2
>> _UK_    0-800-635-0551 Press 2
>> My Extension    101
>> 24/7 Support <>
>> Email <>
>> Website
>> On 10/21/2013 5:40 PM, Aaron McCurry wrote:
>>> On Mon, Oct 21, 2013 at 4:45 PM, Colton McInroy <> wrote:
>>>> You have any suggestions on how I should deal with needing this type of
>>>> information in the meantime?...
>>>> Typically what I used facet data for was to generate graph data. Instead
>>>> of having to go through every match, group it by time, count them up
>>>> manually, etc, I would get facet data for timestamps. For instance, I
>>>> create a query which says "field1:value". I would then grab the
>>>> facets for the Date field and use the facet counts to plot a graph of
>>>> timestamp vs. matches.
>>>> I was thinking of just going through all of the matches for now, which,
>>>> although probably not nearly as efficient as using lucene type facets,
>>>> would do the trick temporarily until proper facets are implemented.
>>> Agreed, is the date field only a date?  Or does it contain timestamps as
>>> well?  What is the range of the dates?  Days?  Weeks?  Months?  Years?
>>> All of the above?
>> To the second... YYYYMMDDHHmmss
>>> The reason I ask is basically, if you are looking at, let's say, a month's
>>> worth and you have a time scope on the date field of days, then that's
>>> only 30-31 facets that you will have to add manually to the query.
>>>  Obviously as the time scope and range grows this will get a little too
>>> messy to want to deal with on the client side.  Also you can use the terms
>>> call to get the current terms in a field, so if you want to traverse the
>>> indexed values that can give you that info.
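A minimal sketch of the manual approach described above: build one range query string per day and submit each string as a facet query. It assumes the Date field is indexed as a sortable string in YYYYMMDDHHmmss form and that the query parser accepts Lucene-style `[a TO b]` range syntax. The class and method names are hypothetical and are not part of Blur.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Blur API.
final class DayFacetQueries {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Builds one range query string per day in [first, last], inclusive. */
    static List<String> build(String field, LocalDate first, LocalDate last) {
        List<String> queries = new ArrayList<>();
        for (LocalDate d = first; !d.isAfter(last); d = d.plusDays(1)) {
            String day = d.format(DAY);
            // Assumed format: the range covers 00:00:00 through 23:59:59 of that day.
            queries.add(field + ":[" + day + "000000 TO " + day + "235959]");
        }
        return queries;
    }
}
```

For October 2013 this yields 31 query strings, which matches the "30-31 facets" estimate for a month at day granularity.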
>> Depends upon the timescale being queried. If the timescale is the past
>> hour, then it would be by minute, if it's over a month, then it would be by
>> hour. For lucene, I just get the facets, and post process them by shrinking
>> the timestamp value down to the level I want.... Such as if I wanted to
>> view hourly counts, I would loop through all of the facet results
>> condensing them down to minute values. Postprocessing the facet results
>> from lucene facets was by far a LOT quicker than going through all of the
>> actual results, which I am betting is probably the case with blur as well.
>> With lucene, facets was what I used the most when trying to present
>> information to GUI interfaces because it makes the most sense when viewing
>> for people.
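The post-processing step described above can be sketched as a prefix roll-up over the facet counts. This assumes the facet results arrive as a map from YYYYMMDDHHmmss strings to counts; the class and method names are hypothetical, and this is an illustration rather than the code used in the original Lucene setup.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene or Blur.
final class FacetRollup {
    /**
     * Sums facet counts whose YYYYMMDDHHmmss keys share the same prefix.
     * A prefixLength of 8 groups by day, 10 by hour, 12 by minute.
     */
    static Map<String, Long> rollUp(Map<String, Long> countsBySecond, int prefixLength) {
        Map<String, Long> rolled = new TreeMap<>(); // sorted, so buckets come out in time order
        for (Map.Entry<String, Long> e : countsBySecond.entrySet()) {
            String bucket = e.getKey().substring(0, prefixLength);
            rolled.merge(bucket, e.getValue(), Long::sum);
        }
        return rolled;
    }
}
```

The work is proportional to the number of distinct timestamps in the facet result, not to the number of matching rows, which is why it stays cheap on large result sets.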
>>> Just trying to help get you what you need right now.
>>> Currently the blur site lists facets as being something that works
>>>> here...
>>>> <http://****it_works.html<>
>>>> But as this thread kinda pointed out, facets the way faceted
>>>> classification describes them do not exist right now within apache blur. So
>>>> someone may want to change that to say that they are currently on the
>>>> todo list or something.
>>>> <http:/**/**Faceted_classification<>
>>>> A great example I use to show people what facets are is the following
>>>> site...
>>>> <http://www.****consumer-electronics<>
>>>> On the left side, it is easy to see a breakdown of all the different
>>>> Fields/Values associated with the current search query. My intention is
>>>> to
>>>> display facet data for all (or the important ones anyway) of the fields
>>>> associated with the current query along with a line graph showing the
>>>> count
>>>> of all matching rows for each time interval. Then the query can be
>>>> refined
>>>> more by querying a specific time range, or field.
>>>> Is proper facet implementation something that has a somewhat high
>>>> priority and will hopefully be at least partially implemented within the
>>>> next couple of weeks/months? Or should I just work on processing all the
>>>> results myself for now? Also, I notice the default query matches is only
>>>> 10, and I see no way to specify unlimited. Can I specify -1 for unlimited
>>>> or something like that, or do I need to specify a really large number that
>>>> will always be higher than the number of actual results I am expecting...
>>>> like Long.MAX_VALUE or something?
>>> I agree it is a priority, but my top priority is getting 0.2.1 out the door.
>>> If we can decide on the API changes that need to be made in the facet
>>> API, we can begin on it in 0.3.0 at any point.  Once 0.2.1 is complete
>>> I will be turning my focus to 0.3.0; I hope to call for a vote for 0.2.1
>>> in the next week.
>>> Ok, so for queries you can page through the results.  However the facet
>>> counts reflect the entire answer.  You can't ask for all the results back
>>> at once due to memory constraints within the system.  But you can set in
>>> the BlurQuery object the start and fetch (which is the number to fetch).
>>> Struct_BlurQuery<>
>> Hmm... yea, when going through say 100,000,000+ rows to generate a graph,
>> it is no doubt going to take a long time though re-querying in 1,000
>> results intervals 100,000+ times. If that's for only 5 minutes of data,
>> it's a huge amount of processing to see general statistics of the data you
>> have in front of you.
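The paging cost discussed above can be worked out with a small sketch. The start and fetch values correspond to the BlurQuery fields mentioned earlier in the thread; the helper class itself is hypothetical and only does the arithmetic.

```java
// Hypothetical helper, not part of the Blur API.
final class Paging {
    /** Number of queries needed to page through totalRows, fetching `fetch` rows per query. */
    static long pagesNeeded(long totalRows, int fetch) {
        if (fetch <= 0) {
            throw new IllegalArgumentException("fetch must be positive");
        }
        return (totalRows + fetch - 1) / fetch; // ceiling division
    }

    /** Start offset to use for the given zero-based page. */
    static long startOf(long page, int fetch) {
        return page * fetch;
    }
}
```

With 100,000,000 rows and a fetch of 1,000, `pagesNeeded` gives 100,000 round trips, which is the figure quoted in the message.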
>> This is where facets became vital for me. I understand that right now
>> "facets" in blur are not really facets, they are instead additional queries
>> which get run. Not really sure why it was implemented that way, but when
>> you read the lucene documentation (**core/4_3_0/<>)
>> it links to wiki pages about faceted searches as well as a user guide
>> explaining what facets are. The implementation in blur does not match what
>> everything else defines facets as.
>> I'm not sure who implemented facets in the current manner, or how that
>> came about, but it does not make sense at all or comply with any definition of
>> facets I have found. I find this to be a conflict, if blur advertises them
>> but does not really have them. Since there is no documentation about facets
>> really, other than it saying it's in the feature list, it took me a while
>> to discover this. For me in particular, this is vital. What use is indexing
>> massive amounts of information if you do not have very good visibility of
>> it?
>> As I have mentioned, my use is for storing logged events. Let's say you
>> have events for sshd being stored in a table along with the fields Date,
>> LoginMethod, IP, User, Server, and Success, and you have a LOT of servers being
>> monitored which have a lot of user login activity. In lucene I would do a
>> single query against any of those fields, or perhaps just start with
>> matching all records. Along with that query, I would get the facets for
>> those fields using Date to display a time graph of activity for the rows. I
>> would then display the top 5-10 facets for each field along with a subquery
>> that does just a facetquery to display another time graph of the Date
>> facets. With this you can instantly see 10 login failures within 100,000
>> successes, how many times each user has logged in and what methods were
>> used, etc. This is a simple example, but expand that out to all kinds of
>> other information and it's night and day visibility of data.
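To make the shape of the desired output concrete, here is a brute-force sketch that counts the values of one field and keeps the top N. It walks every row in memory, so it only illustrates what a term-to-count facet result looks like; it is not how Lucene's facet module computes counts, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a term-count facet result.
final class TermFacets {
    /** Counts how often each value of `field` occurs across the given rows. */
    static Map<String, Long> count(List<Map<String, String>> rows, String field) {
        Map<String, Long> counts = new HashMap<>();
        for (Map<String, String> row : rows) {
            String value = row.get(field);
            if (value != null) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Returns the n most frequent entries, highest count first, ties broken by term. */
    static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        return entries.subList(0, Math.min(n, entries.size()));
    }
}
```

For the sshd example, `top(count(rows, "User"), 5)` would give the five users with the most login events among the rows supplied.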
>> When trying to view data of any kind in an effective manner, graphing
>> always helps, but to process every matching row is obviously inefficient. I
>> believe some of the other systems out there such as splunk do that, but
>> when I did my own work, I found that too slow and inefficient. Sure, it
>> works fine when viewing a small amount of data, but when we are talking
>> about big data, which is what Blur is designed for, and what I am working
>> with, it's just too much overhead. Using facets on date values to produce
>> time graphs of entries, no matter how many rows/records you produce, is
>> pretty much instant.
>> In splunk or other search systems, I would see events populated over time
>> in a graph along with the first page of data. The time graph continues to
>> fill over time showing a timeline of data. Depending on your data, this can
>> take a seriously long time. This is no doubt doing what you're suggesting
>> with the processing of data one page at a time, sending it to the browser
>> to parse into data stores that display graphs.
>> With facet results, I was able to display the historical timelines in the
>> same amount of time it took to do a single query along with the facet data.
>> There just is no match from what I have seen so far, for Lucene indexes
>> along with facet indexes, which is what got me so excited about blur. I
>> myself literally was in the design phase of writing my own implementation
>> of a distributed lucene index system when I decided to stop and check what
>> was out there before re-inventing the wheel. When I came across the blur
>> project, I found the feature list and looked at two things primarily which
>> got me into starting to work with the project. Those two things were "Fast
>> data ingestion" and "Facets". So far, data seems to be getting ingested pretty
>> quickly in my virtual box tests, which is good. I am going to be scaling up
>> soon once the new hardware requisition is finished. Facets though is
>> currently stopping me from moving forward on some of the code development
>> which requires facets, which is why I am so interested in its
>> implementation. With looping through records, it could take minutes to get
>> proper visibility of data, whereas with Facets only a couple seconds if
>> that.
>> While waiting, I am probably going to make that IP field type definition I
>> mentioned earlier, and possibly some additional ones. Most of the code for
>> that seems to make sense, but I'll need to load it up in something other
>> than a text editor to really get an appreciation for it. If some of what
>> needs to be done for facets can be explained, I'll perhaps see if I can
>> dedicate some company time to it.
>>>> Thanks,
>>>> Colton McInroy
>>>> On 10/18/2013 8:40 AM, Colton McInroy wrote:
>>>>> Hello Aaron,
>>>>>     Yes, that's basically what I was thinking of for the facet results.
>>>>> The current implementation doesn't really make any sense if you're coming
>>>>> from lucene. For simplicity and uniformity, I think it should be somewhat
>>>>> like it is with lucene... with adaptation to the way blur is built... I
>>>>> could kinda see something like this...
>>>>>     public static void queryBlur(String queryString, String table) {
>>>>>         Iface client = BlurClient.getClient(mainConfig.getString("controllers"));
>>>>>         Query query = new Query();
>>>>>         query.setQuery(queryString);
>>>>>         Selector selector = new Selector();
>>>>>         // Fetch all the columns in families "event" and "msg".
>>>>>         selector.addToColumnFamiliesToFetch("event");
>>>>>         selector.addToColumnFamiliesToFetch("msg");
>>>>>         BlurQuery blurQuery = new BlurQuery();
>>>>>         int matches = 10;
>>>>>         // Proposed: a Facet names a field and the number of matches wanted.
>>>>>         List<Facet> facets = Arrays.asList(new Facet("field1", matches),
>>>>>                 new Facet("field2", matches));
>>>>>         blurQuery.setFacets(facets);
>>>>>         blurQuery.setFetch(50);
>>>>>         blurQuery.setQuery(query);
>>>>>         blurQuery.setSelector(selector);
>>>>>         try {
>>>>>             BlurResults results = client.query(table, blurQuery);
>>>>>             // Proposed: getFacetResults() is the envisioned method, not an existing one.
>>>>>             for (Facet facet : results.getFacetResults()) {
>>>>>                 System.out.println(" " + facet.value);
>>>>>             }
>>>>>         } catch (BlurException e) {
>>>>>             e.printStackTrace();
>>>>>         } catch (TException e) {
>>>>>             e.printStackTrace();
>>>>>         }
>>>>>     }
>>>>>     Just a brief modification from what I am doing now. Basically I just
>>>>> envision a method called getFacetResults which returns List<Facet>,
>>>>> each Facet object containing a "name" and a "value" which would be the
>>>>> column name and facet count respectively. I'm just throwing this out there
>>>>> for now. This is a different way of implementing the facets than lucene in
>>>>> terms of how the code is accessed, but it would provide the same results.
>>>>>     It could also be done something like this...
>>>>>     List<Facet> facets = Arrays.asList(new Facet("field1"), new
>>>>> Facet("field2"));
>>>>>     blurQuery.setFacets(facets, matches);
>>>>>     Depends if the number of matches should be per facet or per query,
>>>>> although I see the merits in being able to specify the matches for each
>>>>> field.
>>>>> Thanks,
>>>>> Colton McInroy
>>>>> On 10/18/2013 5:20 AM, Aaron McCurry wrote:
>>>>>> I have an issue in Jira to document facets in 0.2.1, it's not been worked
>>>>>> yet but I hope I can get to it soon.  It looks like you figured out what
>>>>>> is there.
>>>>>> We will likely improve facets in 0.3.0 so the API will have to change a
>>>>>> bit.  The biggest change we will need to make is the scenario that you
>>>>>> bring up.  Facets in the current implementation are simply other
>>>>>> queries that can range from a single term to a complex query. I'm assuming
>>>>>> that you would like to specify a field name and get something like a map
>>>>>> of terms to counts for the given facet?
>>>>>> The field facetCounts holds the counts for each of the facets in the input
>>>>>> list from the query.  So the count list corresponds one for one to the
>>>>>> list in the Query.  I realize this is less than ideal and we can
>>>>>> be improving it soon.
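A small sketch of consuming that one-for-one correspondence: pair the i-th facet query with the i-th count. It assumes the facet queries are available as strings and the counts as a parallel list of longs; the helper class is hypothetical and not part of Blur.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Blur API.
final class FacetCounts {
    /** Pairs the i-th facet query with the i-th count, preserving request order. */
    static Map<String, Long> pair(List<String> facetQueries, List<Long> facetCounts) {
        if (facetQueries.size() != facetCounts.size()) {
            throw new IllegalArgumentException("facet and count lists differ in length");
        }
        Map<String, Long> byQuery = new LinkedHashMap<>();
        for (int i = 0; i < facetQueries.size(); i++) {
            byQuery.put(facetQueries.get(i), facetCounts.get(i));
        }
        return byQuery;
    }
}
```

Because the only link between a facet and its count is list position, the caller has to keep the original facet list around to interpret the counts.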
>>>>>> If you have some suggestions on how you would want the facet api to
>>>>>> operate, new features, or anything else for that matter just write your
>>>>>> thoughts on this thread and we can incorporate them into the task.
>>>>>> Thanks!
>>>>>> Aaron
>>>>>> On Fri, Oct 18, 2013 at 6:43 AM
