incubator-blur-dev mailing list archives

From Colton McInroy <col...@dosarrest.com>
Subject Re: Facets
Date Fri, 25 Oct 2013 19:39:01 GMT
Umm... isn't that what I did? I mentioned it a few times, supplied a 
link to the lucene documentation, etc.

Thanks,
Colton McInroy

  * Director of Security Engineering

	
Phone
(Toll Free) 	
_US_ 	(888)-818-1344 Press 2
_UK_ 	0-800-635-0551 Press 2

My Extension 	101
24/7 Support 	support@dosarrest.com <mailto:support@dosarrest.com>
Email 	colton@dosarrest.com <mailto:colton@dosarrest.com>
Website 	http://www.dosarrest.com

On 10/25/2013 12:20 PM, Otis Gospodnetic wrote:
> I only skimmed this thread. Nobody seems to have mentioned Lucene's own
> faceting, which merits looking into.
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On Oct 22, 2013 1:56 AM, "Colton McInroy" <colton@dosarrest.com> wrote:
>
>> On 10/21/2013 5:40 PM, Aaron McCurry wrote:
>>
>>> On Mon, Oct 21, 2013 at 4:45 PM, Colton McInroy <colton@dosarrest.com> wrote:
>>>
>>>> Do you have any suggestions on how I should deal with needing this type of
>>>> information in the meantime?
>>>>
>>>> Typically what I used facet data for was to generate graph data. Instead
>>>> of having to go through every match, group it by time, count them up
>>>> manually, etc., I would get facet data for timestamps. For instance, I
>>>> would create a query such as "field1:value", grab the facets for the
>>>> Date field, and use the facet counts to plot a graph of timestamp
>>>> against matches.
>>>>
>>>> I was thinking of just going through all of the matches for now, which,
>>>> although probably not nearly as efficient as using Lucene-style facets,
>>>> would do the trick temporarily until proper facets are implemented.
>>>>
>>> Agreed. Is the date field only a date, or does it contain timestamps as
>>> well? What is the range of the dates? Days? Weeks? Months? Years? All
>>> of the above?
>>>
>> To the second... YYYYMMDDHHmmss
>>
>>> The reason I ask is that if you are looking at, say, a month's worth of
>>> data and the date field has a time scope of days, then that's only 30-31
>>> facets that you will have to add manually to the query. Obviously, as the
>>> time scope and range grow, this will get too messy to deal with on the
>>> client side. Also, you can use the terms call to get the current terms in
>>> a field, so if you want to traverse the indexed values, that can give you
>>> that info.
>>>
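A minimal sketch of the "one facet per day" workaround described above, assuming the Date field is indexed as a sortable YYYYMMDDHHmmss string so that a Lucene range query selects a whole day. The class name and the field name `event.Date` are hypothetical; each generated string would then be wrapped in a facet the same way the thread's own code does with `new Facet(queryString, Long.MAX_VALUE)`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

final class DailyFacetQueries {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Builds one range query string per day in [first, last], both inclusive. */
    static List<String> build(String field, LocalDate first, LocalDate last) {
        List<String> queries = new ArrayList<>();
        for (LocalDate d = first; !d.isAfter(last); d = d.plusDays(1)) {
            String day = d.format(DAY);
            // Assumption: timestamps look like YYYYMMDDHHmmss, so one day
            // spans <day>000000 through <day>235959.
            queries.add(field + ":[" + day + "000000 TO " + day + "235959]");
        }
        return queries;
    }
}
```

For October 2013 this yields 31 query strings, which matches the "30-31 facets" estimate for a month viewed by day.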
>> It depends on the timescale being queried. If the timescale is the past
>> hour, then it would be by minute; if it's over a month, then it would be by
>> hour. For Lucene, I just get the facets and post-process them by shrinking
>> the timestamp value down to the level I want. For example, if I wanted to
>> view hourly counts, I would loop through all of the facet results,
>> condensing them down to hour values. Post-processing the facet results
>> from Lucene was by far a LOT quicker than going through all of the
>> actual results, which I am betting is probably the case with Blur as well.
>> With Lucene, facets were what I used the most when presenting
>> information in GUIs, because that is what makes the most sense to the
>> people viewing it.
>>
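The post-processing step described above can be sketched as follows, assuming the facet results have already been collected into a map from YYYYMMDDHHmmss timestamp to count. The class and method names are hypothetical; the roll-up works by truncating each key to a shorter prefix.

```java
import java.util.Map;
import java.util.TreeMap;

final class FacetRollup {
    /**
     * Collapses per-second counts keyed by YYYYMMDDHHmmss into coarser buckets
     * by keeping only the first prefixLength characters of each key.
     */
    static Map<String, Long> rollUp(Map<String, Long> perSecond, int prefixLength) {
        // TreeMap keeps the buckets in chronological order for plotting.
        Map<String, Long> buckets = new TreeMap<>();
        for (Map.Entry<String, Long> e : perSecond.entrySet()) {
            String bucket = e.getKey().substring(0, prefixLength);
            buckets.merge(bucket, e.getValue(), Long::sum);
        }
        return buckets;
    }
}
```

A prefix length of 8 gives daily buckets, 10 hourly, and 12 per-minute. The loop visits each distinct timestamp once, so its cost depends on the number of facet values, not on the number of matching rows.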
>>> Just trying to help get you what you need right now.
>>>
>>>> Currently the Blur site lists facets as being something that works
>>>> here...
>>>>
>>>> http://incubator.apache.org/blur/how_it_works.html
>>>>
>>>> But as this thread kind of pointed out, facets as faceted
>>>> classification describes them do not exist right now within Apache Blur.
>>>> So someone may want to change that to say that it is currently on the
>>>> to-do list or something.
>>>>
>>>> http://en.wikipedia.org/wiki/Faceted_classification
>>>>
>>>> A great example I use to show people what facets are is the following
>>>> site...
>>>>
>>>> http://www.fasttech.com/category/1499/consumer-electronics
>>>>
>>>> On the left side, it is easy to see a breakdown of all the different
>>>> fields/values associated with the current search query. My intention is
>>>> to display facet data for all (or at least the important ones) of the
>>>> fields associated with the current query, along with a line graph showing
>>>> the count of all matching rows for each time interval. Then the query can
>>>> be refined further by querying a specific time range or field.
>>>>
>>>> Is a proper facet implementation something that has a somewhat high
>>>> priority and will hopefully be at least partially implemented within the
>>>> next couple of weeks/months? Or should I just work on processing all the
>>>> results myself for now? Also, I notice the default number of query matches
>>>> is only 10, and I see no way to specify unlimited. Can I specify -1 for
>>>> unlimited or something like that, or do I need to specify a really large
>>>> number that will always be higher than the number of actual results I am
>>>> expecting, like Long.MAX_VALUE?
>>>>
>>> I agree it is a priority; my top priority is getting 0.2.1 out the door.
>>> But if we can decide on the API changes that need to be made in the
>>> facet API, we can begin on it in 0.3.0 at any point. Once 0.2.1 is
>>> complete I will be turning my focus to 0.3.0. I hope to call for a vote
>>> for 0.2.1 in the next week.
>>>
>>> OK, so for queries you can page through the results. However, the facet
>>> counts reflect the entire answer. You can't ask for all the results back
>>> at once due to memory constraints within the system. But you can set
>>> the start and fetch (which is the number to fetch) in the BlurQuery object.
>>>
>>> http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_BlurQuery
>>>
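A rough sketch of the paging loop that start and fetch imply. `PageFetcher` is a hypothetical stand-in for "set start and fetch on the query, run it, and return that page of results"; it is not part of the Blur API and exists only so the loop can be shown on its own.

```java
import java.util.List;
import java.util.function.Consumer;

final class Pager {
    /** Hypothetical stand-in for running one query with a given start and fetch. */
    interface PageFetcher<T> {
        List<T> fetch(long start, int fetch);
    }

    /** Visits every result page by page and returns the number of queries issued. */
    static <T> long forEachResult(PageFetcher<T> fetcher, long total, int pageSize,
                                  Consumer<T> action) {
        long queries = 0;
        for (long start = 0; start < total; start += pageSize) {
            // The last page may be shorter than pageSize.
            int fetch = (int) Math.min(pageSize, total - start);
            List<T> page = fetcher.fetch(start, fetch);
            queries++;
            if (page.isEmpty()) {
                break; // nothing left, stop early
            }
            page.forEach(action);
        }
        return queries;
    }
}
```

The number of round trips is the total divided by the page size, rounded up: 100,000,000 rows at 1,000 per page means 100,000 queries, which is why paging is a poor substitute for facet counts when the goal is only a graph.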
>> Hmm... yeah, when going through, say, 100,000,000+ rows to generate a graph,
>> it is no doubt going to take a long time, re-querying in 1,000-result
>> intervals 100,000+ times. If that's for only 5 minutes of data, it's a huge
>> amount of processing to see general statistics of the data you have in
>> front of you.
>>
>> This is where facets became vital for me. I understand that right now
>> "facets" in Blur are not really facets; they are instead additional queries
>> which get run. I'm not really sure why it was implemented that way. The
>> Lucene documentation (http://lucene.apache.org/core/4_3_0/) links to wiki
>> pages about faceted search as well as a user guide explaining what facets
>> are, and the implementation in Blur does not match what everything else
>> defines facets as.
>>
>> I'm not sure who implemented facets in the current manner, or why, but it
>> does not make sense to me, nor does it match any definition of facets I
>> have found. I find this to be a conflict if Blur advertises them but does
>> not really have them. Since there is no real documentation about facets,
>> other than the feature list saying they are supported, it took me a while
>> to discover this. For me in particular, this is vital. What use is indexing
>> massive amounts of information if you do not have very good visibility of
>> it?
>>
>> As I have mentioned, my use is for storing logged events. Let's say you
>> have events for sshd being stored in a table along with the fields Date,
>> LoginMethod, IP, User, Server, and Success, and you have a LOT of servers
>> being monitored which have a lot of user login activity. In Lucene I would
>> do a single query against any of those fields, or perhaps just start with
>> matching all records. Along with that query, I would get the facets for
>> those fields, using Date to display a time graph of activity for the rows. I
>> would then display the top 5-10 facets for each field, along with a subquery
>> that does just a facet query to display another time graph of the Date
>> facets. With this you can instantly see 10 login failures within 100,000
>> successes, how many times each user has logged in and what methods were
>> used, etc. This is a simple example, but expand that out to all kinds of
>> other information and it's a night-and-day difference in visibility of data.
>>
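To make the "top 5-10 facets for each field" idea concrete, here is a small in-memory sketch of what such a result contains for one field. It counts values by scanning rows, which is exactly the work a facet index is meant to avoid, so it only illustrates the shape of the output. The class name and the row representation (one map of field name to value per event) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TopValues {
    /** Counts how often each value of one field occurs and returns the n most frequent. */
    static List<Map.Entry<String, Long>> top(List<Map<String, String>> rows, String field, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (Map<String, String> row : rows) {
            String value = row.get(field);
            if (value != null) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken by value so the order is deterministic.
        sorted.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return sorted.subList(0, Math.min(n, sorted.size()));
    }
}
```

Calling `top(rows, "User", 5)` and `top(rows, "Success", 5)` on the sshd example would give the per-user login counts and the success/failure split described above.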
>> When trying to view data of any kind in an effective manner, graphing
>> always helps, but processing every matching row is obviously inefficient. I
>> believe some of the other systems out there, such as Splunk, do that, but
>> when I did my own work, I found that too slow and inefficient. Sure, it
>> works fine when viewing a small amount of data, but when we are talking
>> about big data, which is what Blur is designed for and what I am working
>> with, it's just too much overhead. Using facets on date values to produce
>> time graphs of entries is almost instant, no matter how many rows/records
>> match.
>>
>> In Splunk or other search systems, I would see events populated over time
>> in a graph along with the first page of data. The time graph continues to
>> fill in over time, showing a timeline of data. Depending on your data, this
>> can take a seriously long time. This is no doubt doing what you're
>> suggesting: processing the data one page at a time and sending it to the
>> browser to parse into data stores that display graphs.
>>
>> With facet results, I was able to display the historical timelines in the
>> same amount of time it took to do a single query along with the facet data.
>> From what I have seen so far, nothing matches Lucene indexes combined with
>> facet indexes, which is what got me so excited about Blur. I was
>> literally in the design phase of writing my own implementation
>> of a distributed Lucene index system when I decided to stop and check what
>> was out there before reinventing the wheel. When I came across the Blur
>> project, I found the feature list and looked primarily at two things which
>> got me started working with the project: "Fast data ingestion" and
>> "Facets". So far, data seems to be getting ingested pretty
>> quickly in my virtual box tests, which is good. I am going to be scaling up
>> soon once the new hardware requisition is finished. The lack of facets,
>> though, is currently stopping me from moving forward on some of the code
>> development which requires them, which is why I am so interested in their
>> implementation. With looping through records, it could take minutes to get
>> proper visibility of data, whereas with facets it takes only a couple of
>> seconds, if that.
>>
>> While waiting, I am probably going to make that IP field type definition I
>> mentioned earlier, and possibly some additional ones. Most of the code for
>> that seems to make sense, but I'll need to load it up in something other
>> than a text editor to really get an appreciation for it. If some of what
>> needs to be done for facets can be explained, I'll perhaps see if I can
>> dedicate some company time to it.
>>
>>>> On 10/18/2013 8:40 AM, Colton McInroy wrote:
>>>>
>>>>> Hello Aaron,
>>>>>       Yes, that's basically what I was thinking of for the facet results.
>>>>> The current implementation doesn't really make any sense if you're coming
>>>>> from Lucene. For simplicity and uniformity, I think it should be somewhat
>>>>> like it is with Lucene, with adaptation to the way Blur is built. I
>>>>> could kind of see something like this...
>>>>>
>>>>>       public static void queryBlur(String queryString, String table) {
>>>>>           Iface client = BlurClient.getClient(mainConfig.getString("controllers"));
>>>>>           Query query = new Query();
>>>>>           query.setQuery(queryString);
>>>>>
>>>>>           Selector selector = new Selector();
>>>>>
>>>>>           // Fetch all the columns in the families "event" and "msg".
>>>>>           selector.addToColumnFamiliesToFetch("event");
>>>>>           selector.addToColumnFamiliesToFetch("msg");
>>>>>
>>>>>           BlurQuery blurQuery = new BlurQuery();
>>>>>           // Proposed: a Facet names a field and the number of top values wanted.
>>>>>           int matches = 10;
>>>>>           List<Facet> facets = Arrays.asList(new Facet("field1", matches),
>>>>>                   new Facet("field2", matches));
>>>>>           blurQuery.setFacets(facets);
>>>>>           blurQuery.setFetch(50);
>>>>>           blurQuery.setQuery(query);
>>>>>           blurQuery.setSelector(selector);
>>>>>
>>>>>           try {
>>>>>               BlurResults results = client.query(table, blurQuery);
>>>>>               // Proposed: getFacetResults() is the envisioned method, not a current one.
>>>>>               for (Facet facet : results.getFacetResults()) {
>>>>>                   System.out.println(facet.name + " " + facet.value);
>>>>>               }
>>>>>           } catch (BlurException e) {
>>>>>               e.printStackTrace();
>>>>>           } catch (TException e) {
>>>>>               e.printStackTrace();
>>>>>           }
>>>>>       }
>>>>>
>>>>>       Just a brief modification from what I am doing now. Basically I just
>>>>> envision a method called getFacetResults which returns List<Facet>, with
>>>>> each Facet object containing a "name" and a "value", which would be the
>>>>> column name and facet count respectively. I'm just throwing this out there
>>>>> for now. This is a different way of implementing the facets than Lucene in
>>>>> terms of how the code is accessed, but it would provide the same results.
>>>>>
>>>>>       It could also be done something like this...
>>>>>
>>>>>       List<Facet> facets = Arrays.asList(new Facet("field1"), new
>>>>> Facet("field2"));
>>>>>       blurQuery.setFacets(facets, matches);
>>>>>
>>>>>       Depends if the number of matches should be per facet or per query,
>>>>> although I see the merits in being able to specify the matches for each
>>>>> field.
>>>>>
>>>>> On 10/18/2013 5:20 AM, Aaron McCurry wrote:
>>>>>
>>>>>> I have an issue in Jira to document facets in 0.2.1. It has not been
>>>>>> worked on yet, but I hope I can get to it soon. It looks like you figured
>>>>>> out what is there.
>>>>>>
>>>>>> We will likely improve facets in 0.3.0, so the API will have to change a
>>>>>> bit. The biggest change we will need to make is for the scenario that you
>>>>>> bring up. Facets in the current implementation are simply other
>>>>>> queries that can range from a single term to a complex query. I'm assuming
>>>>>> that you would like to specify a field name and get something like a map
>>>>>> of terms to counts for the given facet?
>>>>>>
>>>>>> The field facetCounts holds the counts for each of the facets in the input
>>>>>> list from the query, so the count list corresponds one-for-one to the facet
>>>>>> list in the query. I realize this is less than ideal, and we are going to
>>>>>> be improving it soon.
>>>>>>
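Because the counts come back as a bare list in the same order as the facets that were sent, the caller has to pair them up by position. A minimal sketch, assuming the facet query strings are kept in a list on the client side and the counts arrive as a `List<Long>`; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FacetCounts {
    /** Pairs the i-th facet query string with the i-th returned count. */
    static Map<String, Long> label(List<String> facetQueries, List<Long> facetCounts) {
        if (facetQueries.size() != facetCounts.size()) {
            throw new IllegalArgumentException("expected one count per facet query");
        }
        // LinkedHashMap preserves the order in which the facets were submitted.
        Map<String, Long> labelled = new LinkedHashMap<>();
        for (int i = 0; i < facetQueries.size(); i++) {
            labelled.put(facetQueries.get(i), facetCounts.get(i));
        }
        return labelled;
    }
}
```

This only attaches labels to counts for queries the client already chose. It does not discover the values present in a field, which is the gap the rest of the thread is about.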
>>>>>> If you have some suggestions on how you would want the facet API to
>>>>>> operate, new features, or anything else for that matter, just write up your
>>>>>> thoughts on this thread and we can incorporate them into the task.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Aaron
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 18, 2013 at 6:43 AM, Colton McInroy <colton@dosarrest.com> wrote:
>>>>>>
>>>>>>> OK, so I created this method...
>>>>>>
>>>>>>> public static BlurResults queryBlur(String queryString, String table) {
>>>>>>>     Iface client = BlurClient.getClient(mainConfig.getString("controllers"));
>>>>>>>     Query query = new Query();
>>>>>>>     query.setQuery(queryString);
>>>>>>>
>>>>>>>     Selector selector = new Selector();
>>>>>>>
>>>>>>>     // Fetch all the columns in the families "event" and "msg".
>>>>>>>     selector.addToColumnFamiliesToFetch("event");
>>>>>>>     selector.addToColumnFamiliesToFetch("msg");
>>>>>>>
>>>>>>>     BlurQuery blurQuery = new BlurQuery();
>>>>>>>     // Here the facet is just another query string, the same one as the main query.
>>>>>>>     List<Facet> facets = Arrays.asList(new Facet(queryString, Long.MAX_VALUE));
>>>>>>>     blurQuery.setFacets(facets);
>>>>>>>     blurQuery.setFetch(50);
>>>>>>>     blurQuery.setQuery(query);
>>>>>>>     blurQuery.setSelector(selector);
>>>>>>>
>>>>>>>     try {
>>>>>>>         BlurResults results = client.query(table, blurQuery);
>>>>>>>         return results;
>>>>>>>     } catch (BlurException e) {
>>>>>>>         e.printStackTrace();
>>>>>>>     } catch (TException e) {
>>>>>>>         e.printStackTrace();
>>>>>>>     }
>>>>>>>     // Reached only when the query failed.
>>>>>>>     return null;
>>>>>>> }
>>>>>>>
>>>>>>> From reading through the source code, I was able to find out that you
>>>>>>> specify facets as a list, but this is fairly confusing to me coming from
>>>>>>> Lucene.
>>>>>>>
>>>>>>> In Lucene, when getting facet data, I specify the facet fields I am
>>>>>>> interested in, and the facet results show me a top X list of values within
>>>>>>> that field. Whereas with Blur, it appears that a facet is another query
>>>>>>> which gives only a number as a result. When I tried to obtain the facet
>>>>>>> data I am used to with Lucene, the only thing I could find was...
>>>>>>>
>>>>>>> System.out.println("Facet Results: " + results.getFacetCountsSize());
>>>>>>> System.out.println(JSONArray.toJSONString(results.getFacetCounts()));
>>>>>>>
>>>>>>>
>>>>>>> Could you please elaborate on this?
>>>>>>>
>>>>>>>
>>>>>>> On 10/18/2013 3:07 AM, Colton McInroy wrote:
>>>>>>>
>>>>>>>> I think I wrote this too soon; I believe I just found out how to do it.
>>>>>>>> I'll test it out and supply some example code if correct, to help
>>>>>>>> others.
>>>>>>>>
>>>>>>>> On 10/18/2013 2:58 AM, Colton McInroy wrote:
>>>>>>>>
>>>>>>>>> Hey Aaron,
>>>>>>>>>
>>>>>>>>>        You mentioned a while ago that Blur handles facets as well and that
>>>>>>>>> you would provide an example. Unless I have missed that email, I haven't
>>>>>>>>> seen an example yet; could you provide one? I just took a quick look myself
>>>>>>>>> and could not figure it out. I see there is an example, FacetQueryTest.java,
>>>>>>>>> in blur-query, but that appears to be basically just a copy of the Lucene
>>>>>>>>> file.

