incubator-blur-dev mailing list archives

From Otis Gospodnetic <otis.gospodne...@gmail.com>
Subject Re: Facets
Date Fri, 25 Oct 2013 19:20:46 GMT
I only skimmed this thread. Nobody seems to have mentioned Lucene's own
faceting, which merits looking into.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Oct 22, 2013 1:56 AM, "Colton McInroy" <colton@dosarrest.com> wrote:

>
> Thanks,
> Colton McInroy
>
>  * Director of Security Engineering
>
>
> Phone
> (Toll Free)
> _US_    (888)-818-1344 Press 2
> _UK_    0-800-635-0551 Press 2
>
> My Extension    101
> 24/7 Support    support@dosarrest.com <mailto:support@dosarrest.com>
> Email   colton@dosarrest.com <mailto:colton@dosarrest.com>
> Website         http://www.dosarrest.com
>
> On 10/21/2013 5:40 PM, Aaron McCurry wrote:
>
>> On Mon, Oct 21, 2013 at 4:45 PM, Colton McInroy <colton@dosarrest.com> wrote:
>>
>>> Do you have any suggestions on how I should deal with needing this type of
>>> information in the meantime?
>>>
>>> Typically what I used facet data for was to generate graph data. Instead
>>> of having to go through every match, group it by time, count them up
>>> manually, etc., I would get facet data for timestamps. For instance, I
>>> would create a query such as "field1:value", grab the facets for the Date
>>> field, and use the facet counts to plot a graph of timestamp versus
>>> matches.
>>>
>>> I was thinking of just going through all of the matches for now, which,
>>> although probably not nearly as efficient as using Lucene-type facets,
>>> would do the trick temporarily until proper facets are implemented.
>>>
>> Agreed. Is the date field only a date? Or does it contain timestamps as
>> well? What is the range of the dates? Days? Weeks? Months? Years? All of
>> the above?
>>
> To the second... YYYYMMDDHHmmss
>
>>
>> The reason I ask is basically this: if you are looking at, let's say, a
>> month's worth of data and the time scope on the date field is days, then
>> that's only 30-31 facets that you will have to add manually to the query.
>> Obviously, as the time scope and range grow, this will get a little too
>> messy to want to deal with on the client side. Also, you can use the terms
>> call to get the current terms in a field, so if you want to traverse the
>> indexed values, that can give you that info.
>>
> Depends upon the timescale being queried. If the timescale is the past
> hour, then it would be by minute; if it's over a month, then it would be by
> hour. For Lucene, I just get the facets and post-process them by shrinking
> the timestamp value down to the level I want. For example, if I wanted to
> view hourly counts, I would loop through all of the facet results,
> condensing them down to hour values. Post-processing the facet results
> from Lucene facets was by far a LOT quicker than going through all of the
> actual results, which I am betting is probably the case with Blur as well.
> With Lucene, facets were what I used the most when presenting information
> in GUIs, because that view makes the most sense to people.
>
>>
>> Just trying to help get you what you need right now.
>>
>>
>>> Currently the Blur site lists facets as being something that works
>>> here...
>>>
>>> http://incubator.apache.org/blur/how_it_works.html
>>>
>>> But as this thread kind of pointed out, facets in the sense that faceted
>>> classification describes do not exist right now within Apache Blur. So
>>> someone may want to change that page to say that they are currently on
>>> the todo list or something.
>>>
>>> http://en.wikipedia.org/wiki/Faceted_classification
>>>
>>> A great example I use to show people what facets are is the following
>>> site...
>>>
>>> http://www.fasttech.com/category/1499/consumer-electronics
>>>
>>> On the left side, it is easy to see a breakdown of all the different
>>> Fields/Values associated with the current search query. My intention is
>>> to display facet data for all (or the important ones anyway) of the
>>> fields associated with the current query along with a line graph showing
>>> the count of all matching rows for each time interval. Then the query can
>>> be refined more by querying a specific time range, or field.
>>>
>>> Is a proper facet implementation something that has a somewhat high
>>> priority and will hopefully be at least partially implemented within the
>>> next couple of weeks/months? Or should I just work on processing all the
>>> results myself for now? Also, I notice the default number of query
>>> matches is only 10, and I see no way to specify unlimited. Can I specify
>>> -1 for unlimited or something like that, or do I need to specify a really
>>> large number that will always be higher than the number of actual results
>>> I am expecting... like Long.MAX_VALUE or something?
>>>
>>
>> I agree it is a priority, but my top priority is getting 0.2.1 out the
>> door. If we can decide on the API changes that need to be made in the
>> facet API, we can begin on it in 0.3.0 at any point. Once 0.2.1 is
>> complete I will be turning my focus to 0.3.0; I hope to call for a vote
>> on 0.2.1 in the next week.
>>
>> Ok, so for queries you can page through the results. However, the facet
>> counts reflect the entire answer. You can't ask for all the results back
>> at once due to memory constraints within the system. But you can set the
>> start and fetch (which is the number to fetch) in the BlurQuery object.
>>
>> http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_BlurQuery
>>
> Hmm... yeah, when going through, say, 100,000,000+ rows to generate a
> graph, it is no doubt going to take a long time, re-querying in
> 1,000-result intervals 100,000+ times. If that's for only 5 minutes of
> data, it's a huge amount of processing just to see general statistics of
> the data you have in front of you.
>
> This is where facets became vital for me. I understand that right now
> "facets" in Blur are not really facets; they are instead additional queries
> which get run. I'm not really sure why it was implemented that way, but the
> Lucene documentation (http://lucene.apache.org/core/4_3_0/) links to wiki
> pages about faceted search as well as a user guide explaining what facets
> are, and the implementation in Blur does not match what everything else
> defines facets as.
>
> I'm not sure who implemented facets in the current manner or how that came
> about, but it does not make sense at all or comply with any definition of
> facets I have found. I find it to be a conflict if Blur advertises them
> but does not really have them. Since there is no real documentation about
> facets, other than the feature list saying they are there, it took me a
> while to discover this. For me in particular, this is vital. What use is
> indexing massive amounts of information if you do not have very good
> visibility of it?
>
> As I have mentioned, my use is for storing logged events. Let's say you
> have events for sshd being stored in a table along with the fields Date,
> LoginMethod, IP, User, Server, and Success, and you have a LOT of servers
> being monitored which have a lot of user login activity. In Lucene I would
> do a single query against any of those fields, or perhaps just start with
> matching all records. Along with that query, I would get the facets for
> those fields, using Date to display a time graph of activity for the rows.
> I would then display the top 5-10 facets for each field along with a
> subquery that does just a facet query to display another time graph of the
> Date facets. With this you can instantly see 10 login failures within
> 100,000 successes, how many times each user has logged in and what methods
> were used, etc. This is a simple example, but expand that out to all kinds
> of other information and it's a night-and-day difference in visibility of
> the data.
>
> When trying to view data of any kind in an effective manner, graphing
> always helps, but processing every matching row is obviously inefficient.
> I believe some of the other systems out there, such as Splunk, do that,
> but when I did my own work, I found that too slow and inefficient. Sure,
> it works fine when viewing a small amount of data, but when we are talking
> about big data, which is what Blur is designed for and what I am working
> with, it's just too much overhead. Using facets on date values to produce
> time graphs of entries is pretty much instant, no matter how many
> rows/records match.
>
> In Splunk or other search systems, I would see events populated over time
> in a graph along with the first page of data. The time graph continues to
> fill over time, showing a timeline of data. Depending on your data, this
> can take a seriously long time. This is no doubt doing what you're
> suggesting: processing the data one page at a time and sending it to the
> browser to parse into data stores that display graphs.
> With facet results, I was able to display the historical timelines in the
> same amount of time it took to do a single query along with the facet data.
> From what I have seen so far, there just is no match for Lucene indexes
> along with facet indexes, which is what got me so excited about Blur. I
> myself was literally in the design phase of writing my own implementation
> of a distributed Lucene index system when I decided to stop and check what
> was out there before reinventing the wheel. When I came across the Blur
> project, I found the feature list and looked at two things primarily which
> got me to start working with the project. Those two things were "Fast
> data ingestion" and "Facets". So far, data seems to be getting ingested
> pretty quickly in my virtual box tests, which is good. I am going to be
> scaling up soon once the new hardware requisition is finished. Facets,
> though, are currently stopping me from moving forward on some of the code
> development which requires them, which is why I am so interested in their
> implementation. With looping through records, it could take minutes to get
> proper visibility of data, whereas with facets it takes only a couple of
> seconds, if that.
>
> While waiting, I am probably going to make that IP field type definition I
> mentioned earlier, and possibly some additional ones. Most of the code for
> that seems to make sense, but I'll need to load it up in something other
> than a text editor to really get an appreciation for it. If some of what
> needs to be done for facets can be explained, I'll perhaps see if I can
> dedicate some company time to it.
>
>>
>>> Thanks,
>>> Colton McInroy
>>>
>>> On 10/18/2013 8:40 AM, Colton McInroy wrote:
>>>
>>>> Hello Aaron,
>>>>
>>>>      Yes, that's basically what I was thinking of for the facet results.
>>>> The current implementation doesn't really make any sense if you're coming
>>>> from Lucene. For simplicity and uniformity, I think it should be
>>>> somewhat like it is with Lucene, with adaptation to the way Blur is
>>>> built. I could kind of see something like this...
>>>>
>>>>      public static void queryBlur(String queryString, String table) {
>>>>          Iface client = BlurClient.getClient(
>>>>                  mainConfig.getString("controllers"));
>>>>          Query query = new Query();
>>>>          query.setQuery(queryString);
>>>>
>>>>          Selector selector = new Selector();
>>>>
>>>>          // Fetch all the columns in families "event" and "msg".
>>>>          selector.addToColumnFamiliesToFetch("event");
>>>>          selector.addToColumnFamiliesToFetch("msg");
>>>>
>>>>          BlurQuery blurQuery = new BlurQuery();
>>>>          int matches = 10;
>>>>          // Proposed: one Facet per field, each with its own match count.
>>>>          List<Facet> facets = Arrays.asList(new Facet("field1", matches),
>>>>                  new Facet("field2", matches));
>>>>          blurQuery.setFacets(facets);
>>>>          blurQuery.setFetch(50);
>>>>          blurQuery.setQuery(query);
>>>>          blurQuery.setSelector(selector);
>>>>
>>>>          try {
>>>>              BlurResults results = client.query(table, blurQuery);
>>>>              // Proposed accessor; it is not part of the current API.
>>>>              for (Facet facet : results.getFacetResults()) {
>>>>                  System.out.println(facet.name + " " + facet.value);
>>>>              }
>>>>          } catch (BlurException e) {
>>>>              e.printStackTrace();
>>>>          } catch (TException e) {
>>>>              e.printStackTrace();
>>>>          }
>>>>      }
>>>>
>>>>      Just a brief modification from what I am doing now. Basically I
>>>> just envision a method called getFacetResults which returns List<Facet>,
>>>> with each Facet object containing a "name" and a "value", which would be
>>>> the column name and facet count respectively. I'm just throwing this out
>>>> there for now. This is a different way of implementing the facets than
>>>> Lucene in terms of how the code is accessed, but it would provide the
>>>> same results.
>>>>
>>>>      It could also be done something like this...
>>>>
>>>>      List<Facet> facets = Arrays.asList(new Facet("field1"),
>>>>              new Facet("field2"));
>>>>      blurQuery.setFacets(facets, matches);
>>>>
>>>>      It depends on whether the number of matches should be per facet or
>>>> per query, although I see the merits in being able to specify the matches
>>>> for each field.
>>>>
>>>> Thanks,
>>>> Colton McInroy
>>>>
>>>> On 10/18/2013 5:20 AM, Aaron McCurry wrote:
>>>>
>>>>> I have an issue in Jira to document facets in 0.2.1. It hasn't been
>>>>> worked on yet, but I hope I can get to it soon. It looks like you
>>>>> figured out what is there.
>>>>>
>>>>> We will likely improve facets in 0.3.0, so the API will have to change
>>>>> a bit. The biggest change we will need to make is for the scenario that
>>>>> you bring up. Facets in the current implementation are simply other
>>>>> queries that can range from a single term to a complex query. I'm
>>>>> assuming that you would like to specify a field name and get something
>>>>> like a map of terms to counts for the given facet?
>>>>>
>>>>> The field facetCounts holds the counts for each of the facets in the
>>>>> input list from the query. So the count list corresponds one for one to
>>>>> the facet list in the Query. I realize this is less than ideal, and we
>>>>> are going to be improving it soon.
>>>>>
>>>>> If you have some suggestions on how you would want the facet API to
>>>>> operate, new features, or anything else for that matter, just write up
>>>>> your thoughts on this thread and we can incorporate them into the task.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Aaron
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 18, 2013 at 6:43 AM, Colton McInroy <colton@dosarrest.com>
>>>>> wrote:
>>>>>
>>>>>> Ok, so I created this method...
>>>>>>
>>>>>> public static BlurResults queryBlur(String queryString, String table) {
>>>>>>     Iface client = BlurClient.getClient(
>>>>>>             mainConfig.getString("controllers"));
>>>>>>     Query query = new Query();
>>>>>>     query.setQuery(queryString);
>>>>>>
>>>>>>     Selector selector = new Selector();
>>>>>>
>>>>>>     // Fetch all the columns in families "event" and "msg".
>>>>>>     selector.addToColumnFamiliesToFetch("event");
>>>>>>     selector.addToColumnFamiliesToFetch("msg");
>>>>>>
>>>>>>     BlurQuery blurQuery = new BlurQuery();
>>>>>>     List<Facet> facets = Arrays.asList(new Facet(queryString,
>>>>>>             Long.MAX_VALUE));
>>>>>>     blurQuery.setFacets(facets);
>>>>>>     blurQuery.setFetch(50);
>>>>>>     blurQuery.setQuery(query);
>>>>>>     blurQuery.setSelector(selector);
>>>>>>
>>>>>>     try {
>>>>>>         return client.query(table, blurQuery);
>>>>>>     } catch (BlurException e) {
>>>>>>         e.printStackTrace();
>>>>>>     } catch (TException e) {
>>>>>>         e.printStackTrace();
>>>>>>     }
>>>>>>     // Reached only if the query failed.
>>>>>>     return null;
>>>>>> }
>>>>>>
>>>>>> From reading through the source code, I was able to find out that you
>>>>>> specify facets as a list, but this is fairly confusing to me coming
>>>>>> from Lucene.
>>>>>>
>>>>>> In Lucene, when getting facet data, I specify the facet fields I am
>>>>>> interested in, and the facet results show me a top X list of values
>>>>>> within that field. Whereas with Blur, it appears that a facet is
>>>>>> another query which gives only a number as a result. When I tried to
>>>>>> obtain the facet data I am used to with Lucene, the only thing I could
>>>>>> find was...
>>>>>>
>>>>>> System.out.println("Facet Results: " + results.getFacetCountsSize());
>>>>>> System.out.println(JSONArray.toJSONString(results.getFacetCounts()));
>>>>>>
>>>>>>
>>>>>> Could you please elaborate on this?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Colton McInroy
>>>>>>
>>>>>> On 10/18/2013 3:07 AM, Colton McInroy wrote:
>>>>>>
>>>>>>> I think I wrote this too soon; I believe I just found out how to do
>>>>>>> it. I'll test it out and supply some example code if correct to help
>>>>>>> others.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Colton McInroy
>>>>>>>
>>>>>>> On 10/18/2013 2:58 AM, Colton McInroy wrote:
>>>>>>>
>>>>>>>> Hey Aaron,
>>>>>>>>
>>>>>>>>       You mentioned a while ago that Blur handles facets as well and
>>>>>>>> that you would provide an example. Unless I have missed that email, I
>>>>>>>> haven't seen an example yet; could you provide one? I just took a
>>>>>>>> quick look myself and could not figure it out. I see there is an
>>>>>>>> example, FacetQueryTest.java, in blur-query, but that appears to be
>>>>>>>> basically just a copy of the Lucene file.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>
>
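
For anyone reading this thread later: the post-processing Colton describes in the quoted thread (taking per-second facet counts keyed by YYYYMMDDHHmmss and condensing them to hour or minute buckets) might look roughly like the sketch below. It is plain Java with no Blur or Lucene dependency. The class name FacetRollup, the method rollUp, and the idea of holding the counts in a Map<String, Long> are all assumptions made for illustration, not anything from either project's API.

```java
import java.util.Map;
import java.util.TreeMap;

public class FacetRollup {

    // Prefix lengths of a YYYYMMDDHHmmss string for each resolution.
    public static final int HOUR = 10;   // YYYYMMDDHH
    public static final int MINUTE = 12; // YYYYMMDDHHmm

    /**
     * Sums per-second facet counts into coarser buckets by truncating each
     * timestamp key to its first prefixLength characters.
     */
    public static Map<String, Long> rollUp(Map<String, Long> countsBySecond,
                                           int prefixLength) {
        // TreeMap keeps the buckets in chronological order for plotting.
        Map<String, Long> buckets = new TreeMap<>();
        for (Map.Entry<String, Long> e : countsBySecond.entrySet()) {
            String key = e.getKey();
            if (key.length() < prefixLength) {
                throw new IllegalArgumentException("Timestamp too short: " + key);
            }
            buckets.merge(key.substring(0, prefixLength), e.getValue(), Long::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Made-up sample counts, standing in for real facet results.
        Map<String, Long> counts = new TreeMap<>();
        counts.put("20131021174501", 3L);
        counts.put("20131021174559", 2L);
        counts.put("20131021180000", 7L);

        System.out.println(rollUp(counts, HOUR));   // {2013102117=5, 2013102118=7}
        System.out.println(rollUp(counts, MINUTE)); // {201310211745=5, 201310211800=7}
    }
}
```

The work here is proportional to the number of distinct timestamps that have a facet count rather than to the number of matching rows, which is the reason given in the thread for preferring facets over paging through every result.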

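The interim workaround Aaron suggests in the quoted thread (adding one facet per day by hand, 30-31 for a month) could be generated instead of typed out. The sketch below only builds the query strings; each string would then be wrapped in a Facet the same way the quoted queryBlur method does. The field name "event.Date" is hypothetical, and the range syntax [a TO b] is the classic Lucene query parser form, which I am assuming (not confirming) that Blur accepts for this field type.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class DailyFacetQueries {

    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyyMMdd");

    /**
     * Builds one range query per day of the given month against a field that
     * stores timestamps as YYYYMMDDHHmmss strings.
     */
    public static List<String> forMonth(String field, YearMonth month) {
        List<String> queries = new ArrayList<>();
        for (int d = 1; d <= month.lengthOfMonth(); d++) {
            String day = month.atDay(d).format(DAY);
            // Inclusive range from the first to the last second of the day.
            queries.add(field + ":[" + day + "000000 TO " + day + "235959]");
        }
        return queries;
    }

    public static void main(String[] args) {
        // "event.Date" is a hypothetical family.column name.
        List<String> queries = forMonth("event.Date", YearMonth.of(2013, 10));
        System.out.println(queries.size() + " facet queries");
        System.out.println(queries.get(0));
    }
}
```

Since the facet counts come back in the same order as the facet list, the position of each count tells you which day it belongs to.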