lucene-java-user mailing list archives

From Gergely Nagy <foge...@gmail.com>
Subject Re: Indexing and searching a DateTime range
Date Thu, 12 Feb 2015 01:03:21 GMT
Thank you Uwe!

Your reply is very useful and insightful. Your workflow matches my
requirements exactly.

My confusion came from the fact that I didn't understand what the
Analyzers do. Actually, I am still wondering: wouldn't it be possible to
provide an abstraction on the Lucene side, so that this preprocessing
becomes part of Lucene's own index-processing mechanism?

It seems quite odd that I have to do something which is strongly coupled to
Lucene (building the index), outside Lucene.

In any case, I will take your advice. I hope other people will also find
this info useful.

Best regards,
Gergely Nagy

2015-02-10 18:35 GMT+09:00 Uwe Schindler <uwe@thetaphi.de>:

> Hi,
>
> > OK. I found the Alfresco code on GitHub. So it's open source it seems.
> >
> > And I found the DateTimeAnalyser, so I will just take that code as a
> > starting point:
> >
> > https://github.com/lsbueno/alfresco/tree/master/root/projects/repository/source/java/org/alfresco/repo/search/impl/lucene/analysis
>
> This won't help you:
> a) it's outdated code from very early Lucene versions
> b) it doesn't use Lucene's numeric features, so searching for date ranges
> with it would be very slow
>
> Basically, I don't really understand your problem:
> If you use Lucene directly you are responsible for processing the text
> before it goes into the index. If you want to create a Lucene Document per
> line, it is up to you to do so. Lucene has no functionality to split
> documents. You have to process your input and bring it into a format that
> Lucene wants: "Documents" consisting of "Key/Value" pairs. Analyzers are
> only there for processing one specific field and tokenize the input (so the
> index contains words and not the whole field as one term). Analyzers have
> nothing to do with Analysis of the structure of Log lines (because they
> would only work on one field, which does not help for structured queries
> like on date).
>
> So basically your indexing workflow is:
>
> - Open Log file
> - Read log file line by line
> - Create a Lucene Document instance
> - Extract "interesting" key/value pairs from your log file, e.g. by using
> regular expressions (like Logstash does). Basically this would for example
> "detect" the date, class name from Log4J files, or whatever else
> - Put those key/value pairs as fields (numeric, text, ...) onto the Lucene
> Document: one field for the date, one for the message content, one
> field for classname,... (those fields don't need to be stored, unless you
> want to display only them in search results, see below).
> - In addition, it is wise to add a Lucene TextField instance (stored and
> indexed, with a good Analyzer) that contains the whole (redundant) line. By
> storing it, you are able to return the whole log line in your search
> results
> - Index the document
> - Process next line
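A minimal sketch of that per-line extraction step in plain Java; the regex, the field names, and the `LogLineParser` class are illustrative assumptions, not part of the original message:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for lines like "2015-02-08 00:02:06.852Z INFO message".
public class LogLineParser {

    // Captures the timestamp prefix, the log level, and the rest of the line.
    private static final Pattern LINE = Pattern.compile(
        "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}Z) (\\w+) (.*)$");

    // Converts the textual timestamp to epoch milliseconds (UTC).
    public static long toMillis(String ts) throws ParseException {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S'Z'");
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.parse(ts).getTime();
    }

    // Returns the key/value pairs that would become Lucene fields:
    // a numeric "timestamp", a "level" keyword, and the raw "line" for storage.
    public static Map<String, Object> parse(String line) throws ParseException {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new ParseException("unrecognized line: " + line, 0);
        }
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("timestamp", toMillis(m.group(1)));
        fields.put("level", m.group(2));
        fields.put("line", line);
        return fields;
    }
}
```

Each returned map would then be turned into one Lucene Document: the timestamp as a numeric field, the level as a keyword field, and the whole line as a stored TextField.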
>
> If you don't want to write this code on your own, use Logstash and
> Elasticsearch (or write a separate plugin for Logstash that indexes to
> Lucene). But your comment is strange: you say Elasticsearch and Logstash are
> too slow for many log lines. How, then, would Lucene be faster?
> Elasticsearch also uses Lucene under the hood. If it is slow, the main
> problem is in most cases incorrect data types at indexing time (like using a
> text field for dates and then doing range queries). It is the same as
> indexing a number in a relational database as a String and then doing "like"
> queries instead of real numeric comparisons - just wrong and slow.
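The relational-database analogy can be seen directly in Java: lexicographic (string) comparison does not order numbers the way numeric comparison does, which is why the wrong field type makes range queries both wrong and slow. A tiny illustrative check, not from the original message:

```java
// Strings compare character by character, so "9" sorts AFTER "10"
// (because '9' > '1'), while longs compare by numeric value.
public class StringVsNumericOrder {
    public static void main(String[] args) {
        System.out.println("9".compareTo("10") > 0);   // lexicographic: wrong numeric order
        System.out.println(Long.compare(9L, 10L) < 0); // numeric: correct order
    }
}
```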
>
> Uwe
>
> > Thank you to everybody for taking the time to respond.
> >
> > 2015-02-10 9:55 GMT+09:00 Gergely Nagy <fogetti@gmail.com>:
> >
> > > Thank you Barry, I really appreciate your time to respond,
> > >
> > > Let me clarify this a little bit more. I think it was not clear.
> > >
> > > I know how to parse dates, this is not the question here. (See my
> > > previous
> > > email: "how can I pipe my converter logic into the indexing process?")
> > >
> > > All of your solutions would work fine if I wanted to index
> > > per-document, which I do NOT want to do. What I would like to do is
> > > index per log line.
> > >
> > > I need to do a full text search, but with the additional requirement
> > > to filter those search hits by DateTime range.
> > >
> > > I hope this makes it clearer. So any suggestions how to do that?
> > >
> > > Sidenote: I saw that Alfresco implemented this analyzer, called
> > > DateTimeAnalyzer, but Alfresco is not open source. So I was wondering
> > > how to implement the same. Actually, after wondering for 2 days, I
> > > became convinced that writing an Analyzer should be the way to go. I
> > > will post my solution later if I have working code.
> > >
> > > 2015-02-10 8:50 GMT+09:00 Barry Coughlan <b.coughlan2@gmail.com>:
> > >
> > >> Hi Gergely,
> > >>
> > >> Writing an analyzer would work but it is unnecessarily complicated.
> > >> You could just parse the date from the string in your input code and
> > >> index it in the LongField like this:
> > >>
> > >> SimpleDateFormat format =
> > >>     new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S'Z'");
> > >> format.setTimeZone(TimeZone.getTimeZone("UTC"));
> > >> long t = format.parse("2015-02-08 00:02:06.123Z INFO...").getTime();
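The same conversion applies at query time: turn both endpoints of the requested range into longs and hand them to a numeric range query (NumericRangeQuery.newLongRange in Lucene 4.x). A sketch of just the endpoint conversion, using the thread's two example timestamps; the `RangeEndpoints` class is an illustrative assumption:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Converts both endpoints of a requested time range to epoch milliseconds,
// ready for NumericRangeQuery.newLongRange("timestamp", lower, upper, true, true).
public class RangeEndpoints {
    public static long toMillis(String ts) throws ParseException {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S'Z'");
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.parse(ts).getTime();
    }

    public static void main(String[] args) throws ParseException {
        long lower = toMillis("2015-02-08 00:02:06.852Z");
        long upper = toMillis("2015-02-08 18:02:04.012Z");
        System.out.println(upper > lower); // the range is well-ordered
    }
}
```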
> > >>
> > >> Barry
> > >>
> > >> > On Tue, Feb 10, 2015 at 12:21 AM, Gergely Nagy <fogetti@gmail.com> wrote:
> > >>
> > >> > Thank you for taking the time to respond, Karthik.
> > >> >
> > >> > Can you show me an example of how to convert a DateTime to milliseconds?
> > >> > I mean, how can I pipe my converter logic into the indexing process?
> > >> >
> > >> > I suspect I need to write my own Analyzer/Tokenizer to achieve
> > >> > this. Is this correct?
> > >> >
> > >> > 2015-02-09 22:58 GMT+09:00 KARTHIK SHIVAKUMAR <nskarthik.k@gmail.com>:
> > >> >
> > >> > > Hi
> > >> > >
> > >> > > A long time ago, I used to store datetimes in milliseconds.
> > >> > >
> > >> > > TermRangeQuery used to work perfectly....
> > >> > >
> > >> > > Convert all datetimes to milliseconds and index the same.
> > >> > >
> > >> > > At search time, again convert the datetime to milliseconds and use
> > >> > > TermRangeQuery.
> > >> > >
> > >> > > With regards
> > >> > > Karthik
> > >> > > On Feb 9, 2015 1:24 PM, "Gergely Nagy" <fogetti@gmail.com> wrote:
> > >> > >
> > >> > > > Hi Lucene users,
> > >> > > >
> > >> > > > I am in the beginning of implementing a Lucene application
> > >> > > > which would supposedly search through some log files.
> > >> > > >
> > >> > > > One of the requirements is to return results within a time range.
> > >> > > > Let's say these are two lines in a series of log files:
> > >> > > > 2015-02-08 00:02:06.852Z INFO...
> > >> > > > ...
> > >> > > > 2015-02-08 18:02:04.012Z INFO...
> > >> > > >
> > >> > > > Now I need to search for these lines and return all the text
> > >> > > > in between. I was using this demo application to build an index:
> > >> > > >
> > >> > > > http://lucene.apache.org/core/4_10_3/demo/src-html/org/apache/lucene/demo/IndexFiles.html
> > >> > > >
> > >> > > > After that my first thought was using a term range query like this:
> > >> > > >
> > >> > > >     TermRangeQuery query = TermRangeQuery.newStringRange("contents",
> > >> > > >         "2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z",
> > >> > > >         true, true);
> > >> > > >
> > >> > > > But for some reason this didn't return any results.
> > >> > > >
> > >> > > > Then I was Googling for a while for how to solve this problem, but
> > >> > > > all the datetime examples I found search on a much simpler field.
> > >> > > > Those examples usually use a field like this:
> > >> > > >
> > >> > > >     doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
> > >> > > >
> > >> > > > So I was wondering, how can I index these log files to make a
> > >> > > > range query work on them? Any ideas? Maybe my approach is
> > >> > > > completely wrong. I am still new to Lucene, so any help is
> > >> > > > appreciated.
> > >> > > >
> > >> > > > Thank you.
> > >> > > >
> > >> > > > Gergely Nagy
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
