The differences between using phrase queries and using shingles mainly involve performance and scoring.

When you run a phrase query (say "foo bar") in the typical case where only single words are in the index, the query has to walk the inverted index for "foo" and for "bar", find the documents that contain both terms, and then walk the positions lists within each of those documents to find the places where "foo" appears right before "bar".

This has some cost to both performance and scoring. First, positions (.prx) must be indexed and searched; this is like an additional "dimension" of the inverted index, and it increases indexing and search times. Second, because only individual terms appear in the inverted index, no real "phrase IDF" is computed (this might not affect you). Instead, it is approximated based on the sum of the term IDFs.

On the other hand, if you use shingles, you are also indexing word n-grams. In other words, if you are shingling up to size 2, you will also have terms like "foo bar" in the index. For example, many tokenizers convert the lazy dog to the, lazy, dog. You can use the shingle filter to add two-word shingles to this stream: the, the lazy, lazy, lazy dog, dog. By default, the shingle token filter outputs two-word shingles and unigrams. Shingles are often used to help speed up phrase queries, such as match_phrase. With shingles in the index, the phrase query "foo bar" is parsed as a simple TermQuery, without using any positions lists. And since "foo bar" is now a "real term", the phrase IDF is exact, because we know exactly how many documents contain this term.

But using shingles has some costs as well. The term dictionary, term index, and postings lists all grow, though this might be a fair tradeoff, especially if you disable positions entirely with tIndexOptions. There is some additional cost during the analysis phase of indexing, although ShingleFilter is nicely optimized and pretty fast. And there is no obvious way to compute "sloppy phrase queries" or inexact phrase matches, although this can be approximated: for a phrase of "foo bar baz" with shingles of size 2, you will have two tokens, foo_bar and bar_baz, and you could implement the search via some of Lucene's other queries (like BooleanQuery) for an inexact approximation.

In general, indexing word n-grams with things like Shingles or CommonGrams is just a tradeoff (a fairly expert one) to reduce the cost of positional queries or to enhance phrase scoring.
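To make the comparison concrete, here is a minimal Python sketch of the two approaches. It is a toy in-memory model, not Lucene's or Elasticsearch's implementation: the function names (shingles, build_index, phrase_query, approximate_phrase) and the sample documents are hypothetical and exist only for illustration. It shows the positional walk a phrase query needs on a single-word index, the plain term lookup that replaces it once two-word shingles are indexed, and a rough BooleanQuery-style approximation of an inexact phrase match.

```python
from collections import defaultdict


def shingles(tokens, size=2, unigrams=True, sep=" "):
    """Return tokens interleaved with word n-grams of up to `size` words."""
    out = []
    for i in range(len(tokens)):
        if unigrams:
            out.append(tokens[i])
        for n in range(2, size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out


def build_index(docs, analyze):
    """Toy inverted index: term -> {doc_id: [positions in the token stream]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, first, second):
    """Positional match: docs where `first` is immediately followed by `second`."""
    a, b = index.get(first, {}), index.get(second, {})
    hits = set()
    for doc_id in a.keys() & b.keys():      # docs containing both terms
        following = set(b[doc_id])
        if any(p + 1 in following for p in a[doc_id]):  # walk the positions
            hits.add(doc_id)
    return hits


def approximate_phrase(shingle_index, words):
    """Inexact match: count how many of the phrase's shingles each doc contains."""
    scores = defaultdict(int)
    for term in shingles(words, unigrams=False):
        for doc_id in shingle_index.get(term, {}):
            scores[doc_id] += 1
    return dict(scores)


docs = {1: "foo bar baz", 2: "bar foo baz", 3: "foo baz bar", 4: "foo bar qux"}
word_index = build_index(docs, str.split)
shingle_index = build_index(docs, lambda text: shingles(text.split()))

print(shingles(["the", "lazy", "dog"]))
# ['the', 'the lazy', 'lazy', 'lazy dog', 'dog']
print(shingles(["foo", "bar", "baz"], unigrams=False, sep="_"))
# ['foo_bar', 'bar_baz']
print(phrase_query(word_index, "foo", "bar"))             # {1, 4}
print(set(shingle_index.get("foo bar", {})))              # {1, 4}
print(approximate_phrase(shingle_index, ["foo", "bar", "baz"]))
# {1: 2, 4: 1}
```

Both lookups return the same documents, but the shingle version never touches positions, and the number of documents listed under "foo bar" is exactly the document frequency an exact phrase IDF would need. The price is visible too: the shingle index holds nearly twice as many term occurrences as the word index.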