== Text analysis overview
++++
<titleabbrev>Overview</titleabbrev>
++++

Text analysis enables {es} to perform full-text search, where the search returns
all _relevant_ results rather than just exact matches.

If you search for `Quick fox jumps`, you probably want the document that
contains `A quick brown fox jumps over the lazy dog`, and you might also want
documents that contain related words like `fast fox` or `foxes leap`.

[discrete]
[[tokenization]]
=== Tokenization

Analysis makes full-text search possible through _tokenization_: breaking a text
down into smaller chunks, called _tokens_. In most cases, these tokens are
individual words.

If you index the phrase `the quick brown fox jumps` as a single string and the
user searches for `quick fox`, it isn't considered a match. However, if you
tokenize the phrase and index each word separately, the terms in the query
string can be looked up individually. This means they can be matched by searches
for `quick fox`, `fox brown`, or other variations.

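To see tokenization on its own, you can run a tokenizer directly with the
`_analyze` API. This minimal sketch uses the `standard` tokenizer with no token
filters, so the tokens come back exactly as they appear in the text:

[source,console]
----
GET /_analyze
{
  "tokenizer": "standard",
  "text": "The quick brown fox jumps."
}
----

The response contains one token per word: `The`, `quick`, `brown`, `fox`, and
`jumps`. Note that `The` keeps its capital letter; tokenization alone doesn't
normalize anything.
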
[discrete]
[[normalization]]
=== Normalization

Tokenization enables matching on individual terms, but each token is still
matched literally. This means:

* A search for `Quick` would not match `quick`, even though you likely want
either term to match the other.

* Although `fox` and `foxes` share the same root word, a search for `foxes`
would not match `fox`, or vice versa.

* A search for `jumps` would not match `leaps`. While they don't share a root
word, they are synonyms and have a similar meaning.

To solve these problems, text analysis can _normalize_ these tokens into a
standard format. This allows you to match tokens that are not exactly the same
as the search terms, but similar enough to still be relevant. For example:

* `Quick` can be lowercased: `quick`.

* `foxes` can be _stemmed_, or reduced to its root word: `fox`.

* `jump` and `leap` are synonyms and can be indexed as a single word: `jump`.

To ensure search terms match these words as intended, you can apply the same
tokenization and normalization rules to the query string. For example, a search
for `Foxes leap` can be normalized to a search for `fox jump`.

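You can experiment with normalization by chaining token filters onto a
tokenizer in the `_analyze` API. The sketch below is one possible combination,
assuming an English stemmer and a hand-written `leap => jump` synonym rule
(both are illustrative choices, not fixed requirements):

[source,console]
----
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stemmer", "language": "english" },
    { "type": "synonym", "synonyms": [ "leap => jump" ] }
  ],
  "text": "Foxes leap"
}
----

With these rules, `Foxes leap` is analyzed into the two tokens `fox` and
`jump`, so it matches documents that were indexed with the same analysis.
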
[discrete]
[[analysis-customization]]
=== Customize text analysis

Text analysis is performed by an <<analyzer-anatomy,_analyzer_>>, a set of rules
that govern the entire process.

{es} includes a default analyzer, called the
<<analysis-standard-analyzer,standard analyzer>>, which works well for most use
cases right out of the box.

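For example, you can check what the default analyzer does with a piece of text
by naming it in the `_analyze` API:

[source,console]
----
GET /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown foxes!"
}
----

The standard analyzer splits on word boundaries, drops most punctuation, and
lowercases, so the response contains the tokens `the`, `quick`, `brown`, and
`foxes`.
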
If you want to tailor your search experience, you can choose a different
<<analysis-analyzers,built-in analyzer>> or even
<<analysis-custom-analyzer,configure a custom one>>. A custom analyzer gives you
control over each step of the analysis process, as sketched in the example
after this list, including:

* Changes to the text _before_ tokenization

* How text is converted to tokens

* Normalization changes made to tokens before indexing or search
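
The request below is a minimal sketch of how those steps map to analyzer
settings. It creates a hypothetical index named `my-index` (a placeholder) with
a custom analyzer, `my_custom_analyzer`, that strips HTML before tokenization
(`char_filter`), splits the text with the `standard` tokenizer, and then
lowercases and ASCII-folds each token (`filter`):

[source,console]
----
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}
----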