Commit 2e0ed04

[DOCS] Rewrite analysis intro (#51184)
* [DOCS] Rewrite analysis intro. Move index/search analysis content.
* Rewrites 'Text analysis' page intro as high-level definition. Adds guidance
  on when users should configure text analysis
* Rewrites and splits index/search analysis content:
** Conceptual content -> 'Index and search analysis' under 'Concepts'
** Task-based content -> 'Specify an analyzer' under 'Configure...'
* Adds detailed examples for when to use the same index/search analyzer and
  when not.
* Adds new example snippets for specifying search analyzers
* clarifications
* Add toc. Decrement headings.
* Reword 'When to configure' section
* Remove sentence from tip
1 parent 3ba58c9 commit 2e0ed04

File tree

5 files changed: +411 additions, -131 deletions


docs/reference/analysis.asciidoc

Lines changed: 27 additions & 129 deletions
@@ -4,141 +4,40 @@
 [partintro]
 --
 
-_Text analysis_ is the process of converting text, like the body of any email,
-into _tokens_ or _terms_ which are added to the inverted index for searching.
-Analysis is performed by an <<analysis-analyzers,_analyzer_>> which can be
-either a built-in analyzer or a <<analysis-custom-analyzer,`custom`>> analyzer
-defined per index.
+_Text analysis_ is the process of converting unstructured text, like
+the body of an email or a product description, into a structured format that's
+optimized for search.
 
 [float]
-== Index time analysis
+[[when-to-configure-analysis]]
+=== When to configure text analysis
 
-For instance, at index time the built-in <<english-analyzer,`english`>> _analyzer_
-will first convert the sentence:
+{es} performs text analysis when indexing or searching <<text,`text`>> fields.
 
-[source,text]
-------
-"The QUICK brown foxes jumped over the lazy dog!"
-------
+If your index doesn't contain `text` fields, no further setup is needed; you can
+skip the pages in this section.
 
-into distinct tokens. It will then lowercase each token, remove frequent
-stopwords ("the") and reduce the terms to their word stems (foxes -> fox,
-jumped -> jump, lazy -> lazi). In the end, the following terms will be added
-to the inverted index:
+However, if you use `text` fields or your text searches aren't returning results
+as expected, configuring text analysis can often help. You should also look into
+analysis configuration if you're using {es} to:
 
-[source,text]
-------
-[ quick, brown, fox, jump, over, lazi, dog ]
-------
+* Build a search engine
+* Mine unstructured data
+* Fine-tune search for a specific language
+* Perform lexicographic or linguistic research
 
 [float]
-[[specify-index-time-analyzer]]
-=== Specifying an index time analyzer
-
-{es} determines which index-time analyzer to use by
-checking the following parameters in order:
-
-. The <<analyzer,`analyzer`>> mapping parameter of the field
-. The `default` analyzer parameter in the index settings
-
-If none of these parameters are specified, the
-<<analysis-standard-analyzer,`standard` analyzer>> is used.
-
-[discrete]
-[[specify-index-time-field-analyzer]]
-==== Specify the index-time analyzer for a field
-
-Each <<text,`text`>> field in a mapping can specify its own
-<<analyzer,`analyzer`>>:
-
-[source,console]
--------------------------
-PUT my_index
-{
-  "mappings": {
-    "properties": {
-      "title": {
-        "type": "text",
-        "analyzer": "standard"
-      }
-    }
-  }
-}
--------------------------
-
-[discrete]
-[[specify-index-time-default-analyzer]]
-==== Specify a default index-time analyzer
-
-When <<indices-create-index,creating an index>>, you can set a default
-index-time analyzer using the `default` analyzer setting:
-
-[source,console]
-----
-PUT my_index
-{
-  "settings": {
-    "analysis": {
-      "analyzer": {
-        "default": {
-          "type": "whitespace"
-        }
-      }
-    }
-  }
-}
-----
-
-A default index-time analyzer is useful when mapping multiple `text` fields that
-use the same analyzer. It's also used as a general fallback analyzer for both
-index-time and search-time analysis.
-
-[float]
-== Search time analysis
-
-This same analysis process is applied to the query string at search time in
-<<full-text-queries,full text queries>> like the
-<<query-dsl-match-query,`match` query>>
-to convert the text in the query string into terms of the same form as those
-that are stored in the inverted index.
-
-For instance, a user might search for:
-
-[source,text]
-------
-"a quick fox"
-------
-
-which would be analysed by the same `english` analyzer into the following terms:
-
-[source,text]
-------
-[ quick, fox ]
-------
-
-Even though the exact words used in the query string don't appear in the
-original text (`quick` vs `QUICK`, `fox` vs `foxes`), because we have applied
-the same analyzer to both the text and the query string, the terms from the
-query string exactly match the terms from the text in the inverted index,
-which means that this query would match our example document.
-
-[float]
-=== Specifying a search time analyzer
-
-Usually the same analyzer should be used both at
-index time and at search time, and <<full-text-queries,full text queries>>
-like the <<query-dsl-match-query,`match` query>> will use the mapping to look
-up the analyzer to use for each field.
-
-The analyzer to use to search a particular field is determined by
-looking for:
-
-* An `analyzer` specified in the query itself.
-* The <<search-analyzer,`search_analyzer`>> mapping parameter.
-* The <<analyzer,`analyzer`>> mapping parameter.
-* An analyzer in the index settings called `default_search`.
-* An analyzer in the index settings called `default`.
-* The `standard` analyzer.
+[[analysis-toc]]
+=== In this section
+
+* <<analysis-overview>>
+* <<analysis-concepts>>
+* <<configure-text-analysis>>
+* <<analysis-analyzers>>
+* <<analysis-tokenizers>>
+* <<analysis-tokenfilters>>
+* <<analysis-charfilters>>
+* <<analysis-normalizers>>
 
 --
 
@@ -156,5 +55,4 @@ include::analysis/tokenfilters.asciidoc[]
 
 include::analysis/charfilters.asciidoc[]
 
-include::analysis/normalizers.asciidoc[]
-
+include::analysis/normalizers.asciidoc[]

docs/reference/analysis/concepts.asciidoc

Lines changed: 3 additions & 1 deletion
@@ -7,5 +7,7 @@
 This section explains the fundamental concepts of text analysis in {es}.
 
 * <<analyzer-anatomy>>
+* <<analysis-index-search-time>>
 
-include::anatomy.asciidoc[]
+include::anatomy.asciidoc[]
+include::index-search-time.asciidoc[]

docs/reference/analysis/configure-text-analysis.asciidoc

Lines changed: 4 additions & 1 deletion
@@ -20,10 +20,13 @@ the process.
 * <<test-analyzer>>
 * <<configuring-analyzers>>
 * <<analysis-custom-analyzer>>
+* <<specify-analyzer>>
 
 
 include::testing.asciidoc[]
 
 include::analyzers/configuring.asciidoc[]
 
-include::analyzers/custom-analyzer.asciidoc[]
+include::analyzers/custom-analyzer.asciidoc[]
+
+include::specify-analyzer.asciidoc[]
Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
+[[analysis-index-search-time]]
+=== Index and search analysis
+
+Text analysis occurs at two times:
+
+Index time::
+When a document is indexed, any <<text,`text`>> field values are analyzed.
+
+Search time::
+When running a <<full-text-queries,full-text search>> on a `text` field,
+the query string (the text the user is searching for) is analyzed.
++
+Search time is also called _query time_.
+
+The analyzer, or set of analysis rules, used at each time is called the _index
+analyzer_ or _search analyzer_ respectively.
+
+[[analysis-same-index-search-analyzer]]
+==== How the index and search analyzer work together
+
+In most cases, the same analyzer should be used at index and search time. This
+ensures the values and query strings for a field are changed into the same form
+of tokens. In turn, this ensures the tokens match as expected during a search.
+
+.**Example**
+[%collapsible]
+====
+
+A document is indexed with the following value in a `text` field:
+
+[source,text]
+------
+The QUICK brown foxes jumped over the dog!
+------
+
+The index analyzer for the field converts the value into tokens and normalizes
+them. In this case, each of the tokens represents a word:
+
+[source,text]
+------
+[ quick, brown, fox, jump, over, dog ]
+------
+
+These tokens are then indexed.
+
+Later, a user searches the same `text` field for:
+
+[source,text]
+------
+"Quick fox"
+------
+
+The user expects this search to match the sentence indexed earlier,
+`The QUICK brown foxes jumped over the dog!`.
+
+However, the query string does not contain the exact words used in the
+document's original text:
+
+* `quick` vs `QUICK`
+* `fox` vs `foxes`
+
+To account for this, the query string is analyzed using the same analyzer. This
+analyzer produces the following tokens:
+
+[source,text]
+------
+[ quick, fox ]
+------
+
+To execute the search, {es} compares these query string tokens to the tokens
+indexed in the `text` field.
+
+[options="header"]
+|===
+|Token | Query string | `text` field
+|`quick` | X | X
+|`brown` | | X
+|`fox` | X | X
+|`jump` | | X
+|`over` | | X
+|`dog` | | X
+|===
+
+Because the field value and query string were analyzed in the same way, they
+created similar tokens. The tokens `quick` and `fox` are exact matches. This
+means the search matches the document containing `"The QUICK brown foxes jumped
+over the dog!"`, just as the user expects.
+====
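The matching walk-through above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not {es} internals: `analyze()` is a toy stand-in for a real analyzer, and the `STEMS` lookup table is an assumed substitute for an actual stemming algorithm.

```python
# Toy sketch of index-time vs. search-time analysis (assumed simplifications,
# not Elasticsearch internals): lowercase, strip punctuation, drop the
# stopword "the", and stem via a hypothetical lookup table.
import re

STOPWORDS = {"the"}
STEMS = {"foxes": "fox", "jumped": "jump"}  # assumed stem mappings

def analyze(text: str) -> list[str]:
    """Convert text into normalized tokens, as an analyzer would."""
    words = re.findall(r"[a-z]+", text.lower())
    return [STEMS.get(w, w) for w in words if w not in STOPWORDS]

# Index time: analyze the field value and index its tokens.
indexed = set(analyze("The QUICK brown foxes jumped over the dog!"))

# Search time: analyze the query string with the SAME analyzer.
query_tokens = analyze("Quick fox")

# The document matches: every query token appears among the indexed tokens.
matches = all(tok in indexed for tok in query_tokens)
```

Because both strings pass through the same pipeline, `QUICK` and `Quick` both become `quick`, and `foxes` and `fox` both become `fox`, so the query tokens line up with the indexed tokens exactly as in the table above.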
+
+[[different-analyzers]]
+==== When to use a different search analyzer
+
+While less common, it sometimes makes sense to use different analyzers at index
+and search time. To enable this, {es} allows you to
+<<specify-search-analyzer,specify a separate search analyzer>>.
+
+Generally, a separate search analyzer should only be specified when using the
+same form of tokens for field values and query strings would create unexpected
+or irrelevant search matches.
+
+[[different-analyzer-ex]]
+.*Example*
+[%collapsible]
+====
+{es} is used to create a search engine that matches only words that start with
+a provided prefix. For instance, a search for `tr` should return `tram` or
+`trope`, but never `taxi` or `bat`.
+
+A document is added to the search engine's index; this document contains one
+such word in a `text` field:
+
+[source,text]
+------
+"Apple"
+------
+
+The index analyzer for the field converts the value into tokens and normalizes
+them. In this case, each of the tokens represents a potential prefix for
+the word:
+
+[source,text]
+------
+[ a, ap, app, appl, apple ]
+------
+
+These tokens are then indexed.
+
+Later, a user searches the same `text` field for:
+
+[source,text]
+------
+"appli"
+------
+
+The user expects this search to match only words that start with `appli`,
+such as `appliance` or `application`. The search should not match `apple`.
+
+However, if the index analyzer is used to analyze this query string, it would
+produce the following tokens:
+
+[source,text]
+------
+[ a, ap, app, appl, appli ]
+------
+
+When {es} compares these query string tokens to the ones indexed for `apple`,
+it finds several matches.
+
+[options="header"]
+|===
+|Token | `appli` | `apple`
+|`a` | X | X
+|`ap` | X | X
+|`app` | X | X
+|`appl` | X | X
+|`appli` | X |
+|`apple` | | X
+|===
+
+This means the search would erroneously match `apple`. Not only that, it would
+match any word starting with `a`.
+
+To fix this, you can specify a different search analyzer for query strings used
+on the `text` field.
+
+In this case, you could specify a search analyzer that produces a single token
+rather than a set of prefixes:
+
+[source,text]
+------
+[ appli ]
+------
+
+This query string token would only match tokens for words that start with
+`appli`, which better aligns with the user's search expectations.
+====
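The prefix-search scenario above can also be sketched in Python. This is a minimal illustration under stated assumptions, not {es} internals: `edge_ngrams()` stands in for an edge n-gram index analyzer, and `search_analyze()` is a hypothetical search analyzer that emits one lowercased token instead of a set of prefixes.

```python
# Toy sketch of using DIFFERENT analyzers at index and search time for
# prefix search. Both functions are simplified stand-ins, not real
# Elasticsearch analyzers.

def edge_ngrams(word: str) -> list[str]:
    """Index analyzer: emit every prefix of the word as a token."""
    w = word.lower()
    return [w[:i] for i in range(1, len(w) + 1)]

def search_analyze(query: str) -> str:
    """Search analyzer: a single lowercased token, no n-gramming."""
    return query.lower()

indexed = set(edge_ngrams("Apple"))  # {'a', 'ap', 'app', 'appl', 'apple'}

# With the single-token search analyzer, "appli" no longer matches "apple",
# because the whole query token is compared against the indexed prefixes...
bad_match = search_analyze("appli") in indexed

# ...but it still matches words that truly start with "appli".
prefix_match = search_analyze("appli") in set(edge_ngrams("application"))
```

Had the query string been run through `edge_ngrams()` instead, its prefixes `a`, `ap`, `app`, and `appl` would all have matched the tokens indexed for `apple`, reproducing the erroneous matches shown in the table.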
