Keyword marker token filter
Marks specified tokens as keywords, which are not stemmed.
The keyword_marker filter assigns specified tokens a keyword attribute of true. Stemmer token filters, such as stemmer or porter_stem, skip tokens with a keyword attribute of true.
To work properly, the keyword_marker filter must be listed before any stemmer token filters in the analyzer configuration.
The keyword_marker filter uses Lucene’s KeywordMarkerFilter.
To see how the keyword_marker filter works, you first need to produce a token stream containing stemmed tokens.
The following analyze API request uses the stemmer filter to create stemmed tokens for fox running and jumping.
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [ "stemmer" ],
"text": "fox running and jumping"
}
The request produces the following tokens. Note that running was stemmed to run and jumping was stemmed to jump.
[ fox, run, and, jump ]
To prevent jumping from being stemmed, add the keyword_marker filter before the stemmer filter in the previous analyze API request. Specify jumping in the keywords parameter of the keyword_marker filter.
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "keyword_marker",
"keywords": [ "jumping" ]
},
"stemmer"
],
"text": "fox running and jumping"
}
The request produces the following tokens. running is still stemmed to run, but jumping is not stemmed.
[ fox, run, and, jumping ]
To see the keyword attribute for these tokens, add the following arguments to the analyze API request:
explain: true
attributes: keyword
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "keyword_marker",
"keywords": [ "jumping" ]
},
"stemmer"
],
"text": "fox running and jumping",
"explain": true,
"attributes": "keyword"
}
The API returns the following response. Note the jumping token has a keyword attribute of true.
{
"detail": {
"custom_analyzer": true,
"charfilters": [],
"tokenizer": {
"name": "whitespace",
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "running",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2
},
{
"token": "jumping",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3
}
]
},
"tokenfilters": [
{
"name": "__anonymous__keyword_marker",
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0,
"keyword": false
},
{
"token": "running",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1,
"keyword": false
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2,
"keyword": false
},
{
"token": "jumping",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3,
"keyword": true
}
]
},
{
"name": "stemmer",
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0,
"keyword": false
},
{
"token": "run",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1,
"keyword": false
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2,
"keyword": false
},
{
"token": "jumping",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3,
"keyword": true
}
]
}
]
}
}
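The explain output also illustrates why filter order matters. If the stemmer filter were listed before the keyword_marker filter, the tokens would already be stemmed before the keyword attribute could be set. A minimal sketch of the reversed ordering, reusing the same text and keyword:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "stemmer",
    {
      "type": "keyword_marker",
      "keywords": [ "jumping" ]
    }
  ],
  "text": "fox running and jumping"
}

Because the stemmer has already reduced jumping to jump, the keyword_marker filter finds no matching token to mark, and the output is [ fox, run, and, jump ].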
The keyword_marker filter supports the following configurable parameters:

ignore_case
(Optional, Boolean) If true, matching for the keywords and keywords_path parameters ignores letter case. Defaults to false.

keywords
(Required*, array of strings) Array of keywords. Tokens that match these keywords are not stemmed.
This parameter, keywords_path, or keywords_pattern must be specified. You cannot specify this parameter and keywords_pattern.

keywords_path
(Required*, string) Path to a file that contains a list of keywords. Tokens that match these keywords are not stemmed.
This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each word in the file must be separated by a line break.
This parameter, keywords, or keywords_pattern must be specified. You cannot specify this parameter and keywords_pattern.

keywords_pattern
(Required*, string) Java regular expression used to match tokens. Tokens that match this expression are marked as keywords and not stemmed. See the example request after this parameter list.
This parameter, keywords, or keywords_path must be specified. You cannot specify this parameter and keywords or keywords_path.
Poorly written regular expressions can cause Elasticsearch to run slowly or result in stack overflow errors, causing the running node to suddenly exit.
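For example, the following analyze API request is a sketch of the earlier fox running and jumping example that uses keywords_pattern instead of keywords. The Java regular expression jump.* is assumed here for illustration; it is applied to the whole token, so it marks jumping as a keyword.

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords_pattern": "jump.*"
    },
    "stemmer"
  ],
  "text": "fox running and jumping"
}

As in the keywords example above, running is stemmed to run while jumping is left unstemmed: [ fox, run, and, jumping ].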
To customize the keyword_marker filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following create index API request uses a custom keyword_marker filter and the porter_stem filter to configure a new custom analyzer.
The custom keyword_marker filter marks tokens specified in the analysis/example_word_list.txt file as keywords. The porter_stem filter does not stem these tokens.
PUT /my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"my_custom_keyword_marker_filter",
"porter_stem"
]
}
},
"filter": {
"my_custom_keyword_marker_filter": {
"type": "keyword_marker",
"keywords_path": "analysis/example_word_list.txt"
}
}
}
}
}
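The create index API request does not create the analysis/example_word_list.txt file; it must be a plain UTF-8 text file in the config location with one keyword per line. For illustration, assume it contains a single entry:

jumping

With that assumed word list, you can check the new analyzer with the analyze API. The porter_stem filter stems running to run, but jumping is marked as a keyword and left unstemmed:

GET /my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "fox running and jumping"
}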