Class TruncateTokenFilterFactory


public class TruncateTokenFilterFactory extends TokenFilterFactory
Factory for TruncateTokenFilter.

Fixed-prefix truncation, used as a stemming method, produces good results for Turkish. It is reported that F5, which keeps the first 5 characters of each term, produced the best results in Information Retrieval on Turkish Texts.
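As a rough illustration of why a fixed prefix can act as a stemmer for an agglutinative language, the sketch below reduces several inflected forms of one Turkish word to the same 5-character prefix. The class and method names (F5Sketch, f5) are hypothetical, and the code uses only the Java standard library; it is not the filter's implementation.

```java
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates F5 truncation on plain strings.
final class F5Sketch {

    // Keeps at most the first 5 code points of a word.
    static String f5(String word) {
        int keep = Math.min(5, word.codePointCount(0, word.length()));
        return word.substring(0, word.offsetByCodePoints(0, keep));
    }

    public static void main(String[] args) {
        // "kitap" (book), "kitaplar" (books), "kitaplarımız" (our books)
        for (String word : List.of("kitap", "kitaplar", "kitaplarımız")) {
            System.out.println(word + " -> " + f5(word));
        }
    }
}
```

All three forms map to "kitap", so a query for any of them would match documents containing the others.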

Since Lucene 10.5, the filter handles code points correctly: it truncates after truncateAfterCodePoints code points and no longer produces incomplete surrogate pairs. For backwards compatibility, the old prefixLength parameter is still supported, and its behaviour depends on the luceneMatchVersion parameter. If no parameter is given, a prefix length of 5 is used. If you switch to the newer code point behaviour, reindexing may be required if your documents contain surrogate pairs (such as emojis).
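The difference between the two counting modes can be sketched with plain Java strings. The example assumes, as the paragraph above implies, that the legacy behaviour counts UTF-16 code units; the class and method names are hypothetical and this is not the filter's actual code.

```java
// Hypothetical helper, not part of Lucene: contrasts code unit and code point truncation.
final class TruncateSketch {

    // Counts UTF-16 code units (Java chars); can cut a surrogate pair in half.
    static String truncateByCodeUnits(String term, int length) {
        return term.length() <= length ? term : term.substring(0, length);
    }

    // Counts Unicode code points; a surrogate pair is kept or dropped as a whole.
    static String truncateByCodePoints(String term, int codePoints) {
        if (term.codePointCount(0, term.length()) <= codePoints) {
            return term;
        }
        return term.substring(0, term.offsetByCodePoints(0, codePoints));
    }

    public static void main(String[] args) {
        // "abcd", then U+1F600 (an emoji encoded as a surrogate pair), then "xyz"
        String term = "abcd\uD83D\uDE00xyz";

        String byUnits = truncateByCodeUnits(term, 5);
        String byPoints = truncateByCodePoints(term, 5);

        // The code unit result ends in a lone high surrogate.
        System.out.println(Character.isHighSurrogate(byUnits.charAt(byUnits.length() - 1)));
        // The code point result holds 5 code points in 6 chars.
        System.out.println(byPoints.codePointCount(0, byPoints.length()) + " " + byPoints.length());
    }
}
```

Because the two modes emit different terms for such input, terms indexed under one mode may not match queries analyzed under the other, which is why reindexing may be needed.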

The following type is recommended for "diacritics-insensitive search" for Turkish:

 <fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.ApostropheFilterFactory"/>
     <filter class="solr.TurkishLowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
     <filter class="solr.KeywordRepeatFilterFactory"/>
     <filter class="solr.TruncateTokenFilterFactory" truncateAfterCodePoints="5"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>
Since:
4.8.0
SPI Name (case-insensitive: if the name is 'htmlStrip', 'htmlstrip' can be used when looking up the service).
"truncate"