Custom tokenizers

Tokenizers control how text is split into individual terms for indexing and searching. TNTSearch ships with several built-in tokenizers and allows you to create custom ones for specialized use cases.

Default tokenizer

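The default Tokenizer lowercases the text and splits it on every run of characters that are not letters, numbers, connector punctuation, dashes, or @. The pattern is Unicode-aware, so non-ASCII letters are kept: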

use TeamTNT\TNTSearch\Support\Tokenizer;

$tokenizer = new Tokenizer;
$tokens = $tokenizer->tokenize("This is some text");
// ["this", "is", "some", "text"]

$tokens = $tokenizer->tokenize("Superman (1941)");
// ["superman", "1941"]

// Unicode support: non-ASCII letters stay inside their tokens
$tokens = $tokenizer->tokenize("Köln ist schön");
// ["köln", "ist", "schön"]

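The default pattern is /[^\p{L}\p{N}\p{Pc}\p{Pd}@]+/u, which splits on any run of characters that are not letters, numbers, connector punctuation (such as _), dashes, or the @ symbol. Hyphenated terms such as "70-200" therefore remain a single token.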
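To see what the pattern keeps and what it splits on, it can be applied directly with preg_split(). This is a standalone sketch that does not call TNTSearch; the helper name splitWithPattern is hypothetical, and the lowercasing step is an assumption based on the example outputs above.

```php
<?php
// The pattern quoted in the text above.
$pattern = '/[^\p{L}\p{N}\p{Pc}\p{Pd}@]+/u';

// Hypothetical helper: lowercase, then split on every run of
// characters that the pattern does not allow inside a token.
function splitWithPattern(string $pattern, string $text): array
{
    return preg_split($pattern, mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

$tokens = splitWithPattern($pattern, 'E-mail john_doe@example.com (2024)!');
echo implode(', ', $tokens), PHP_EOL;
// e-mail, john_doe@example, com, 2024
```

The hyphen, underscore, and @ stay inside tokens, while the period, parentheses, and exclamation mark act as separators.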

Built-in n-gram tokenizers

N-gram tokenizers break text into overlapping character sequences. They're useful for typo tolerance, substring matching, and languages without clear word boundaries.

BigramTokenizer (2-gram)

use TeamTNT\TNTSearch\Support\BigramTokenizer;

$tokenizer = new BigramTokenizer;
$tokens = $tokenizer->tokenize("search");
// ["se", "ea", "ar", "rc", "ch"]

TrigramTokenizer (3-gram)

use TeamTNT\TNTSearch\Support\TrigramTokenizer;

$tokenizer = new TrigramTokenizer;
$tokens = $tokenizer->tokenize("search");
// ["sea", "ear", "arc", "rch"]

FourgramTokenizer (4-gram)

use TeamTNT\TNTSearch\Support\FourgramTokenizer;

FivegramTokenizer (5-gram)

use TeamTNT\TNTSearch\Support\FivegramTokenizer;
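The section gives no sample output for the 4-gram and 5-gram tokenizers. Assuming they use the same sliding window as the bigram and trigram examples above, the expected shape can be sketched in plain PHP. The function characterNgrams is a hypothetical helper, not part of TNTSearch.

```php
<?php
// Hypothetical helper, not the library's code: slide a window of
// $n characters across the word, one position at a time.
function characterNgrams(string $word, int $n): array
{
    $word = mb_strtolower($word);
    $ngrams = [];
    // The last window starts at length - n; shorter words yield nothing.
    for ($i = 0; $i <= mb_strlen($word) - $n; $i++) {
        $ngrams[] = mb_substr($word, $i, $n);
    }
    return $ngrams;
}

$fourgrams = characterNgrams('search', 4); // ["sear", "earc", "arch"]
$fivegrams = characterNgrams('search', 5); // ["searc", "earch"]
```

Larger n-grams produce fewer, more specific terms: the index is smaller, but a single typo disturbs a larger share of the terms generated for a word.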

NGramTokenizer (configurable)

A general n-gram tokenizer with configurable min and max gram sizes:

use TeamTNT\TNTSearch\Support\NGramTokenizer;

$tokenizer = new NGramTokenizer;
// Default: 3-gram
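What a min/max range means can be illustrated without the library. The sketch below is a hypothetical helper (ngramRange), not the NGramTokenizer API, and the output order (smallest size first) is an arbitrary choice made for the example.

```php
<?php
// Hypothetical helper: emit every n-gram whose size lies between
// $min and $max inclusive, smallest size first.
function ngramRange(string $word, int $min, int $max): array
{
    $word = mb_strtolower($word);
    $length = mb_strlen($word);
    $ngrams = [];
    for ($n = $min; $n <= $max; $n++) {
        // A window of size $n fits as long as it ends inside the word.
        for ($i = 0; $i + $n <= $length; $i++) {
            $ngrams[] = mb_substr($word, $i, $n);
        }
    }
    return $ngrams;
}

$grams = ngramRange('data', 2, 3);
// ["da", "at", "ta", "dat", "ata"]
```

Setting both bounds to the same value reduces this to a fixed-size n-gram tokenizer.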

EdgeNgramTokenizer

Generates n-grams from word edges, useful for prefix-based autocomplete:

use TeamTNT\TNTSearch\Support\EdgeNgramTokenizer;

$tokenizer = new EdgeNgramTokenizer;
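Edge n-grams are prefixes of increasing length, which is why they suit autocomplete: a partially typed word matches one of the indexed prefixes exactly. A minimal standalone sketch follows; edgeNgrams and its $min and $max parameters are hypothetical and are not taken from the EdgeNgramTokenizer API.

```php
<?php
// Hypothetical helper: return the prefixes of a word whose length
// lies between $min and $max, capped at the word's own length.
function edgeNgrams(string $word, int $min, int $max): array
{
    $word = mb_strtolower($word);
    $limit = min($max, mb_strlen($word));
    $prefixes = [];
    for ($n = $min; $n <= $limit; $n++) {
        $prefixes[] = mb_substr($word, 0, $n);
    }
    return $prefixes;
}

$prefixes = edgeNgrams('search', 2, 4);
// ["se", "sea", "sear"]
```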

ProductTokenizer

Intended for product names and codes: it splits only on whitespace, commas, and periods, so codes such as "70-200" stay in one token:

use TeamTNT\TNTSearch\Support\ProductTokenizer;

Setting a tokenizer

At configuration time

$tnt->loadConfig([
    'tokenizer' => \TeamTNT\TNTSearch\Support\TrigramTokenizer::class,
    // ... other config
]);

At index creation time

use TeamTNT\TNTSearch\Support\TrigramTokenizer;

$indexer = $tnt->createIndex('articles.index');
$indexer->setTokenizer(new TrigramTokenizer);
$indexer->query('SELECT id, title, body FROM articles;');
$indexer->run();

The tokenizer's class name is stored in the index metadata and loaded automatically at search time.

Creating a custom tokenizer

To create your own tokenizer, extend AbstractTokenizer and implement TokenizerInterface:

use TeamTNT\TNTSearch\Support\AbstractTokenizer;
use TeamTNT\TNTSearch\Support\TokenizerInterface;

class ProductCodeTokenizer extends AbstractTokenizer implements TokenizerInterface
{
    // Split on runs of whitespace, commas, and periods
    protected static $pattern = '/[\s,\.]+/';

    public function tokenize($text, $stopwords = [])
    {
        $tokens = preg_split(
            $this->getPattern(),
            mb_strtolower($text),
            -1,
            PREG_SPLIT_NO_EMPTY
        );

        // Drop stopwords and reindex so the result is a plain list
        return array_values(array_diff($tokens, $stopwords));
    }
}

Then use it when creating your index:

$indexer = $tnt->createIndex('products.index');
$indexer->setTokenizer(new ProductCodeTokenizer);
$indexer->query('SELECT id, name, sku FROM products;');
$indexer->run();

Example: keeping hyphens in tokens

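The default pattern shown earlier already keeps dash characters (\p{Pd}) inside tokens. If you replace it with your own pattern, you can still keep hyphenated terms together (e.g., product codes like "70-200") by splitting only on whitespace, commas, and periods: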

class HyphenKeepingTokenizer extends AbstractTokenizer implements TokenizerInterface
{
    // Hyphens are not in the separator set, so they stay inside tokens
    protected static $pattern = '/[\s,\.]+/';

    public function tokenize($text, $stopwords = [])
    {
        return preg_split(
            $this->getPattern(),
            mb_strtolower($text),
            -1,
            PREG_SPLIT_NO_EMPTY
        );
    }
}

// "Canon 70-200" tokenizes to ["canon", "70-200"]
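One side effect of this separator set is worth checking before you adopt it: because the period is a separator, decimal values are split as well. The standalone sketch below applies the same pattern with preg_split() directly, without TNTSearch; the sample product string is made up for illustration.

```php
<?php
// Same separator set as the tokenizer above: whitespace, commas, periods.
$pattern = '/[\s,\.]+/';

$tokens = preg_split(
    $pattern,
    mb_strtolower('Canon EF 70-200mm, f/2.8'),
    -1,
    PREG_SPLIT_NO_EMPTY
);
echo implode(' | ', $tokens), PHP_EOL;
// canon | ef | 70-200mm | f/2 | 8
```

The hyphenated range survives, but "f/2.8" becomes "f/2" and "8". If decimal values matter in your data, one option is to remove the period from the separator set.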

Important

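The tokenizer used at search time must match the tokenizer used at index time. TNTSearch stores the tokenizer class name in the index metadata and loads it automatically when you call selectIndex(), so a custom tokenizer class must also be loadable (for example, through your autoloader) in the code that runs the search. If you change tokenizers, rebuild the index.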