Architecture guide

This guide explains the internal architecture of TNTSearch, how the major components fit together, and the design decisions behind the library.

High-level overview

TNTSearch is organized into four major subsystems:

Indexing - Reads data from a source (database or filesystem), tokenizes and stems it, and stores an inverted index
Searching - Takes a query, runs it through the same tokenizer and stemmer used at index time, looks up the resulting terms in the inverted index, and ranks the results using BM25
Storage engines - Abstraction layer that allows the inverted index to be stored in SQLite or Redis
Connectors - Database connectors that read source data from MySQL, PostgreSQL, SQLite, SQL Server, or the filesystem

Directory structure

src/
├── TNTSearch.php              # Main search API
├── TNTGeoSearch.php           # Geographic search
├── TNTFuzzyMatch.php          # Vector-based fuzzy matching
├── Indexer/
│   ├── TNTIndexer.php         # Main indexer
│   └── TNTGeoIndexer.php      # Geographic indexer
├── Engines/
│   ├── SqliteEngine.php       # SQLite storage engine
│   ├── RedisEngine.php        # Redis storage engine
│   └── EngineTrait.php        # Shared engine behavior
├── Contracts/
│   └── EngineContract.php     # Engine interface
├── Connectors/                # Database connectors
│   ├── MySqlConnector.php
│   ├── PostgresConnector.php
│   ├── SqliteConnector.php
│   ├── SqlServerConnector.php
│   └── FileSystemConnector.php
├── Stemmer/                   # Language stemmers
│   ├── PorterStemmer.php
│   ├── GermanStemmer.php
│   ├── CroatianStemmer.php
│   └── ... (12 total)
├── Support/                   # Tokenizers and utilities
│   ├── Tokenizer.php
│   ├── NGramTokenizer.php
│   ├── Highlighter.php
│   ├── Expression.php
│   └── Collection.php
├── Classifier/
│   └── TNTClassifier.php      # Naive Bayes classifier
├── KeywordExtraction/
│   └── Rake.php               # RAKE keyword extraction
├── Spell/
│   └── JaroWinklerDistance.php # String distance
├── Stopwords/                 # JSON stopword files
└── FileReaders/
    └── TextFileReader.php     # Plain text file reader

The inverted index

At the core of TNTSearch is the inverted index, a data structure that maps each term to the documents that contain it. Given a document such as:

"Romeo and Juliet is a tragedy"

the inverted index stores entries such as:

Term	Documents
romeo	doc_1 (1 hit)
juliet	doc_1 (1 hit)
tragedi	doc_1 (1 hit)

Notice that "tragedy" is stored as "tragedi": the stemmer reduces each word to its stem, which need not be a dictionary word. Stopwords such as "and", "is", and "a" are typically filtered out before indexing.
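
A minimal sketch of how such an index could be built, assuming a toy stopword list and a lookup table standing in for the stemmer. The function names toyStem and buildInvertedIndex are invented for this example and are not part of TNTSearch:

```php
<?php
// Toy inverted index: term => [docId => hitCount].
// The stopword list and stem table are illustrative stand-ins,
// not the files or stemmers shipped with TNTSearch.

function toyStem(string $word): string
{
    // Hypothetical lookup standing in for a real stemmer.
    $stems = ['tragedy' => 'tragedi'];
    return $stems[$word] ?? $word;
}

function buildInvertedIndex(array $documents, array $stopwords): array
{
    $index = [];
    foreach ($documents as $docId => $text) {
        // Lowercase, then split on anything that is not a letter or digit.
        $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $token) {
            if (in_array($token, $stopwords, true)) {
                continue; // stopwords never reach the index
            }
            $term = toyStem($token);
            // Count one more hit for this term in this document.
            $index[$term][$docId] = ($index[$term][$docId] ?? 0) + 1;
        }
    }
    return $index;
}

$index = buildInvertedIndex(
    ['doc_1' => 'Romeo and Juliet is a tragedy'],
    ['and', 'is', 'a']
);
print_r($index);
```

For the sample document this yields the three entries from the table above (romeo, juliet, tragedi), each with one hit in doc_1.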

Engine contract

Both storage engines (SQLite and Redis) implement the EngineContract interface, which defines methods for:

Creating and selecting indexes
Storing and retrieving wordlists and document lists
Running queries against the source database
Processing documents (tokenizing, stemming, saving)
Fuzzy search and as-you-type functionality

The EngineTrait contains shared logic used by both engines, including text stemming, document processing, and connector creation.

Search flow

When you call $tnt->search("romeo juliet"), here's what happens:

1. Tokenize: The query is split into tokens: ["romeo", "juliet"]
2. Stem: Each token is stemmed: ["romeo", "juliet"]
3. Lookup: For each stemmed term, retrieve matching documents from the inverted index
4. Score: Calculate BM25 relevance scores for each document:
   - TF (Term Frequency): How often the term appears in the document
   - IDF (Inverse Document Frequency): How rare the term is across all documents
   - Document length: Normalized by average document length
5. Rank: Sort documents by score (highest first)
6. Return: Return the top N document IDs
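
The steps above can be sketched over a toy in-memory index. This is only an illustration: the index layout (term => [docId => hitCount]), the function name searchSketch, and the pluggable scorer are assumptions, and the scorer used in the usage lines is a raw hit count rather than BM25:

```php
<?php
// Sketch of the search pipeline over a toy in-memory index.
// $index maps term => [docId => hitCount]; this layout and the
// function name are assumptions made for illustration.

function searchSketch(array $index, string $query, callable $scoreTerm, int $limit = 10): array
{
    // Tokenize: lowercase and split on non-alphanumeric characters.
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);

    // Stem: identity here; a real stemmer would be applied to each token.
    $terms = $tokens;

    $scores = [];
    foreach ($terms as $term) {
        // Lookup: documents that contain the term.
        $postings = $index[$term] ?? [];
        foreach ($postings as $docId => $hits) {
            // Score: accumulate the per-term score for each document.
            $scores[$docId] = ($scores[$docId] ?? 0.0) + $scoreTerm($hits, count($postings));
        }
    }

    // Rank: highest score first (keys are preserved).
    arsort($scores);

    // Return: the top N document IDs.
    return array_slice(array_keys($scores), 0, $limit);
}

$index = [
    'romeo'  => [1 => 2, 2 => 1],
    'juliet' => [1 => 1],
    'hamlet' => [3 => 4],
];

// Placeholder scorer: raw hit count. A BM25 term score would go here.
$byHits = fn (int $hits, int $matchingDocs): float => (float) $hits;

print_r(searchSketch($index, 'Romeo Juliet', $byHits));
```

With this data, document 1 matches both terms and ranks ahead of document 2, which matches only "romeo".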

BM25 ranking

TNTSearch uses the BM25 algorithm for relevance ranking:

score = IDF * ((tfWeight + 1) * tf) / (tfWeight * ((1 - dlWeight) + dlWeight) + tf)

Where:

tf = term frequency in the document
IDF = log(totalDocs / matchingDocs)
tfWeight = 1 (term frequency weight)
dlWeight = 0.5 (document length weight)

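Per-term scores are added up for each document, so documents that match more query terms, or that contain rarer terms, score higher. Note that with the formula as written, (1 - dlWeight) + dlWeight always equals 1, so the denominator reduces to tfWeight + tf and document length does not change the score. Textbook BM25 instead multiplies dlWeight by the ratio of the document's length to the average document length.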
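
As a rough illustration, the formula above can be evaluated directly. The function name termScore and its signature are made up for this example and are not part of the TNTSearch API:

```php
<?php
// Per-term score following the formula in this section.
// Function and parameter names are illustrative.

function termScore(
    int $tf,
    int $totalDocs,
    int $matchingDocs,   // assumed to be at least 1
    float $tfWeight = 1.0,
    float $dlWeight = 0.5
): float {
    $idf = log($totalDocs / $matchingDocs);
    $num = ($tfWeight + 1) * $tf;
    // (1 - dlWeight) + dlWeight is always 1, so this is tfWeight + tf.
    $denom = $tfWeight * ((1 - $dlWeight) + $dlWeight) + $tf;
    return $idf * $num / $denom;
}

// A term that appears 3 times in a document and in 10 of 1000 documents:
// IDF = log(100), term-frequency part = (2 * 3) / (1 + 3) = 1.5
echo termScore(3, 1000, 10), PHP_EOL; // about 6.91
```

Because the term-frequency part is (2 * tf) / (1 + tf) with the default weights, it approaches 2 as tf grows, so repeated occurrences of a term give diminishing returns while a rarer term (larger IDF) scales the whole score.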

Boolean search flow

Boolean search evaluates the query as a set expression instead of scoring documents:

1. Parse: Convert the query to postfix notation using the shunting-yard algorithm
2. Evaluate: Process each token:
   - Terms: Look up document sets from the inverted index
   - AND (&): Intersection of two document sets
   - OR (|): Union of two document sets
   - NOT (~): Difference (exclude documents)
3. Return: Return the resulting document set
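
The parse and evaluate steps can be sketched as follows. This is not the library's code: the function names, the operator precedence (& and ~ bind tighter than |), the treatment of ~ as a binary difference operator, and the index layout (term => list of document IDs) are all assumptions made for the example, and the query is assumed to be well formed:

```php
<?php
// Boolean search sketch: shunting-yard to postfix, then stack evaluation.
// Precedence and the binary reading of ~ are assumptions for this example.

function toPostfix(string $query): array
{
    $prec = ['|' => 1, '&' => 2, '~' => 2];
    preg_match_all('/[&|~()]|[a-z0-9]+/', strtolower($query), $m);
    $output = [];
    $ops = [];
    foreach ($m[0] as $token) {
        if (array_key_exists($token, $prec)) {
            // Pop operators of equal or higher precedence (left-associative).
            while ($ops && end($ops) !== '(' && $prec[end($ops)] >= $prec[$token]) {
                $output[] = array_pop($ops);
            }
            $ops[] = $token;
        } elseif ($token === '(') {
            $ops[] = $token;
        } elseif ($token === ')') {
            while ($ops && end($ops) !== '(') {
                $output[] = array_pop($ops);
            }
            array_pop($ops); // discard the "("
        } else {
            $output[] = $token; // a search term
        }
    }
    while ($ops) {
        $output[] = array_pop($ops);
    }
    return $output;
}

function evaluatePostfix(array $postfix, array $index): array
{
    $stack = [];
    foreach ($postfix as $token) {
        if (!in_array($token, ['&', '|', '~'], true)) {
            $stack[] = $index[$token] ?? []; // term: push its document set
            continue;
        }
        $right = array_pop($stack);
        $left = array_pop($stack);
        if ($token === '&') {
            $stack[] = array_values(array_intersect($left, $right));
        } elseif ($token === '|') {
            $stack[] = array_values(array_unique(array_merge($left, $right)));
        } else {
            $stack[] = array_values(array_diff($left, $right));
        }
    }
    return array_pop($stack) ?? [];
}

$index = [
    'romeo'   => [1, 2, 3],
    'juliet'  => [1, 3],
    'hamlet'  => [4],
    'tragedy' => [3, 4],
];
print_r(toPostfix('romeo & juliet ~ tragedy'));
print_r(evaluatePostfix(toPostfix('(romeo & juliet) | hamlet'), $index));
```

The first call yields the postfix sequence romeo, juliet, &, tragedy, ~. The second returns documents 1, 3, and 4: the intersection of "romeo" and "juliet", united with "hamlet".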

Extending TNTSearch

Custom engines

Implement the EngineContract interface to create a new storage backend:

use TeamTNT\TNTSearch\Contracts\EngineContract;

class MyCustomEngine implements EngineContract
{
    // Implement all required methods
}

Custom tokenizers

Extend AbstractTokenizer and implement TokenizerInterface. See the Custom tokenizers guide.

Custom stemmers

Create a class with a stem($word) method that returns the stemmed form of the word.
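
A minimal sketch of such a class, assuming nothing beyond the stem($word) method described above. The class name and suffix rules are invented for illustration; check the bundled stemmers for the exact interface and method signature TNTSearch expects before relying on it:

```php
<?php
// Hypothetical stemmer that strips a few English suffixes.
// The class name and rules are invented for this example.

class SimpleSuffixStemmer
{
    // Declared static so it can be called as SimpleSuffixStemmer::stem($word)
    // or on an instance; which form the library uses is not assumed here.
    public static function stem($word)
    {
        $word = strtolower($word);
        foreach (['ing', 'ed', 's'] as $suffix) {
            $stemLength = strlen($word) - strlen($suffix);
            // Keep at least three characters so short words stay intact.
            if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
                return substr($word, 0, $stemLength);
            }
        }
        return $word;
    }
}

echo SimpleSuffixStemmer::stem('Searching'), PHP_EOL; // search
echo SimpleSuffixStemmer::stem('indexed'), PHP_EOL;   // index
```

Whatever rules a stemmer applies, the same stemmer has to be used for indexing and for searching, otherwise query terms will not match the stored terms.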

Custom file readers

Implement a reader class for non-text file formats and set it on the indexer with setFileReader().