Architecture guide
This guide explains the internal architecture of TNTSearch, how the major components fit together, and the design decisions behind the library.
High-level overview
TNTSearch is organized into these major subsystems:
- Indexing - Reads data from a source (database or filesystem), tokenizes and stems it, and stores an inverted index
- Searching - Takes a query, processes it through the same tokenizer and stemmer, looks up the inverted index, and ranks results using BM25
- Storage engines - Abstraction layer that allows the inverted index to be stored in SQLite or Redis
- Connectors - Database connectors that read source data from MySQL, PostgreSQL, SQLite, SQL Server, or the filesystem
Directory structure
src/
├── TNTSearch.php # Main search API
├── TNTGeoSearch.php # Geographic search
├── TNTFuzzyMatch.php # Vector-based fuzzy matching
├── Indexer/
│ ├── TNTIndexer.php # Main indexer
│ └── TNTGeoIndexer.php # Geographic indexer
├── Engines/
│ ├── SqliteEngine.php # SQLite storage engine
│ ├── RedisEngine.php # Redis storage engine
│ └── EngineTrait.php # Shared engine behavior
├── Contracts/
│ └── EngineContract.php # Engine interface
├── Connectors/ # Database connectors
│ ├── MySqlConnector.php
│ ├── PostgresConnector.php
│ ├── SqliteConnector.php
│ ├── SqlServerConnector.php
│ └── FileSystemConnector.php
├── Stemmer/ # Language stemmers
│ ├── PorterStemmer.php
│ ├── GermanStemmer.php
│ ├── CroatianStemmer.php
│ └── ... (12 total)
├── Support/ # Tokenizers and utilities
│ ├── Tokenizer.php
│ ├── NGramTokenizer.php
│ ├── Highlighter.php
│ ├── Expression.php
│ └── Collection.php
├── Classifier/
│ └── TNTClassifier.php # Naive Bayes classifier
├── KeywordExtraction/
│ └── Rake.php # RAKE keyword extraction
├── Spell/
│ └── JaroWinklerDistance.php # String distance
├── Stopwords/ # JSON stopword files
└── FileReaders/
  └── TextFileReader.php # Plain text file reader
The inverted index
At the core of TNTSearch is the inverted index, a data structure that maps terms to the documents containing them. Given a document like:
"Romeo and Juliet is a tragedy"
The inverted index stores entries like:
| Term | Documents |
|---|---|
| romeo | doc_1 (1 hit) |
| juliet | doc_1 (1 hit) |
| tragedi | doc_1 (1 hit) |
Notice that "tragedy" is stored as "tragedi" because the stemmer reduced it to its root form. Stopwords like "and", "is", "a" are typically filtered out.
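The mapping above can be sketched in a few lines of PHP. This is a minimal illustration, not the library's implementation: `toyTokenize`, `toyStem`, and `buildInvertedIndex` are hypothetical names, and the one-rule stemmer only stands in for a real stemmer such as the bundled Porter stemmer.

```php
<?php
// Toy stand-ins for the library's tokenizer and stemmer (assumptions for illustration).
function toyTokenize(string $text): array
{
    // Lowercase, then split on anything that is not an ASCII letter or digit.
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

function toyStem(string $word): string
{
    // Hypothetical one-rule stemmer: a trailing "y" becomes "i" (tragedy -> tragedi).
    return preg_replace('/y$/', 'i', $word);
}

function buildInvertedIndex(array $documents, array $stopwords): array
{
    $index = [];
    foreach ($documents as $docId => $text) {
        foreach (toyTokenize($text) as $token) {
            if (in_array($token, $stopwords, true)) {
                continue; // stopwords never reach the index
            }
            $term = toyStem($token);
            // Structure: term => [docId => hit count]
            $index[$term][$docId] = ($index[$term][$docId] ?? 0) + 1;
        }
    }
    return $index;
}

$index = buildInvertedIndex(
    ['doc_1' => 'Romeo and Juliet is a tragedy'],
    ['and', 'is', 'a']
);
// $index now holds the terms romeo, juliet, and tragedi, each with one hit in doc_1.
```

Keying the postings by document ID keeps the hit count next to the document it belongs to, which is the information the scoring step needs later.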
Engine contract
Both storage engines (SQLite and Redis) implement the EngineContract interface, which defines methods for:
- Creating and selecting indexes
- Storing and retrieving wordlists and document lists
- Running queries against the source database
- Processing documents (tokenizing, stemming, saving)
- Fuzzy search and as-you-type functionality
The EngineTrait contains shared logic used by both engines, including text stemming, document processing, and connector creation.
Search flow
When you call $tnt->search("romeo juliet"), here's what happens:
- Tokenize: The query is split into tokens: ["romeo", "juliet"]
- Stem: Each token is stemmed: ["romeo", "juliet"]
- Lookup: For each stemmed term, retrieve matching documents from the inverted index
- Score: Calculate BM25 relevance scores for each document:
  - TF (Term Frequency): How often the term appears in the document
  - IDF (Inverse Document Frequency): How rare the term is across all documents
  - Document length: Normalized by average document length
- Rank: Sort documents by score (highest first)
- Return: Return the top N document IDs
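The steps above can be sketched as one small function over an in-memory index. Everything here is an assumption made for illustration: `toySearch`, the `$stem` and `$score` callables, and the sample index are hypothetical, and the scorer is a raw term-frequency placeholder, not BM25.

```php
<?php
// Hypothetical in-memory index: term => [docId => hit count].
$index = [
    'romeo'  => [1 => 2, 2 => 1],
    'juliet' => [1 => 1, 3 => 4],
];

/**
 * Tokenize, stem, look up, score, rank, and return the top N document IDs.
 * $stem and $score are caller-supplied callables, not library API.
 */
function toySearch(string $query, array $index, callable $stem, callable $score, int $limit): array
{
    $tokens = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);
    $scores = [];
    foreach ($tokens as $token) {
        $term = $stem($token);
        $postings = $index[$term] ?? []; // lookup in the inverted index
        foreach ($postings as $docId => $tf) {
            // A document matching several terms accumulates one contribution per term.
            $scores[$docId] = ($scores[$docId] ?? 0.0) + $score($tf, count($postings));
        }
    }
    arsort($scores); // highest score first, document IDs preserved as keys
    return array_slice(array_keys($scores), 0, $limit);
}

$identityStem = fn(string $word): string => $word;
// Placeholder scorer: raw term frequency, ignoring how many documents match.
$rawTf = fn(int $tf, int $matchingDocs): float => (float) $tf;

$ids = toySearch('romeo juliet', $index, $identityStem, $rawTf, 2);
// Document 3 scores 4, document 1 scores 2 + 1 = 3, document 2 scores 1.
```

Passing the scorer as a callable keeps the flow separate from the ranking formula, so the same skeleton works with any per-term scoring function.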
BM25 ranking
TNTSearch uses the BM25 algorithm for relevance ranking:
score = IDF * ((tfWeight + 1) * tf) / (tfWeight * ((1 - dlWeight) + dlWeight) + tf)
Where:
- tf = term frequency in the document
- IDF = log(totalDocs / matchingDocs)
- tfWeight = 1 (term frequency weight)
- dlWeight = 0.5 (document length weight)
Documents matching more query terms and containing rarer terms score higher.
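To make the formula concrete, here is a direct transcription into PHP with a small worked example. The function name `termScore` and the sample numbers are hypothetical, and the natural logarithm is assumed for `log`. Note that, as the formula is written, `(1 - dlWeight) + dlWeight` simplifies to 1, so the value of `dlWeight` does not affect the result.

```php
<?php
/**
 * Per-term score, transcribed from the formula above.
 * Defaults follow the stated values tfWeight = 1 and dlWeight = 0.5.
 */
function termScore(int $tf, int $totalDocs, int $matchingDocs, float $tfWeight = 1.0, float $dlWeight = 0.5): float
{
    $idf = log($totalDocs / $matchingDocs); // natural logarithm (assumed)
    // Simplifies to 1 for any dlWeight, exactly as in the written formula.
    $lengthFactor = (1 - $dlWeight) + $dlWeight;
    return $idf * (($tfWeight + 1) * $tf) / ($tfWeight * $lengthFactor + $tf);
}

// A term that appears twice in a document.
$rare   = termScore(2, 1000, 10);  // found in 10 of 1000 documents, roughly 6.14
$common = termScore(2, 1000, 500); // found in 500 of 1000 documents, roughly 0.92
```

With the same term frequency, the rarer term scores far higher because only the IDF factor differs between the two calls.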
Boolean search flow
Boolean search uses a different approach:
- Parse: Convert the query to postfix notation using the shunting-yard algorithm
- Evaluate: Process each token:
  - Terms: Look up document sets from the inverted index
  - AND (&): Intersection of two document sets
  - OR (|): Union of two document sets
  - NOT (~): Difference (exclude documents)
- Return: Return the resulting document set
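The parse and evaluate steps can be sketched as follows. This is an illustration under stated assumptions, not the library's parser: `toPostfix` and `evaluate` are hypothetical names, all three operators are treated as binary and left-associative, and the precedence order (`~` tightest, then `&`, then `|`) is assumed, not taken from the text above.

```php
<?php
/** Convert infix tokens to postfix with the shunting-yard algorithm. */
function toPostfix(array $tokens): array
{
    // Assumed precedence: higher number binds tighter.
    $precedence = ['|' => 1, '&' => 2, '~' => 3];
    $output = [];
    $stack = [];
    foreach ($tokens as $token) {
        if ($token === '(') {
            $stack[] = $token;
        } elseif ($token === ')') {
            // Pop operators back to the matching "(".
            while ($stack && end($stack) !== '(') {
                $output[] = array_pop($stack);
            }
            array_pop($stack); // discard the "(" itself
        } elseif (array_key_exists($token, $precedence)) {
            // Left-associative: pop operators of equal or higher precedence first.
            while ($stack && end($stack) !== '(' && $precedence[end($stack)] >= $precedence[$token]) {
                $output[] = array_pop($stack);
            }
            $stack[] = $token;
        } else {
            $output[] = $token; // a search term
        }
    }
    while ($stack) {
        $output[] = array_pop($stack);
    }
    return $output;
}

/** Evaluate postfix tokens against an index of term => list of document IDs. */
function evaluate(array $postfix, array $index): array
{
    $stack = [];
    foreach ($postfix as $token) {
        if (!in_array($token, ['&', '|', '~'], true)) {
            $stack[] = $index[$token] ?? []; // term: push its document set
            continue;
        }
        $right = array_pop($stack);
        $left = array_pop($stack);
        if ($token === '&') {
            $result = array_intersect($left, $right);            // AND
        } elseif ($token === '|') {
            $result = array_unique(array_merge($left, $right));  // OR
        } else {
            $result = array_diff($left, $right);                 // NOT
        }
        $stack[] = array_values($result);
    }
    return array_pop($stack) ?? [];
}

// Hypothetical document sets for three terms.
$index = ['romeo' => [1, 2, 3], 'juliet' => [2, 3, 4], 'tragedy' => [3]];

// Documents containing both names but not "tragedy".
$postfix = toPostfix(['(', 'romeo', '&', 'juliet', ')', '~', 'tragedy']);
$docs = evaluate($postfix, $index);
```

Postfix order removes the need for parentheses at evaluation time: each operator simply combines the two most recent document sets on the stack.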
Extending TNTSearch
Custom engines
Implement the EngineContract interface to create a new storage backend:
use TeamTNT\TNTSearch\Contracts\EngineContract;

class MyCustomEngine implements EngineContract
{
    // Implement all required methods
}
Custom tokenizers
Extend AbstractTokenizer and implement TokenizerInterface. See the Custom tokenizers guide.
Custom stemmers
Create a class with a stem($word) method that returns the stemmed form of the word.
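As a rough sketch, a stemmer can be as small as the class below. Only the `stem($word)` method comes from the description above; the class name `SimplePluralStemmer` and its suffix rules are invented for illustration. The method is declared static on the assumption that this lets it be called either statically or on an instance, so check the bundled stemmers for the exact signature the library expects.

```php
<?php
/**
 * Hypothetical stemmer that strips a few English plural suffixes.
 * The rules are deliberately minimal and are not a real stemming algorithm.
 */
class SimplePluralStemmer
{
    // Static so that both SimplePluralStemmer::stem() and $instance->stem() work.
    public static function stem($word)
    {
        $word = strtolower($word);
        if (strlen($word) > 4 && substr($word, -3) === 'ies') {
            return substr($word, 0, -3) . 'y'; // stories -> story
        }
        if (strlen($word) > 3 && substr($word, -1) === 's' && substr($word, -2) !== 'ss') {
            return substr($word, 0, -1); // cats -> cat
        }
        return $word; // nothing to strip
    }
}

$stemmed = array_map([SimplePluralStemmer::class, 'stem'], ['Stories', 'cats', 'glass']);
```

Whatever rules a stemmer applies, it must be used for both indexing and searching, since the query is processed through the same stemmer as the indexed text.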
Custom file readers
Implement a reader class for non-text file formats and set it on the indexer with setFileReader().
