Advanced features
Text classification
TNTSearch includes a Naive Bayes text classifier that can learn to categorize documents based on training data. This is useful for spam detection, sentiment analysis, topic categorization, and other text classification tasks.
Basic usage
The classifier has a simple learn-and-predict API:
use TeamTNT\TNTSearch\Classifier\TNTClassifier;
$classifier = new TNTClassifier();
// Train the classifier with labeled examples
$classifier->learn("A great game", "Sports");
$classifier->learn("The election was over", "Not sports");
$classifier->learn("Very clean match", "Sports");
$classifier->learn("A clean but forgettable game", "Sports");
// Predict the category of new text
$result = $classifier->predict("It was a close election");
echo $result['label']; // "Not sports"
echo $result['likelihood']; // log-likelihood score
Training the classifier
Use the learn() method to train the classifier with labeled examples. Accuracy generally improves as you add more examples per category, provided they resemble the text you intend to classify:
$classifier = new TNTClassifier();
// A toy training set; provide many more examples per category in practice
$classifier->learn("chinese beijing chinese", "chinese");
$classifier->learn("chinese chinese shanghai", "chinese");
$classifier->learn("chinese macao", "chinese");
$classifier->learn("tokyo japan chinese", "japanese");
$result = $classifier->predict("chinese chinese chinese tokyo japan");
echo $result['label']; // "chinese"
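To see why the prediction is "chinese" even though the test text also contains "tokyo" and "japan", the arithmetic can be worked by hand. This is a sketch under two assumptions: each whitespace-separated word counts as one token (so the training data has six distinct words), and categories are scored in the standard multinomial Naive Bayes form with add-one smoothing. The symbols N_c, T_c and |V| are notation introduced here, not library names; c and j stand for the "chinese" and "japanese" categories.

```latex
% N_c: training documents in category c; N: all training documents
% T_c: total tokens in category c; |V| = 6 distinct words in the training data
\hat{P}(c) = \frac{N_c}{N}, \qquad
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{T_c + |V|}

\begin{align*}
\hat{P}(c) &= \tfrac{3}{4}, &
\hat{P}(j) &= \tfrac{1}{4} \\
\hat{P}(\text{chinese} \mid c) &= \frac{5 + 1}{8 + 6} = \frac{3}{7}, &
\hat{P}(\text{tokyo} \mid c) = \hat{P}(\text{japan} \mid c) &= \frac{0 + 1}{8 + 6} = \frac{1}{14} \\
\hat{P}(\text{chinese} \mid j) &= \frac{1 + 1}{3 + 6} = \frac{2}{9}, &
\hat{P}(\text{tokyo} \mid j) = \hat{P}(\text{japan} \mid j) &= \frac{1 + 1}{3 + 6} = \frac{2}{9}
\end{align*}

\begin{align*}
\mathrm{score}(c) &= \tfrac{3}{4} \left(\tfrac{3}{7}\right)^{3} \left(\tfrac{1}{14}\right)^{2}
  \approx 3.0 \times 10^{-4}, & \ln \mathrm{score}(c) &\approx -8.11 \\
\mathrm{score}(j) &= \tfrac{1}{4} \left(\tfrac{2}{9}\right)^{5}
  \approx 1.4 \times 10^{-4}, & \ln \mathrm{score}(j) &\approx -8.91
\end{align*}
```

The word "chinese" appears three times in the test text and is far more likely under c, which outweighs the two words that favor j.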
Training from a dataset
For real-world applications, you'll typically train from a labeled dataset and hold back part of it to measure accuracy:
$classifier = new TNTClassifier();
// Load training data (e.g., labeled SMS messages)
$data = json_decode(file_get_contents('sms-texts.json'));
// Use 80% for training; cast to int so the value can serve as an array index
$trainingSize = (int) floor(count($data) * 0.80);
for ($i = 0; $i < $trainingSize; $i++) {
    $classifier->learn($data[$i]->message, $data[$i]->label);
}
// Test with the remaining 20%
$correct = 0;
$total = 0;
for ($i = $trainingSize; $i < count($data); $i++) {
    $total++;
    $guess = $classifier->predict($data[$i]->message);
    if ($guess['label'] == $data[$i]->label) {
        $correct++;
    }
}
// Guard against an empty test set before dividing
$accuracy = $total > 0 ? ($correct / $total) * 100 : 0;
echo "Accuracy: $accuracy%"; // Typically 98%+ for spam detection
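A split by position, as above, gives a fair accuracy estimate only if the dataset is not ordered by label or date. One possible precaution is to shuffle before splitting. The sketch below assumes nothing about TNTSearch: splitDataset() and its parameters are hypothetical names made up for this example, and the fixed seed is only there to make the split repeatable.

```php
<?php
/**
 * Shuffle labeled examples and split them into training and test sets.
 * Hypothetical helper for illustration; not part of TNTSearch.
 * Note: mt_srand() reseeds PHP's global Mt19937 generator.
 */
function splitDataset(array $data, float $trainRatio = 0.8, int $seed = 42): array
{
    mt_srand($seed);   // fixed seed so the same split can be reproduced
    shuffle($data);    // $data is a by-value copy; the caller's array is untouched
    $cut = (int) floor(count($data) * $trainRatio);

    return [array_slice($data, 0, $cut), array_slice($data, $cut)];
}

// Usage with placeholder data: 10 examples give 8 for training and 2 for testing
$examples = range(1, 10);
[$train, $test] = splitDataset($examples);
echo count($train) . " training, " . count($test) . " test\n";
```

The two returned arrays can then replace the index arithmetic in the loops above, for example by iterating over $train with learn() and over $test with predict().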
Saving and loading
You can persist a trained classifier to disk and load it later:
// Train and save
$classifier = new TNTClassifier();
$classifier->learn("spam message", "spam");
$classifier->learn("normal message", "ham");
// ... many more examples
$classifier->save('/path/to/classifier.dat');
// Load later
$classifier = new TNTClassifier();
$classifier->load('/path/to/classifier.dat');
$result = $classifier->predict("Win a free prize!");
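A common pattern built on save() and load() is to train only when no saved model exists yet. The following is a minimal sketch of that idea: loadOrTrain() and the $train callback are hypothetical names, and the only classifier methods it relies on are the save() and load() calls shown above.

```php
<?php
/**
 * Load a saved model if the file exists; otherwise train and save it.
 * Hypothetical helper for illustration; $classifier is expected to expose
 * the load() and save() methods shown above.
 */
function loadOrTrain(object $classifier, string $path, callable $train): object
{
    if (is_readable($path)) {
        $classifier->load($path);   // reuse the persisted model
        return $classifier;
    }

    $train($classifier);            // caller-supplied training routine
    $classifier->save($path);       // persist for the next run

    return $classifier;
}

// Usage (requires TNTSearch, so it is left commented out here):
// $classifier = loadOrTrain(new TNTClassifier(), '/path/to/classifier.dat', function ($c) {
//     $c->learn("spam message", "spam");
//     $c->learn("normal message", "ham");
// });
```

Delete the saved file whenever the training data changes, otherwise the stale model keeps being loaded.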
How it works
The classifier uses the Naive Bayes algorithm with Laplace smoothing:
- Learning: For each labeled document, it tokenizes the text and counts word occurrences per category
- Prediction: For new text, it calculates the log-likelihood of each category using:
  - P(category): The prior probability (how common each category is in training data)
  - P(word|category): The probability of each word appearing in a given category
- The category with the highest combined score (the log of the prior plus the sum of the logs of the word probabilities) wins
Laplace smoothing (add-one smoothing) ensures that unseen words don't zero out the probability calculation.
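The steps above can be sketched in a few dozen lines of PHP. This is an illustration of the technique, not TNTClassifier's source: the class name TinyNaiveBayes is made up, tokenization is a plain lowercase whitespace split, and no stemming is applied, so its scores need not match the library's.

```php
<?php
// Illustrative multinomial Naive Bayes with add-one (Laplace) smoothing.
// Assumption: lowercase whitespace tokens, no stemming. Not the library's code.
final class TinyNaiveBayes
{
    private array $docCounts = [];   // label => number of training documents
    private array $wordCounts = [];  // label => [word => occurrences]
    private array $tokenCounts = []; // label => total tokens seen
    private array $vocabulary = [];  // word => true, across all labels

    private function tokenize(string $text): array
    {
        return preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    }

    public function learn(string $text, string $label): void
    {
        $this->docCounts[$label] = ($this->docCounts[$label] ?? 0) + 1;
        $this->tokenCounts[$label] = $this->tokenCounts[$label] ?? 0;
        foreach ($this->tokenize($text) as $word) {
            $this->wordCounts[$label][$word] = ($this->wordCounts[$label][$word] ?? 0) + 1;
            $this->tokenCounts[$label]++;
            $this->vocabulary[$word] = true;
        }
    }

    public function predict(string $text): array
    {
        $best = ['label' => null, 'likelihood' => -INF];
        $vocabSize = count($this->vocabulary);
        if ($vocabSize === 0) {
            return $best; // nothing learned yet
        }
        $totalDocs = array_sum($this->docCounts);
        $words = $this->tokenize($text);

        foreach ($this->docCounts as $label => $docs) {
            $score = log($docs / $totalDocs); // log P(category)
            foreach ($words as $word) {
                $count = $this->wordCounts[$label][$word] ?? 0;
                // log P(word|category); the +1 keeps unseen words above zero
                $score += log(($count + 1) / ($this->tokenCounts[$label] + $vocabSize));
            }
            if ($score > $best['likelihood']) {
                $best = ['label' => $label, 'likelihood' => $score];
            }
        }

        return $best;
    }
}

$nb = new TinyNaiveBayes();
$nb->learn("A great game", "Sports");
$nb->learn("The election was over", "Not sports");
$nb->learn("Very clean match", "Sports");
$nb->learn("A clean but forgettable game", "Sports");

$result = $nb->predict("It was a close election");
echo $result['label'] . "\n"; // "Not sports"
```

Summing logarithms instead of multiplying raw probabilities avoids floating-point underflow on long texts, which is why the score is a negative log-likelihood rather than a value between 0 and 1.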
Use cases
- Spam detection: Train with spam/ham labeled messages
- Sentiment analysis: Train with positive/negative labeled reviews
- Topic categorization: Assign articles to topics (sports, politics, tech, etc.)
- Language detection: Train with text samples from different languages
Tip
The classifier uses the same tokenizer and stemmer as the search engine. For best results, ensure your training data is representative of the text you'll be classifying in production.
