Advanced features

Text classification

TNTSearch includes a Naive Bayes text classifier that can learn to categorize documents based on training data. This is useful for spam detection, sentiment analysis, topic categorization, and other text classification tasks.


Basic usage

The classifier has a simple learn-and-predict API:

use TeamTNT\TNTSearch\Classifier\TNTClassifier;

$classifier = new TNTClassifier();

// Train the classifier with labeled examples
$classifier->learn("A great game", "Sports");
$classifier->learn("The election was over", "Not sports");
$classifier->learn("Very clean match", "Sports");
$classifier->learn("A clean but forgettable game", "Sports");

// Predict the category of new text
$result = $classifier->predict("It was a close election");
echo $result['label'];      // "Not sports"
echo $result['likelihood'];  // log-likelihood score

Training the classifier

Use the learn() method to train the classifier with labeled examples. The more examples you provide, the more accurate its predictions will be:

$classifier = new TNTClassifier();

// Provide many training examples per category
$classifier->learn("chinese beijing chinese", "chinese");
$classifier->learn("chinese chinese shanghai", "chinese");
$classifier->learn("chinese macao", "chinese");
$classifier->learn("tokyo japan chinese", "japanese");

$result = $classifier->predict("chinese chinese chinese tokyo japan");
echo $result['label']; // "chinese"

Training from a dataset

For real-world applications, you'll typically train from a dataset:

$classifier = new TNTClassifier();

// Load training data (e.g., labeled SMS messages)
$data = json_decode(file_get_contents('sms-texts.json'));

// Use 80% of the data for training
$trainingSize = (int) floor(count($data) * 0.80);

for ($i = 0; $i < $trainingSize; $i++) {
    $classifier->learn($data[$i]->message, $data[$i]->label);
}

// Test with the remaining 20%
$correct = 0;
$total = 0;
$count = count($data);
for ($i = $trainingSize; $i < $count; $i++) {
    $total++;
    $guess = $classifier->predict($data[$i]->message);
    if ($guess['label'] === $data[$i]->label) {
        $correct++;
    }
}

$accuracy = ($correct / $total) * 100;
echo "Accuracy: $accuracy%"; // Typically 98%+ for spam detection
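The 80/20 split above assumes the dataset is already in random order; if the file happens to be sorted by label, the slice would train on one class only. A minimal sketch of shuffling before splitting (the inline `$data` array here is a made-up stand-in for the loaded JSON):

```php
<?php
// Shuffle so the 80/20 split is not biased by file order,
// then slice into training and test sets.
$data = [
    ['message' => 'win a free prize',     'label' => 'spam'],
    ['message' => 'see you at lunch',     'label' => 'ham'],
    ['message' => 'claim your reward',    'label' => 'spam'],
    ['message' => 'meeting moved to 3pm', 'label' => 'ham'],
    ['message' => 'free tickets inside',  'label' => 'spam'],
];

shuffle($data);

$trainingSize = (int) floor(count($data) * 0.80);
$training = array_slice($data, 0, $trainingSize);
$testing  = array_slice($data, $trainingSize);

// Every record ends up in exactly one of the two sets.
assert(count($training) + count($testing) === count($data));
```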

Saving and loading

You can persist a trained classifier to disk and load it later:

// Train and save
$classifier = new TNTClassifier();
$classifier->learn("spam message", "spam");
$classifier->learn("normal message", "ham");
// ... many more examples
$classifier->save('/path/to/classifier.dat');

// Load later
$classifier = new TNTClassifier();
$classifier->load('/path/to/classifier.dat');

$result = $classifier->predict("Win a free prize!");
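Conceptually, saving a trained classifier just persists the learned counts; the sketch below illustrates the round-trip with PHP's serialize()/unserialize(). The `$model` array shape here is invented for illustration, and TNTClassifier's actual on-disk format is internal to the library:

```php
<?php
// Illustrative round-trip: write learned state to disk, read it back.
$model = [
    'wordCounts' => ['spam' => ['free' => 3], 'ham' => ['meeting' => 2]],
    'docCounts'  => ['spam' => 5, 'ham' => 7],
];

$path = tempnam(sys_get_temp_dir(), 'clf');
file_put_contents($path, serialize($model));

$restored = unserialize(file_get_contents($path));
assert($restored === $model); // round-trip preserves the model exactly
unlink($path);
```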

How it works

The classifier uses the Naive Bayes algorithm with Laplace smoothing:

  1. Learning: For each labeled document, it tokenizes the text and counts word occurrences per category
  2. Prediction: For new text, it calculates the log-likelihood of each category using:
    • P(category): The prior probability (how common each category is in training data)
    • P(word|category): The probability of each word appearing in a given category
  3. Selection: The category with the highest combined score wins

Laplace smoothing (add-one smoothing) ensures that unseen words don't zero out the probability calculation.
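The steps above can be sketched as a minimal, self-contained implementation using whitespace tokenization and log-probabilities (TNTClassifier's real tokenizer and stemmer are more sophisticated, so this is an illustration of the algorithm, not the library's code):

```php
<?php
// Minimal multinomial Naive Bayes with Laplace (add-one) smoothing.
function learnNB(array &$model, string $text, string $label): void
{
    $model['docs'][$label] = ($model['docs'][$label] ?? 0) + 1;
    foreach (preg_split('/\s+/', strtolower(trim($text))) as $word) {
        $model['words'][$label][$word] = ($model['words'][$label][$word] ?? 0) + 1;
        $model['vocab'][$word] = true;
    }
}

function predictNB(array $model, string $text): string
{
    $totalDocs = array_sum($model['docs']);
    $vocabSize = count($model['vocab']);
    $best = '';
    $bestScore = -INF;

    foreach ($model['docs'] as $label => $docCount) {
        // log P(category): the prior, from training-data frequency.
        $score = log($docCount / $totalDocs);
        $wordsInClass = array_sum($model['words'][$label]);

        foreach (preg_split('/\s+/', strtolower(trim($text))) as $word) {
            // log P(word|category) with add-one smoothing, so an
            // unseen word gets a small but non-zero probability.
            $count = $model['words'][$label][$word] ?? 0;
            $score += log(($count + 1) / ($wordsInClass + $vocabSize));
        }

        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $label;
        }
    }
    return $best;
}

// Reproduce the example from "Training the classifier" above.
$model = ['docs' => [], 'words' => [], 'vocab' => []];
learnNB($model, 'chinese beijing chinese', 'chinese');
learnNB($model, 'chinese chinese shanghai', 'chinese');
learnNB($model, 'chinese macao', 'chinese');
learnNB($model, 'tokyo japan chinese', 'japanese');

echo predictNB($model, 'chinese chinese chinese tokyo japan'); // "chinese"
```

With these four training documents, the "chinese" class wins even though "tokyo" and "japan" only appear in the "japanese" class, because smoothing keeps their per-class probabilities non-zero while the three occurrences of "chinese" dominate.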


Use cases

  • Spam detection: Train with spam/ham labeled messages
  • Sentiment analysis: Train with positive/negative labeled reviews
  • Topic categorization: Assign articles to topics (sports, politics, tech, etc.)
  • Language detection: Train with text samples from different languages

Tip

The classifier uses the same tokenizer and stemmer as the search engine. For best results, ensure your training data is representative of the text you'll be classifying in production.
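One way to keep inputs representative is to normalize text the same way before both learn() and predict(). The helper below is a hypothetical sketch, not part of the TNTSearch API:

```php
<?php
// Hypothetical helper: apply identical normalization to training
// and production text so the classifier sees consistent tokens.
function normalize(string $text): string
{
    $text = strtolower($text);
    // Strip everything except letters, digits and whitespace.
    $text = preg_replace('/[^a-z0-9\s]+/', ' ', $text);
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo normalize("Win a FREE prize!!!  Call now."); // "win a free prize call now"
```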
