Advanced features
Keyword extraction
TNTSearch includes a keyword extraction feature based on the RAKE (Rapid Automatic Keyword Extraction) algorithm. RAKE identifies key phrases in text by analyzing word frequency and co-occurrence, without requiring any training data.
Basic usage
use TeamTNT\TNTSearch\KeywordExtraction\Rake;
$rake = new Rake;
$keywords = $rake->extractKeywords("A scoop of ice cream");
// ["ice cream" => 4, "scoop" => 1]
RAKE automatically filters out stopwords and scores phrases based on word degree and frequency.
Extracting from longer text
RAKE works particularly well with longer, more technical text:
$rake = new Rake;
$text = "Compatibility of systems of linear constraints over the set of natural
numbers. Criteria of compatibility of a system of linear Diophantine
equations, strict inequations, and nonstrict inequations are considered.
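Upper bounds for components of a minimal set of solutions and algorithms
of construction of minimal generating sets of solutions for all types of
systems are given. These criteria and the corresponding algorithms for
constructing a minimal supporting set of solutions can be used in solving
all the considered types of systems and systems of mixed types.";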
$keywords = $rake->extractKeywords($text);
// [
// "minimal generating sets" => 8.67,
// "linear diophantine equations" => 8.5,
// "minimal supporting set" => 7.67,
// ...
// ]
The returned array is sorted by score, highest first, and contains only the top third of the candidate phrases.
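Because the result is already sorted, trimming it to a fixed number of keywords is a plain array operation. A minimal sketch, assuming output shaped like the example above: the `$keywords` array is hard-coded for illustration rather than produced by `Rake`, and `$limit` is a hypothetical name.

```php
<?php
// Hard-coded stand-in for the scored output of extractKeywords() (illustrative values).
$keywords = [
    "minimal generating sets"      => 8.67,
    "linear diophantine equations" => 8.5,
    "minimal supporting set"       => 7.67,
];

// Hypothetical cut-off: keep only the two best phrases.
$limit = 2;

// Passing true keeps keys intact even when a phrase is purely numeric
// (PHP stores such keys as integers and would otherwise renumber them).
$top = array_slice($keywords, 0, $limit, true);

foreach ($top as $phrase => $score) {
    echo $phrase, ": ", $score, PHP_EOL;
}
```

The same call works on the score-less list from the next section, where the third and fourth arguments' key handling no longer matters.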
Without scores
To get just the keyword phrases without scores:
$keywords = $rake->extractKeywords($text, false);
// ["minimal generating sets", "linear diophantine equations", ...]
Language support
RAKE uses stopword lists to identify phrase boundaries. By default, it uses English stopwords. To use a different language, pass its name to the constructor:
$rake = new Rake('english'); // default
$rake = new Rake('french');
$rake = new Rake('german');
$rake = new Rake('spanish');
$rake = new Rake('croatian');
$rake = new Rake('russian');
$rake = new Rake('italian');
$rake = new Rake('latvian');
$rake = new Rake('ukrainian');
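Applications that store a locale code per document need to translate it into one of the language names above. The sketch below is one way to do that. `rakeLanguageFor()` and the locale codes are hypothetical and not part of TNTSearch, and falling back to English for unknown locales is an assumption about what the application wants.

```php
<?php
/**
 * Hypothetical helper (not part of TNTSearch): map an ISO 639-1 code
 * to one of the stopword language names listed above.
 */
function rakeLanguageFor(string $locale): string
{
    $map = [
        'en' => 'english',
        'fr' => 'french',
        'de' => 'german',
        'es' => 'spanish',
        'hr' => 'croatian',
        'ru' => 'russian',
        'it' => 'italian',
        'lv' => 'latvian',
        'uk' => 'ukrainian',
    ];

    // Normalise "de_DE" or "FR-ca" to the two-letter prefix.
    $code = strtolower(substr($locale, 0, 2));

    // Assumption: unknown locales fall back to the English stopword list.
    return $map[$code] ?? 'english';
}

echo rakeLanguageFor('de_DE'), PHP_EOL; // german
echo rakeLanguageFor('ja'), PHP_EOL;    // english (fallback)
```

The returned name can then be passed to the constructor, as in `new Rake(rakeLanguageFor($locale))`.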
How RAKE works
- Tokenize the text into words
- Split into candidate phrases at stopword boundaries
- Calculate word scores as degree(word) / frequency(word), where degree(word) is the sum of the lengths of the candidate phrases that contain the word
- Score phrases as the sum of their word scores
- Return the top third of phrases, sorted by score
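The steps above can be sketched in plain PHP. This is an illustration of the technique, not the TNTSearch implementation: `rakeSketch()` is a hypothetical function, the stopword list passed in is a tiny made-up one, and the tokenizer handles ASCII text only. It splits phrases at stopwords alone, whereas the original RAKE description also treats punctuation as a phrase boundary. It stops after sorting and leaves out the final top-third cut, because the text does not say how that cut is rounded for very short inputs.

```php
<?php
/**
 * Illustrative RAKE scoring sketch (hypothetical, not the library code).
 * Returns every candidate phrase with its score, highest first.
 */
function rakeSketch(string $text, array $stopwords): array
{
    // 1. Tokenize: lowercase, then split on anything that is not an ASCII letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // 2. Split into candidate phrases at stopword boundaries.
    $phrases = [];
    $current = [];
    foreach ($words as $word) {
        if (in_array($word, $stopwords, true)) {
            if ($current) {
                $phrases[] = $current;
            }
            $current = [];
        } else {
            $current[] = $word;
        }
    }
    if ($current) {
        $phrases[] = $current;
    }

    // 3. Accumulate degree (sum of containing phrase lengths) and frequency per word.
    $degree = [];
    $frequency = [];
    foreach ($phrases as $phrase) {
        foreach ($phrase as $word) {
            $degree[$word] = ($degree[$word] ?? 0) + count($phrase);
            $frequency[$word] = ($frequency[$word] ?? 0) + 1;
        }
    }

    // 4. Phrase score = sum of degree / frequency over its words.
    $scores = [];
    foreach ($phrases as $phrase) {
        $score = 0;
        foreach ($phrase as $word) {
            $score += $degree[$word] / $frequency[$word];
        }
        $scores[implode(' ', $phrase)] = $score;
    }

    // 5. Sort by score, highest first (the top-third cut is left out here).
    arsort($scores);

    return $scores;
}

$result = rakeSketch("A scoop of ice cream", ['a', 'of']);
print_r($result); // ice cream => 4, scoop => 1
```

For the short sample sentence this reproduces the scores shown under basic usage.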
Helper methods
The Rake class exposes lower-level methods for custom analysis:
$rake = new Rake;
// Generate candidate keyword phrases
$phrases = $rake->generateCandidateKeywords("A scoop of ice cream");
// [["scoop"], ["ice", "cream"]]
// Calculate word degree (co-occurrence measure)
$degree = $rake->wordDegree("ice", $phrases);
// 2
// Calculate word frequency
$freq = $rake->wordFrequency("ice", $phrases);
// 1
// Get full word scores
$scores = $rake->calculateWordScores($phrases);
// ["scoop" => 1, "ice" => 2, "cream" => 2]
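The helper outputs above can be checked by hand against the scoring rule. A short worked calculation for the two candidate phrases, using only the values shown in this section (each word occurs once, and "ice" and "cream" share a phrase of length 2):

```latex
\begin{align*}
\text{score}(\text{ice})   &= \frac{\text{degree}(\text{ice})}{\text{frequency}(\text{ice})} = \frac{2}{1} = 2 \\
\text{score}(\text{cream}) &= \frac{2}{1} = 2 \\
\text{score}(\text{scoop}) &= \frac{1}{1} = 1 \\
\text{score}(\text{ice cream}) &= \text{score}(\text{ice}) + \text{score}(\text{cream}) = 2 + 2 = 4
\end{align*}
```

This matches the score of 4 reported for "ice cream" in the basic usage example.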
