Keyword extraction

TNTSearch includes a keyword extraction feature based on the RAKE (Rapid Automatic Keyword Extraction) algorithm. RAKE identifies key phrases in text by analyzing word frequency and co-occurrence, without requiring any training data.

Basic usage

use TeamTNT\TNTSearch\KeywordExtraction\Rake;

$rake = new Rake;

$keywords = $rake->extractKeywords("A scoop of ice cream");
// ["ice cream" => 4, "scoop" => 1]

RAKE treats stopwords (here "a" and "of") as phrase boundaries and drops them, then scores each remaining phrase from the degree and frequency of its words.

Extracting from longer text

RAKE works particularly well with longer, more technical text:

$rake = new Rake;

$text = "Compatibility of systems of linear constraints over the set of natural
    numbers. Criteria of compatibility of a system of linear Diophantine
    equations, strict inequations, and nonstrict inequations are considered.
    Upper bounds for components of a minimal set of solutions and algorithms
    of construction of minimal generating sets of solutions for all types of
    systems are given. These criteria and the corresponding algorithms for
    constructing a minimal supporting set of solutions can be used in solving
    all the considered types of systems and systems of mixed types.";

$keywords = $rake->extractKeywords($text);
// [
//     "minimal generating sets" => 8.67,
//     "linear diophantine equations" => 8.5,
//     "minimal supporting set" => 7.67,
//     ...
// ]

The returned array is sorted by score, highest first, and contains only the top-scoring third of the candidate phrases.
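
The selection step can be pictured with plain PHP arrays. This is a minimal sketch and not the library's code: topThird() is a hypothetical helper, the scores are made up, and the rounding rule (round up, keep at least one phrase) is an assumption, so very short inputs may be handled differently by Rake itself.

```php
<?php
// Hypothetical helper, not part of TNTSearch: keep the top third of scored phrases.
function topThird(array $phraseScores): array
{
    // Sort by score, highest first, keeping the phrase keys.
    arsort($phraseScores);

    // Assumed rounding: round up, and always keep at least one phrase.
    $keep = max(1, (int) ceil(count($phraseScores) / 3));

    return array_slice($phraseScores, 0, $keep, true);
}

// Made-up scores, for illustration only.
$scores = [
    "alpha beta"       => 4.0,
    "gamma"            => 1.0,
    "delta epsilon pi" => 9.0,
    "zeta"             => 1.5,
    "eta theta"        => 3.5,
    "iota"             => 1.0,
];

print_r(topThird($scores));
// 6 candidates, so 2 are kept: "delta epsilon pi" (9.0), then "alpha beta" (4.0)
```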

Without scores

Pass false as the second argument to get only the phrases, without scores:

$keywords = $rake->extractKeywords($text, false);
// ["minimal generating sets", "linear diophantine equations", ...]
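
Both result shapes are ordinary PHP arrays, so the standard array functions apply. The sketch below works on a hard-coded copy of the scored output shown earlier rather than calling the library. It assumes the unscored list matches the keys of the scored result, which this page does not confirm, and the threshold of 8 is arbitrary.

```php
<?php
// Result shape copied from the scored example in this document.
$scored = [
    "minimal generating sets"      => 8.67,
    "linear diophantine equations" => 8.5,
    "minimal supporting set"       => 7.67,
];

// Phrases only: the keys of the scored array, in the same order.
$phrases = array_keys($scored);

// Keep only phrases scoring at least 8 (hypothetical threshold).
$strong = array_filter($scored, fn ($score) => $score >= 8.0);

print_r($phrases);
print_r($strong);
```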

Language support

RAKE uses stopword lists to identify phrase boundaries. The default list is English; pass a language name to the constructor to use another:

$rake = new Rake('english');   // default
$rake = new Rake('french');
$rake = new Rake('german');
$rake = new Rake('spanish');
$rake = new Rake('croatian');
$rake = new Rake('russian');
$rake = new Rake('italian');
$rake = new Rake('latvian');
$rake = new Rake('ukrainian');
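
To see why the language matters, here is a small standalone sketch of splitting at stopword boundaries. It is not the TNTSearch implementation: splitAtStopwords() is a hypothetical helper, the two stopword lists are tiny stand-ins for the real ones, and punctuation is simply discarded, whereas RAKE implementations typically treat it as a boundary too.

```php
<?php
// Hypothetical helper, not the TNTSearch implementation.
function splitAtStopwords(string $text, array $stopwords): array
{
    // ASCII-only lowercasing; accented text would need mb_strtolower().
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $phrases = [];
    $current = [];
    foreach ($words as $word) {
        if (in_array($word, $stopwords, true)) {
            // A stopword closes the phrase being built.
            if ($current !== []) {
                $phrases[] = $current;
                $current = [];
            }
        } else {
            $current[] = $word;
        }
    }
    if ($current !== []) {
        $phrases[] = $current;
    }

    return $phrases;
}

// Tiny illustrative lists; real per-language lists are much longer.
$english = ['a', 'of', 'the'];
$french  = ['une', 'de', 'la'];

print_r(splitAtStopwords("A scoop of ice cream", $english));
// [["scoop"], ["ice", "cream"]]

print_r(splitAtStopwords("Une boule de glace", $french));
// [["boule"], ["glace"]]

print_r(splitAtStopwords("Une boule de glace", $english));
// [["une", "boule", "de", "glace"]] because the wrong list finds no boundaries
```

With a mismatched list the whole sentence collapses into one long candidate phrase, which is why the constructor argument should match the language of the text.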

How RAKE works

1. Tokenize the text into words.
2. Split the words into candidate phrases at stopword boundaries.
3. Score each word as degree(word) / frequency(word), where degree is the sum of the lengths of the phrases that contain the word and frequency is the number of times the word occurs.
4. Score each phrase as the sum of its word scores.
5. Return the top third of phrases, sorted by score.
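
The two scoring steps above (word scores, then phrase scores) can be sketched in plain PHP. This is an illustration under stated assumptions, not the library's source: wordScores() and phraseScores() are hypothetical names, candidate phrases are passed in as arrays of words, and degree is counted once per occurrence of a word.

```php
<?php
// Hypothetical helpers, not the TNTSearch implementation.

// Word score: degree(word) / frequency(word).
function wordScores(array $phrases): array
{
    $degree = [];
    $frequency = [];
    foreach ($phrases as $phrase) {
        foreach ($phrase as $word) {
            // Each occurrence adds the length of its phrase to the degree.
            $degree[$word] = ($degree[$word] ?? 0) + count($phrase);
            $frequency[$word] = ($frequency[$word] ?? 0) + 1;
        }
    }

    $scores = [];
    foreach ($degree as $word => $d) {
        $scores[$word] = $d / $frequency[$word];
    }
    return $scores;
}

// Phrase score: the sum of the scores of the words in the phrase.
function phraseScores(array $phrases, array $wordScores): array
{
    $scores = [];
    foreach ($phrases as $phrase) {
        $total = 0;
        foreach ($phrase as $word) {
            $total += $wordScores[$word];
        }
        $scores[implode(' ', $phrase)] = $total;
    }
    return $scores;
}

$phrases = [["scoop"], ["ice", "cream"]];
$words = wordScores($phrases);

print_r($words);                          // scoop => 1, ice => 2, cream => 2
print_r(phraseScores($phrases, $words));  // scoop => 1, "ice cream" => 4 (not yet sorted)
```

A word that recurs across phrases has its degree divided by its frequency: "linear" in ["linear", "constraints"] and ["linear", "diophantine", "equations"] has degree 5 and frequency 2, so it scores 2.5.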

Helper methods

The Rake class exposes lower-level methods for custom analysis:

$rake = new Rake;

// Generate candidate keyword phrases
$phrases = $rake->generateCandidateKeywords("A scoop of ice cream");
// [["scoop"], ["ice", "cream"]]

// Calculate word degree (co-occurrence measure)
$degree = $rake->wordDegree("ice", $phrases);
// 2

// Calculate word frequency
$freq = $rake->wordFrequency("ice", $phrases);
// 1

// Get full word scores
$scores = $rake->calculateWordScores($phrases);
// ["scoop" => 1, "ice" => 2, "cream" => 2]
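
As one example of custom analysis, the helper outputs can feed a different phrase metric. The sketch below hard-codes the values shown above instead of calling the library, and ranks phrases by their mean word score so that longer phrases are not favored automatically. The averaging metric is a hypothetical variation, not something Rake provides.

```php
<?php
// Values copied from the helper-method examples in this document.
$phrases = [["scoop"], ["ice", "cream"]];
$scores  = ["scoop" => 1, "ice" => 2, "cream" => 2];

// Hypothetical custom metric: mean word score per phrase.
$averages = [];
foreach ($phrases as $phrase) {
    $total = 0;
    foreach ($phrase as $word) {
        $total += $scores[$word];
    }
    $averages[implode(' ', $phrase)] = $total / count($phrase);
}

// Highest average first.
arsort($averages);
print_r($averages);
// "ice cream" => 2, "scoop" => 1
```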