TF-IDF weight is widely used in text mining. It measures the importance of a word to a document in a corpus. Recently I was working on music recommendation algorithms, and I found that many papers use TF-IDF to measure the lyric similarity between songs. I searched around and did not find a suitable TF-IDF library, so I decided to write one myself.
Basics
TF-IDF weight is calculated from two components, term frequency (TF) and inverse document frequency (IDF). The definition of the TF-IDF weight of a term $j$ in document $i$ is shown below:

$$ w_{ij} = \mathrm{tf}_{ij} \times \log\frac{N}{n_j} $$

where $\mathrm{tf}_{ij}$ is the frequency of term $j$ in document $i$, $N$ is the total number of documents, and $n_j$ is the number of documents containing term $j$.
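As a quick sanity check of the formula (the numbers are my own illustration): with $N = 25$ documents, if term $j$ has frequency $\mathrm{tf}_{ij} = 0.02$ in document $i$ (say, 2 occurrences among 100 words) and appears in $n_j = 5$ documents, then $w_{ij} = 0.02 \times \log(25/5) \approx 0.032$, using the natural logarithm as the code below does.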
For more details, please refer to the TF-IDF Tutorial.
Code
Assume we have 25 text files, each of which is one document. The code below computes the TF-IDF weights for these 25 documents.
First, we need to split the input file contents into words. Here I use `boost::tokenizer`, which by default splits a `std::string` on every character that is not a letter or a digit.
```cpp
#include <string>
#include <vector>
#include <boost/tokenizer.hpp>

// Split a document's text into word tokens.
std::vector<std::string> textParse(std::string & bigString)
{
    std::vector<std::string> vec;
    boost::tokenizer<> tok(bigString);
    for (boost::tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg)
    {
        vec.push_back(*beg);
    }
    return vec;
}
```
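For example, tokenizing a short string (a quick sketch, assuming `textParse()` above is in scope):

```cpp
#include <iostream>
#include <string>

int main()
{
    std::string s("TF-IDF, term frequency / inverse document frequency!");
    // should print: TF IDF term frequency inverse document frequency (one per line)
    for (const std::string & w : textParse(s))
        std::cout << w << std::endl;
    return 0;
}
```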
The return value is a `std::vector<std::string>`; a `std::vector<std::vector<std::string>> rawDataSet` holds these token vectors for all documents.
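For completeness, here is one way `rawDataSet` could be filled; a minimal sketch, assuming the 25 files are named `1.txt` through `25.txt` (the naming scheme and the `loadDataSet` name are my own):

```cpp
#include <fstream>
#include <sstream>

// Hypothetical loader: slurp each file into a string and tokenize it.
void loadDataSet()
{
    for (int i = 1; i <= 25; ++i)
    {
        std::ostringstream name;
        name << i << ".txt";              // assumed file naming scheme
        std::ifstream in(name.str().c_str());
        std::stringstream buffer;
        buffer << in.rdbuf();             // read the whole file at once
        std::string contents = buffer.str();
        rawDataSet.push_back(textParse(contents));
    }
}
```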
Next, we need a word list that includes all unique words across all documents. In this example, the unique words are collected in a `std::set<std::string> vocabListSet` and then copied into the global `vocabList`.
```cpp
#include <algorithm>
#include <iterator>
#include <set>

// Build the global word list vocabList from all documents.
void createVocabList()
{
    std::set<std::string> vocabListSet;
    for (std::vector<std::string> document : rawDataSet)
    {
        for (std::string word : document)
            vocabListSet.insert(word);   // std::set keeps words unique and sorted
    }
    // vocabList is a global std::vector<std::string>
    std::copy(vocabListSet.begin(), vocabListSet.end(), std::back_inserter(vocabList));
}
```
Using the word list, we can convert each document to numbers with the bag-of-words model. The function below converts the `std::vector` of words in a document to a bag-of-words vector.
```cpp
#include <algorithm>
#include <iostream>

// Convert one document's word list into a vector of per-term counts.
Eigen::VectorXf bagOfWords2VecMN(std::vector<std::string> & inputSet)
{
    std::vector<float> returnVec(vocabList.size(), 0);
    for (std::string word : inputSet)
    {
        size_t idx = std::find(vocabList.begin(), vocabList.end(), word) - vocabList.begin();
        if (idx == vocabList.size())
            std::cout << "word: " << word << " not found" << std::endl;
        else
            returnVec.at(idx) += 1;
    }
    Eigen::Map<Eigen::VectorXf> v(returnVec.data(), returnVec.size());
    return v;
}
```
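A side note: `std::find` scans `vocabList` linearly for every word, so this conversion costs O(total words × vocabulary size). A common speed-up (my own variation, not part of the original code) is to build a hash index over the vocabulary once:

```cpp
#include <unordered_map>

// Hypothetical index: word -> position in vocabList, built once after createVocabList().
std::unordered_map<std::string, size_t> vocabIndex;

void buildVocabIndex()
{
    for (size_t i = 0; i < vocabList.size(); ++i)
        vocabIndex[vocabList[i]] = i;
}
```

With this in place, the `std::find` line can be replaced by a `vocabIndex.find(word)` lookup, which is O(1) on average.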
The return value type of `bagOfWords2VecMN()` is `Eigen::VectorXf`. After converting all documents in `rawDataSet`, the vectors of all documents are stored in `Eigen::MatrixXf dataMat`.
```cpp
// Stack the per-document vectors into the global matrix dataMat
// (nrow = number of documents, ncol = vocabulary size).
void vec2mat()
{
    std::vector<Eigen::VectorXf> vec;
    for (std::vector<std::string> document : rawDataSet)
    {
        vec.push_back(bagOfWords2VecMN(document));
    }
    ncol = vec[0].size();
    nrow = vec.size();
    dataMat.resize(nrow, ncol);
    for (int i = 0; i < nrow; ++i)
    {
        dataMat.row(i) = vec[i];
    }
    rawDataSet.clear(); // release memory
}
```
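Putting the pieces together, the pipeline might be driven like this (a sketch; `loadDataSet()` is the hypothetical loader from earlier):

```cpp
int main()
{
    loadDataSet();      // fill rawDataSet from the 25 text files
    createVocabList();  // collect the unique, sorted word list
    vec2mat();          // build the document-term count matrix dataMat
    // ... compute termCount and the TF-IDF weights as shown below ...
    return 0;
}
```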
After the above functions, we have a matrix holding the number of occurrences of each term in each document. To get the $n_j$ in the weight equation, one possible solution is:
- Create another matrix using the set-of-words model, which marks 1 when a document contains the term, no matter how many times the term occurs.
- Sum all the rows of that matrix; the result is a vector holding, for each term, the number of documents it appears in.
The code below converts the bag-of-words vectors to set-of-words vectors and sums them together.
```cpp
Eigen::MatrixXf dataMat2(dataMat);                        // make a copy of dataMat
Eigen::VectorXf termCount = Eigen::VectorXf::Zero(ncol);  // zero-initialized sum vector
for (int i = 0; i != nrow; ++i)
{
    for (int j = 0; j != ncol; ++j)
    {                                // for each element in dataMat2
        if (dataMat2(i, j) > 1)      // only keep 1 and 0
            dataMat2(i, j) = 1;
    }
    termCount += dataMat2.row(i);    // no. of documents each term appears in
}
dataMat2.resize(0, 0);               // release the copy
```
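The copy and the explicit loops can also be folded into one Eigen expression; an equivalent one-liner (my variation, relying on the counts being non-negative):

```cpp
// 1 where a term occurs in a document, 0 elsewhere; column sums give n_j.
Eigen::VectorXf termCount =
    (dataMat.array() > 0.0f).cast<float>().matrix().colwise().sum().transpose();
```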
Finally, the weight of term $j$ in document $i$ is

```cpp
double tf = dataMat(i, j) / (dataMat.row(i).sum());  // normalized term frequency
double idf = log((double)nrow / (termCount(j)));     // nrow is the number of documents
double weight = tf * idf;                            // TF-IDF weight of term j in document i
```
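Looping this over every $(i, j)$ pair gives the full weight matrix; a short sketch (the name `weightMat` is mine; `termCount(j)` is at least 1 because every vocabulary word comes from some document, so the division is safe):

```cpp
#include <cmath>

Eigen::MatrixXf weightMat(nrow, ncol);
for (int i = 0; i < nrow; ++i)
{
    float rowSum = dataMat.row(i).sum();   // total number of terms in document i
    for (int j = 0; j < ncol; ++j)
    {
        float tf = dataMat(i, j) / rowSum;
        float idf = std::log((float)nrow / termCount(j));
        weightMat(i, j) = tf * idf;
    }
}
```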
See the complete code here.