TF-IDF Implementation in C++

TF-IDF weight is widely used in text mining. It measures the importance of a word to a document in a corpus. Recently I was working on music recommendation algorithms, and I found that many papers use TF-IDF to measure the lyric similarity between songs. I searched for a TF-IDF library but did not find one, so I decided to write one myself.

Basics

TF-IDF weight is calculated from two components, the Term Frequency (TF) and the Inverse Document Frequency (IDF). The definition of the TF-IDF weight of term j in document i is shown below.

$$w_{ij} = \mathrm{tf}_{ij} \times \log\frac{N}{n_j}$$

where tf_ij is the frequency of term j in document i, N is the total number of documents, and n_j is the number of documents that contain term j.
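
As a quick made-up example (using the natural log, which is what the code below uses): if a term appears 3 times in a 100-word document, tf = 3/100 = 0.03; if it also appears in 5 of the 25 documents, idf = log(25/5) ≈ 1.61, so its weight is roughly 0.03 × 1.61 ≈ 0.048.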

For more details, please refer to TF-IDF Tutorial.

Code

Assume we have 25 text files, each of which is one document. The code here computes the TF-IDF weights for these 25 documents.

First, we need to split the input file contents into words. Here I use boost::tokenizer, which splits a std::string on every character that is not a letter or a digit.

// Requires <vector>, <string>, and <boost/tokenizer.hpp>.
std::vector<std::string> textParse(std::string & bigString)
{
    std::vector<std::string> vec;
    boost::tokenizer<> tok(bigString); // default tokenizer: splits on characters that are not letters or digits
    for (boost::tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg)
    {
        vec.push_back(*beg);
    }
    return vec;
}

The return value is a std::vector<std::string>. The word vectors of all documents are collected in a std::vector<std::vector<std::string>> named rawDataSet.
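
Before the later steps, rawDataSet has to be filled from the 25 files. Below is a minimal sketch of how this could look, reusing textParse() and rawDataSet from above; the loadDocuments name and the 1.txt … 25.txt file names are assumptions made for illustration, not taken from the original code.

#include <fstream>
#include <sstream>
#include <string>

// Hypothetical loader: fills the rawDataSet declared above, using textParse().
// Assumes the 25 documents are plain-text files named 1.txt ... 25.txt.
void loadDocuments()
{
    for (int i = 1; i <= 25; ++i)
    {
        std::ifstream in(std::to_string(i) + ".txt");
        std::stringstream buffer;
        buffer << in.rdbuf();                       // read the whole file
        std::string contents = buffer.str();
        rawDataSet.push_back(textParse(contents));  // split the text into words
    }
}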

Next, we need a word list that contains all unique words across all documents. In this example the unique words are collected in a std::set<std::string> named vocabListSet and then copied into the std::vector<std::string> vocabList.

void createVocabList()
{
    std::set<std::string> vocabListSet;
    for (std::vector<std::string> document : rawDataSet)
    {
        for (std::string word : document)
            vocabListSet.insert(word); // std::set keeps each word only once
    }
    // copy the unique, sorted words into the vocabList vector
    std::copy(vocabListSet.begin(), vocabListSet.end(), std::back_inserter(vocabList));
}

Using the word list, we can convert each document into a numeric vector with the bag-of-words model. The function below converts the std::vector of words of a document into a bag-of-words vector.

Eigen::VectorXf bagOfWords2VecMN(std::vector<std::string> & inputSet)
{
    std::vector<float> returnVec(vocabList.size(), 0);
    for (std::string word : inputSet)
    {
        // position of the word in the vocabulary
        size_t idx = std::find(vocabList.begin(), vocabList.end(), word) - vocabList.begin();
        if (idx == vocabList.size())
            std::cout << "word: " << word << " not found" << std::endl;
        else
            returnVec.at(idx) += 1;
    }
    Eigen::Map<Eigen::VectorXf> v(returnVec.data(), returnVec.size());
    return v; // the mapped data is copied into the returned Eigen::VectorXf
}

The return value type of bagOfWords2VecMN() is Eigen::VectorXf. After converting all documents in rawDataSet, the vectors of all documents are stored in an Eigen::MatrixXf named dataMat.

void vec2mat()
{
    std::vector<Eigen::VectorXf> vec;
    for (std::vector<std::string> document : rawDataSet)
    {
        vec.push_back(bagOfWords2VecMN(document));
    }
    ncol = vec[0].size(); // vocabulary size
    nrow = vec.size();    // number of documents
    dataMat.resize(nrow, ncol);
    for (int i = 0; i < nrow; ++i)
    {
        dataMat.row(i) = vec[i]; // each document becomes one row
    }
    rawDataSet.clear(); // release memory
}

With the functions above, we have a matrix that contains the number of occurrences of each term in each document. To get n_j in the weight equation, one possible solution is:

  • Create another matrix using the set-of-words model, which marks 1 when a document contains the term, no matter how many times the term occurs.
  • Sum all the rows of that matrix; the result is a vector whose j-th element is the number of documents containing term j.

The code here converts each bag-of-words row into a set-of-words row and sums the rows together.

Eigen::MatrixXf dataMat2(dataMat); // make a copy of dataMat
Eigen::VectorXf termCount = Eigen::VectorXf::Zero(ncol); // the summed vector, initialized to zero; ncol is the number of columns of dataMat
for (int i = 0; i != nrow; ++i)
{
    for (int j = 0; j != ncol; ++j)
    { // for each element in dataMat2
        if (dataMat2(i,j) > 1) // only keep 1 and 0
            dataMat2(i,j) = 1;
    }
    termCount += dataMat2.row(i); // number of documents each term appears in
}
dataMat2.resize(0,0); // release memory

Finally, the TF-IDF weight of term j in document i is computed as:

double tf = dataMat(i,j) / (dataMat.row(i).sum());
double idf = log((double)nrow / (termCount(j))); // nrow is number of rows of dataMat
double weight = tf * idf;
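
As a small sketch of how these three lines could be applied to every entry of dataMat, the loop below fills a full weight matrix; the weightMat name is my own and not part of the original code, and std::log requires <cmath>.

Eigen::MatrixXf weightMat(nrow, ncol); // TF-IDF weight of each term in each document
for (int i = 0; i < nrow; ++i)
{
    double rowSum = dataMat.row(i).sum(); // total term count of document i
    for (int j = 0; j < ncol; ++j)
    {
        double tf = dataMat(i,j) / rowSum;
        double idf = std::log((double)nrow / termCount(j));
        weightMat(i,j) = tf * idf;
    }
}

The rows of weightMat can then be compared with each other to measure document similarity, which is what the lyric-similarity application mentioned in the introduction needs.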

See the complete code here