TF-IDF weight is widely used in text mining. It measures the importance of a word to a document in a corpus. Recently I was working on music recommendation algorithms, and I found that many papers use TF-IDF to measure the lyric similarity between songs. I searched around and did not find a suitable TF-IDF library, so I decided to write one myself.
Basics
TF-IDF weight is calculated from two components, term frequency (TF) and inverse document frequency (IDF). The definition of the TF-IDF weight of a term $j$ in document $i$ is shown below:

$$ w_{ij} = \mathrm{tf}_{ij} \times \log\frac{N}{n_j} $$

where $\mathrm{tf}_{ij}$ is the frequency of term $j$ in document $i$, $N$ is the total number of documents, and $n_j$ is the number of documents containing term $j$.
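As a quick sanity check of the formula (the numbers are my own illustration): with $N = 25$ documents, if term $j$ has frequency $\mathrm{tf}_{ij} = 0.02$ in document $i$ (say, 2 occurrences among 100 words) and appears in $n_j = 5$ documents, then $w_{ij} = 0.02 \times \log(25/5) \approx 0.032$, using the natural logarithm as the code below does.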
For more details, please refer to the TF-IDF Tutorial.
Code
Assume we have 25 text files, each of which is one document. The code below computes the TF-IDF weights for these 25 documents.
First, we need to split the input file contents into words. Here I use `boost::tokenizer`, which by default splits a `std::string` on every character that is not a letter or a digit.
```cpp
#include <string>
#include <vector>
#include <boost/tokenizer.hpp>

// Split a document's text into word tokens.
std::vector<std::string> textParse(std::string & bigString)
{
    std::vector<std::string> vec;
    boost::tokenizer<> tok(bigString);
    for (boost::tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg)
    {
        vec.push_back(*beg);
    }
    return vec;
}
```
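For example, tokenizing a short string (a quick sketch, assuming `textParse()` above is in scope):

```cpp
#include <iostream>
#include <string>

int main()
{
    std::string s("TF-IDF, term frequency / inverse document frequency!");
    // should print: TF IDF term frequency inverse document frequency (one per line)
    for (const std::string & w : textParse(s))
        std::cout << w << std::endl;
    return 0;
}
```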
The return value is a `std::vector<std::string>`; a `std::vector<std::vector<std::string>> rawDataSet` holds these token vectors for all documents.
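For completeness, here is one way `rawDataSet` could be filled; a minimal sketch, assuming the 25 files are named `1.txt` through `25.txt` (the naming scheme and the `loadDataSet` name are my own):

```cpp
#include <fstream>
#include <sstream>

// Hypothetical loader: slurp each file into a string and tokenize it.
void loadDataSet()
{
    for (int i = 1; i <= 25; ++i)
    {
        std::ostringstream name;
        name << i << ".txt";              // assumed file naming scheme
        std::ifstream in(name.str().c_str());
        std::stringstream buffer;
        buffer << in.rdbuf();             // read the whole file at once
        std::string contents = buffer.str();
        rawDataSet.push_back(textParse(contents));
    }
}
```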
Next, we need a word list that includes all unique words across all documents. In this example, the unique words are collected in a `std::set<std::string> vocabListSet` and then copied into the global `vocabList`.
```cpp
#include <algorithm>
#include <iterator>
#include <set>

// Build the global word list vocabList from all documents.
void createVocabList()
{
    std::set<std::string> vocabListSet;
    for (std::vector<std::string> document : rawDataSet)
    {
        for (std::string word : document)
            vocabListSet.insert(word);   // std::set keeps words unique and sorted
    }
    // vocabList is a global std::vector<std::string>
    std::copy(vocabListSet.begin(), vocabListSet.end(), std::back_inserter(vocabList));
}
```
Using the word list, we can convert each document to numbers with the bag-of-words model. The function below converts the `std::vector` of words in a document to a bag-of-words vector.
```cpp
#include <algorithm>
#include <iostream>

// Convert one document's word list into a vector of per-term counts.
Eigen::VectorXf bagOfWords2VecMN(std::vector<std::string> & inputSet)
{
    std::vector<float> returnVec(vocabList.size(), 0);
    for (std::string word : inputSet)
    {
        size_t idx = std::find(vocabList.begin(), vocabList.end(), word) - vocabList.begin();
        if (idx == vocabList.size())
            std::cout << "word: " << word << " not found" << std::endl;
        else
            returnVec.at(idx) += 1;
    }
    Eigen::Map<Eigen::VectorXf> v(returnVec.data(), returnVec.size());
    return v;
}
```
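A side note: `std::find` scans `vocabList` linearly for every word, so this conversion costs O(total words × vocabulary size). A common speed-up (my own variation, not part of the original code) is to build a hash index over the vocabulary once:

```cpp
#include <unordered_map>

// Hypothetical index: word -> position in vocabList, built once after createVocabList().
std::unordered_map<std::string, size_t> vocabIndex;

void buildVocabIndex()
{
    for (size_t i = 0; i < vocabList.size(); ++i)
        vocabIndex[vocabList[i]] = i;
}
```

With this in place, the `std::find` line can be replaced by a `vocabIndex.find(word)` lookup, which is O(1) on average.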
The return value type of `bagOfWords2VecMN()` is `Eigen::VectorXf`. After converting all documents in `rawDataSet`, the vectors of all documents are stored in `Eigen::MatrixXf dataMat`.
```cpp
// Stack the per-document vectors into the global matrix dataMat
// (nrow = number of documents, ncol = vocabulary size).
void vec2mat()
{
    std::vector<Eigen::VectorXf> vec;
    for (std::vector<std::string> document : rawDataSet)
    {
        vec.push_back(bagOfWords2VecMN(document));
    }
    ncol = vec[0].size();
    nrow = vec.size();
    dataMat.resize(nrow, ncol);
    for (int i = 0; i < nrow; ++i)
    {
        dataMat.row(i) = vec[i];
    }
    rawDataSet.clear(); // release memory
}
```
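Putting the pieces together, the pipeline might be driven like this (a sketch; `loadDataSet()` is the hypothetical loader from earlier):

```cpp
int main()
{
    loadDataSet();      // fill rawDataSet from the 25 text files
    createVocabList();  // collect the unique, sorted word list
    vec2mat();          // build the document-term count matrix dataMat
    // ... compute termCount and the TF-IDF weights as shown below ...
    return 0;
}
```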
After the above functions, we have a matrix holding the number of occurrences of each term in each document. To get the $n_j$ in the weight equation, one possible solution is:
- Create another matrix using the set-of-words model, which marks 1 when a document contains the term, no matter how many times the term occurs.
- Sum all the rows of that matrix; the result is a vector holding, for each term, the number of documents it appears in.
The code below converts the bag-of-words vectors to set-of-words vectors and sums them together.
```cpp
Eigen::MatrixXf dataMat2(dataMat);                        // make a copy of dataMat
Eigen::VectorXf termCount = Eigen::VectorXf::Zero(ncol);  // zero-initialized sum vector
for (int i = 0; i != nrow; ++i)
{
    for (int j = 0; j != ncol; ++j)
    {                                // for each element in dataMat2
        if (dataMat2(i, j) > 1)      // only keep 1 and 0
            dataMat2(i, j) = 1;
    }
    termCount += dataMat2.row(i);    // no. of documents each term appears in
}
dataMat2.resize(0, 0);               // release the copy
```
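The copy and the explicit loops can also be folded into one Eigen expression; an equivalent one-liner (my variation, relying on the counts being non-negative):

```cpp
// 1 where a term occurs in a document, 0 elsewhere; column sums give n_j.
Eigen::VectorXf termCount =
    (dataMat.array() > 0.0f).cast<float>().matrix().colwise().sum().transpose();
```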
Finally, the weight of term $j$ in document $i$ is

```cpp
double tf = dataMat(i, j) / (dataMat.row(i).sum());  // normalized term frequency
double idf = log((double)nrow / (termCount(j)));     // nrow is the number of documents
double weight = tf * idf;                            // TF-IDF weight of term j in document i
```
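Looping this over every $(i, j)$ pair gives the full weight matrix; a short sketch (the name `weightMat` is mine; `termCount(j)` is at least 1 because every vocabulary word comes from some document, so the division is safe):

```cpp
#include <cmath>

Eigen::MatrixXf weightMat(nrow, ncol);
for (int i = 0; i < nrow; ++i)
{
    float rowSum = dataMat.row(i).sum();   // total number of terms in document i
    for (int j = 0; j < ncol; ++j)
    {
        float tf = dataMat(i, j) / rowSum;
        float idf = std::log((float)nrow / termCount(j));
        weightMat(i, j) = tf * idf;
    }
}
```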
See the complete code here.