What is IDF and how is it calculated?
Inverse Document Frequency
IDF is one of the most basic terms of modern search engine relevance calculation. It is used to determine how rare a term is and how relevant it is to the original query. For example, take the query "the Golden State Warriors". This query is difficult because no individual word identifies our intention to search for a basketball team. Instead we need to look at groups of words and weigh how relevant each group is to the overall query. This is the basis of flat search query relevance, and it all starts with IDF.
Before we can calculate IDF we need to associate each document or query with a set of features. For this tutorial we will use only n-grams. An n-gram is a sequence of one or more consecutive words. We can use Python's string methods to quickly extract features from a document or query.
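A minimal sketch of n-gram extraction using only string methods. The function name `extract_ngrams`, the `max_n` parameter, and the choice to lowercase and split on whitespace are assumptions made for illustration; a real tokenizer would also handle punctuation.

```python
def extract_ngrams(text, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) found in text."""
    # Lowercase and split on whitespace (a deliberately naive tokenizer).
    words = text.lower().split()
    features = set()
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive words across the text.
        for i in range(len(words) - n + 1):
            features.add(" ".join(words[i:i + n]))
    return features


query_features = extract_ngrams("the Golden State Warriors")
print(sorted(query_features))
```

For the example query this yields single words such as "warriors" alongside multi-word features such as "golden state warriors", which is the kind of group that actually identifies the basketball team.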
Next we need to calculate Document Frequency, then invert it. The formula for IDF starts with the total number of documents in our database: N. We divide this by the number of documents containing our term: tD. For any term that appears in at least one document, this ratio is never less than 1: a value of exactly 1 means the term is present in every document, and no term can be more common than that. We then normally take the logarithm of this ratio, because we may be indexing billions of documents and the raw value can get pretty unwieldy unless we express it in terms of orders of magnitude.
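Written out, the description above corresponds to the following formula. The base of the logarithm is not fixed by the text; base 10 is assumed here because it matches the "order of magnitude" reading, and any other base only rescales the values.

```latex
\mathrm{IDF}(t) = \log_{10}\!\left(\frac{N}{\mathit{tD}}\right),
\qquad 1 \le \mathit{tD} \le N

% Worked example: N = 1{,}000{,}000 documents, term found in tD = 1{,}000 of them
\mathrm{IDF}(t) = \log_{10}\!\left(\frac{1{,}000{,}000}{1{,}000}\right) = \log_{10}(1000) = 3

% A term found in every document: tD = N
\mathrm{IDF}(t) = \log_{10}(1) = 0
```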
Here we can calculate the IDF for all of our features in a small database of documents.
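The calculation might be sketched as follows. The four sample documents, the helper names `extract_ngrams` and `compute_idf`, and the use of a base-10 logarithm are all assumptions chosen for illustration, not part of the original tutorial.

```python
import math


def extract_ngrams(text, max_n=2):
    """Return the set of word n-grams (n = 1..max_n) found in text."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def compute_idf(docs, max_n=2):
    """Map each feature to log10(N / tD) over the given documents."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # extract_ngrams returns a set, so each document counts a feature once.
        for feature in extract_ngrams(doc, max_n):
            doc_freq[feature] = doc_freq.get(feature, 0) + 1
    return {f: math.log10(n_docs / df) for f, df in doc_freq.items()}


# Hypothetical mini-database of documents.
documents = [
    "the golden state warriors won the game",
    "the golden gate bridge is in san francisco",
    "the state of california",
    "the warriors play basketball",
]

idf = compute_idf(documents)
for term in ["the", "golden", "golden state"]:
    print(f"{term!r}: {idf[term]:.3f}")
# 'the': 0.000
# 'golden': 0.301
# 'golden state': 0.602
```

Here "the" appears in all four documents and scores 0, "golden" appears in two and scores log10(2), and "golden state" appears in only one and scores log10(4).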
As you can see in the output, rare terms are assigned a higher IDF and can therefore be weighted more heavily in the relevance calculation.