What is IDF and how is it calculated?

andrew-lucker
1,176 views

Open Source Your Knowledge, Become a Contributor

Technology knowledge has to be shared and made accessible for free. Join the movement.

Create Content

Inverse Document Frequency

IDF is one of the most basic terms of modern search engine relevance calculation. It is used to determine how rare a term is and how relevant it is to the original query. For example take the query "the Golden State Warriors". This query is difficult because there is no invidual word that identifies our intention to search for a basketball team. Instead we need to look at groups of words and weigh how relevant each set is to the overall query. This is the basics of flat search query relevance and it all starts with IDF.

Before we can calculate IDF we need to associate each document or query with a set of features. For this tutorial we will use only n-grams. An n-gram is one or more words. We can use python's string methods to quickly extract features from a document or query.

1
2
3
4
5
6
7
8
9
10
def extract_features( document ):
terms = tuple(document.lower().split())
features = set()
for i in range(len(terms)):
for n in range(1,4):
if i+n <= len(terms):
features.add(terms[i:i+n])
return features
print(extract_features('The Golden State Warriors'))
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Next we need to calculate Document Frequency, then invert it. The formula for IDF starts with the total number of documents in our database: N. Then we divide this by the number of documents containing our term: tD. This will never result in a number less than 1, because 1 indicates that the term is present in all documents, there is no document frequency more common than that limit. Next we normally take the logarithm of this whole term, because we may be indexing billions of documents, and the IDF can get pretty unwieldy unless we refer to it in terms of order of magnitude.

Here we can calculate the IDF for all of our features in a small database of documents.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
def extract_features( document ):
terms = tuple(document.lower().split())
features = set()
for i in range(len(terms)):
for n in range(1,4):
if i+n <= len(terms):
features.add(terms[i:i+n])
return features
documents = [
"This article is about the Golden State Warriors",
"This article is about the Golden Arches",
"This article is about state machines",
"This article is about viking warriors"]
def calculate_idf( documents ):
N = len(documents)
from collections import Counter
tD = Counter()
for d in documents:
features = extract_features(d)
for f in features:
tD[" ".join(f)] += 1
IDF = []
import math
for (term,term_frequency) in tD.items():
term_IDF = math.log(float(N) / term_frequency)
IDF.append(( term_IDF, term ))
IDF.sort(reverse=True)
return IDF
for (IDF, term) in calculate_idf(documents):
print(IDF, term)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

As you can see in the output, rare terms are assigned higher IDF and thus can be weighted higher in relevancy calculation.

Open Source Your Knowledge: become a Contributor and help others learn. Create New Content