To illustrate how this partitioning scheme produces a balanced cluster
assignment, we used 4,450 email addresses from the Enron dataset as
stand-ins for arbitrary keys and calculated how they would be distributed
across our 5 clusters using the Python script below:
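N = 5  # number of clusters
counts = [0] * N  # number of keys assigned to each cluster
with open('chapter14/enron.txt') as f:
    for email in f:
        # Note: in Python 3, hash() of a str is salted per process unless
        # PYTHONHASHSEED is set, so exact counts can differ between runs.
        counts[hash(email.rstrip()) % N] += 1
print(counts)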
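
Because Python 3 salts the built-in hash() for strings on each run, a key may map to a different cluster every time the script above is started. The following is a minimal sketch of the same modulo partitioning with a hash that stays stable across processes. It assumes CRC32 (from the standard zlib module) as the hash function, which the text does not prescribe, and it uses synthetic example.com addresses in place of the Enron file so that it runs on its own. The cluster_for helper is a hypothetical name introduced for illustration.

```python
import zlib

N = 5  # number of clusters, as in the script above


def cluster_for(key: str, n: int = N) -> int:
    """Map a key to a cluster index using CRC32, which is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % n


# Hypothetical keys standing in for the email addresses in the dataset.
keys = [f"user{i}@example.com" for i in range(4450)]

counts = [0] * N
for key in keys:
    counts[cluster_for(key)] += 1
print(counts)
```

With a stable hash, any client that computes cluster_for for the same key arrives at the same cluster, which is what allows a key written by one process to be found by another.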