In the process of categorizing a series of webpages crawled from the internet, I have a set of text files containing the keywords collected from each webpage.
For example, say I crawled webpages about cricket, football and basketball (for the time being, one website each).
I use Lucene and Tika to extract the text content of each webpage and save it as text files (since it is in unencrypted form).

Now, my task: given an unknown website, I'm supposed to decide, based on the keywords present in it, whether it is a cricket, football or basketball page.
I'm supposed to use cosine similarity/distribution.

So I have 3 text files with a set of words in each. Given a 4th text file, I'm supposed to compute its similarity with each of the 3 and then determine which one it is most similar to.
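
To make the comparison concrete, here is a minimal sketch in plain Java. It rests on assumptions that are not part of the setup above: each file is treated as a bag of words with raw term counts as weights, tokens are lowercase ASCII letter runs, and the class name CosineClassifier and the sample keyword strings are hypothetical. It does not use Lucene or Tika. It only shows the cosine computation (dot product divided by the product of the vector lengths) and the selection of the best match.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class CosineClassifier {

    // Turn raw text into a term-frequency map.
    // Assumption: tokens are runs of ASCII letters, compared case-insensitively.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); 0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        if (na == 0.0 || nb == 0.0) {
            return 0.0;
        }
        return dot / (na * nb);
    }

    // Return the label whose text has the highest cosine similarity to the
    // unknown text. On a tie, the label that comes first in map order wins.
    static String classify(String unknownText, Map<String, String> categoryTexts) {
        Map<String, Integer> unknown = termFrequencies(unknownText);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> e : categoryTexts.entrySet()) {
            double score = cosine(unknown, termFrequencies(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up keyword lists standing in for the 3 saved text files.
        Map<String, String> categories = new LinkedHashMap<>();
        categories.put("cricket", "bat ball wicket over innings bowler");
        categories.put("football", "goal penalty striker offside ball kick");
        categories.put("basketball", "dunk hoop rebound dribble ball court");

        // Made-up content standing in for the 4th (unknown) file.
        String unknown = "the bowler took a wicket in the final over";
        System.out.println(classify(unknown, categories));
    }
}
```

With real pages, raw counts tend to be dominated by common words, so removing stop words or weighting terms by TF-IDF before taking the cosine is a common refinement. Whether that is needed here depends on how the keyword files were built.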

Please help.

If any further explanation is required, please ask.