- 02-04-2012, 06:35 PM #1
Comparing one document with a list of other documents using cosine similarity
While segregating a series of webpages crawled from the internet, I have a set of text files containing the keywords collected from each webpage.
For example, say I crawled webpages about cricket, football and basketball (for the time being, one website each).
I use Lucene and Tika to extract the text content of each webpage and save it as text files (since it is in an unencrypted form).
Now, my task: given an unknown website, I'm supposed to decide, depending on the keywords present in it, whether it is a cricket, football or basketball page.
I'm supposed to use cosine similarity/distribution.
So I have 3 text files with a set of words in each. Given a 4th text file, I'm supposed to get the similarity of this file with each of the 3, and then determine which one it is most similar to.
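To make the comparison concrete, here is a rough sketch in plain Java. It is not a Lucene API, only one possible way to do it, and it assumes each file is reduced to raw term-frequency counts (no TF-IDF weighting, stemming or stop-word removal). The cosine similarity of two count vectors a and b is dot(a, b) / (|a| * |b|). The class name, method names and the sample keyword strings are all hypothetical; in practice the strings would be the contents of the text files.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class CosineSimilaritySketch {

    // Build a term-frequency vector: word -> number of occurrences.
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    public static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either vector is empty.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    // Return the label of the known document most similar to the unknown text.
    public static String mostSimilar(Map<String, String> knownTexts, String unknownText) {
        Map<String, Integer> unknown = termFrequencies(unknownText);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> e : knownTexts.entrySet()) {
            double score = cosine(termFrequencies(e.getValue()), unknown);
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical keyword lists standing in for the 3 text files.
        Map<String, String> known = new HashMap<>();
        known.put("cricket", "bat ball wicket over innings bowler");
        known.put("football", "goal penalty striker offside keeper ball");
        known.put("basketball", "hoop dunk rebound dribble court ball");

        // Hypothetical content of the 4th (unknown) file.
        String unknown = "the bowler took a wicket in the last over";
        System.out.println(mostSimilar(known, unknown));
    }
}
```

With raw counts, very common words can dominate the score, so weighting the terms (for example with TF-IDF) before taking the cosine may give better separation between the categories.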
If any further explanation is required, please ask.