- 02-04-2012, 06:35 PM #1
Comparing one document with a list of other documents using cosine similarity
I am sorting a series of webpages crawled from the internet into categories, and I have a set of text files containing the keywords collected from each webpage.
For example, say I crawled webpages about cricket, football and basketball (for the time being, one website each).
I use Lucene and Tika to extract the text content of each webpage and save it as a text file (since it is in an unencrypted form).
Now, my task: given an unknown website, I am supposed to decide from the keywords present in it whether it is a cricket, football or basketball page.
I am supposed to use cosine similarity/distribution.
So I have 3 text files with a set of words in each. Given a 4th text file, I am supposed to compute its similarity with each of the 3 and then find which one it is most similar to.
Please help.
If any further explanation is required, please ask.
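
To make the task concrete, here is a minimal sketch of the comparison in plain Java, under a few assumptions: each text file has already been read into a String, words are weighted by raw term counts (no TF-IDF or stop-word removal), and the names CosineSimilarity, termFrequencies, cosine and mostSimilar, as well as the sample keyword lists, are made up for illustration. It does not use Lucene or Tika; it only shows the cosine computation, cos(a, b) = (a . b) / (|a| * |b|), applied to word-count vectors.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class CosineSimilarity {

    // Count how often each lower-cased word occurs in the text.
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a term-count vector.
    private static double norm(Map<String, Integer> vector) {
        double sum = 0.0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two term-count vectors.
    // Returns 0.0 when either vector is empty.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            int x = entry.getValue();
            int y = b.getOrDefault(entry.getKey(), 0);
            dot += (double) x * y;
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    // Return the category whose text has the highest cosine similarity
    // with the unknown text (the first one wins on ties).
    public static String mostSimilar(Map<String, String> categoryTexts, String unknownText) {
        Map<String, Integer> unknown = termFrequencies(unknownText);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> entry : categoryTexts.entrySet()) {
            double score = cosine(termFrequencies(entry.getValue()), unknown);
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up keyword lists standing in for the three crawled text files.
        Map<String, String> categories = new LinkedHashMap<>();
        categories.put("cricket", "bat ball wicket bowler innings over");
        categories.put("football", "goal striker penalty offside kick midfield");
        categories.put("basketball", "hoop dunk rebound dribble court pointer");

        String unknown = "the bowler took a wicket in the final over";
        System.out.println(mostSimilar(categories, unknown));
    }
}
```

With real pages, very common words such as "the" will dominate raw counts, so some form of stop-word filtering or TF-IDF weighting would probably be needed before the scores are meaningful.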