
# Thread: tf idf in java

- 12-25-2009, 04:23 AM #1
- Member
- Join Date
- Dec 2009
- Posts
- 9

- Rep Power
- 0

## tf idf in java

Hi All,

I am looking for a simple Java class that can compute tf-idf. I want to run a similarity test on two documents. I found many big APIs that include a tf-idf class, but I do not want to pull in a big jar file just for my simple test. Please help!

Or at least, if someone can tell me how to find TF and IDF, I will calculate the results myself :)

OR

If you can point me to a good Java tutorial for this.

Please do not tell me to search Google; I already did for 3 days and couldn't find anything :(

Please do not refer me to Lucene

- 12-25-2009, 09:13 AM #2
- Join Date
- Sep 2008
- Location
- Voorschoten, the Netherlands
- Posts
- 14,071
- Blog Entries
- 7

- Rep Power
- 24

For a simple explanation of this all look here.

kind regards,

Jos

- 12-26-2009, 02:39 AM #3
- Member
- Join Date
- Dec 2009
- Posts
- 9

- Rep Power
- 0

Thanks JosAH for your reply, but I am sure you didn't read my post. I had spent 3 days searching for this. Do you think I am so dumb that I don't even know about the wiki? I already saw it.

Anyway, thanks.

Anybody else?

- 12-26-2009, 09:08 AM #4
- Join Date
- Sep 2008
- Location
- Voorschoten, the Netherlands
- Posts
- 14,071
- Blog Entries
- 7

- Rep Power
- 24

Yes I did read your post and I did read that wiki page; that's why I recommended it. It has a fine explanation of the (simple) math for the subject. What is wrong with it? Did you expect spoonfeeding code? The math isn't rocket science and can be easily implemented by yourself.

kind regards,

Jos

Last edited by JosAH; 12-26-2009 at 09:57 AM.

- 12-26-2009, 09:48 AM #5
- Member
- Join Date
- Dec 2009
- Posts
- 9

- Rep Power
- 0

Hi JosAH,

Thank you for your quick reply. No, I do not need spoonfeeding. I am a beginner, so I want to learn things and how to implement them too. OK, let me explain.

I know what TF and IDF are and how to calculate them :). But I was looking for tips on how to implement them in Java. For example, I have two folders A and B, and I want to read the files from A and compare them with the files in B to see how similar they are. Let's say I somehow calculated it and got some TF/IDF values. It would be something like a 10000-value TF/IDF matrix. How would I know which file is most similar to which? (Sorry if I am not clear, please ask me.) Should I use clustering techniques or what?

- 12-26-2009, 10:54 AM #6
- Join Date
- Sep 2008
- Location
- Voorschoten, the Netherlands
- Posts
- 14,071
- Blog Entries
- 7

- Rep Power
- 24

Note that the entity 'statistical similarity' between two documents is defined for a single word/sentence/phrase/term or whatever. If a document writes about, say, cats and dogs, while another document writes solely about cats and mice both documents would be quite similar w.r.t. the word 'cat' but will have no similarity at all when you search for 'dog' or 'mouse'.

If you want to compare two sets of documents for one 'item' you indeed end up with a matrix where a document from one set represents a row of the matrix and a document of the other set represents a column of the matrix.

A Map<String, Integer> could handle the mapping from a document name to an index value. You have to compute the similarity for all document combinations AxB to fill in the matrix values. And all that for one single 'item'.
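The bookkeeping described above could be sketched roughly as follows. This is only an illustration under assumptions: the class name `SimilarityMatrix`, the helper names `indexOf` and `fill`, the file names, and the placeholder score function are all hypothetical and are not from this thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

class SimilarityMatrix {

    // Assigns each document name an index in list order (hypothetical helper).
    static Map<String, Integer> indexOf(List<String> names) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String name : names) {
            index.putIfAbsent(name, index.size());
        }
        return index;
    }

    // Fills matrix[row of a][column of b] with score(a, b) for every pair in A x B.
    static double[][] fill(List<String> a, List<String> b,
                           BiFunction<String, String, Double> score) {
        Map<String, Integer> rows = indexOf(a);
        Map<String, Integer> cols = indexOf(b);
        double[][] matrix = new double[rows.size()][cols.size()];
        for (String da : a) {
            for (String db : b) {
                matrix[rows.get(da)][cols.get(db)] = score.apply(da, db);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Placeholder score: 1.0 when the names match, else 0.0.
        // A real score would compare the document contents instead.
        double[][] m = fill(List.of("a1.txt", "a2.txt"),
                            List.of("a1.txt", "b2.txt", "b3.txt"),
                            (x, y) -> x.equals(y) ? 1.0 : 0.0);
        System.out.println(Arrays.deepToString(m));
    }
}
```

The score function is passed in as a parameter so the matrix code does not care how the similarity for one pair is actually computed.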

kind regards,

Jos

- 12-28-2009, 01:43 AM #7
- Member
- Join Date
- Dec 2009
- Posts
- 9

- Rep Power
- 0

Hi JosAH,

Thank you for your reply. It helped me a lot in writing a tf-idf function in Java. I have TF working, but I have one question. The wiki says IDF is calculated from how many documents contain the term, but I am confused.

For example, here is the string "JosAH is great. JosAH rocks", so the TF would be 2/5. For IDF, there are 2 documents and each document contains the term JosAH. So:

Do we just check whether the term occurs in the other documents, or do we count how many times it occurs in them?

ALSO!

Let's say we somehow calculated TF/IDF for the term "JosAH" and it is

tf/idf = 0.232

but we want the similarity of the full document with the 2nd document. Do I have to calculate TF/IDF for each term and then sum them to get the actual tf/idf? If I am wrong, please correct me.

Thanks

Last edited by agazerboy; 12-28-2009 at 04:52 AM.

- 12-29-2009, 05:54 PM #8
- Member
- Join Date
- Dec 2009
- Posts
- 9

- Rep Power
- 0

hellpppppppppppp

- 12-30-2009, 01:25 AM #9
Hint: distance matrix + clustering algorithms

(Am I right?)

"There is no foolproof thing; fools are too smart."

"Why can't you solve my Problem ?"

- 12-31-2009, 02:16 AM #10
- Member
- Join Date
- Dec 2009
- Posts
- 9

- Rep Power
- 0

Thanks AndreB for your reply. I understand why we calculate a distance, but why should we use a clustering algorithm to measure similarity?

- 12-31-2009, 11:39 AM #11
No, the similarity measure is already given by the distance.

Clustering is what groups similar documents together. Which clustering to use also depends heavily on the conclusion you want to reach.

Introduction to Information Retrieval

- 12-31-2009, 09:40 PM #12
- Member
- Join Date
- Dec 2009
- Posts
- 9

- Rep Power
- 0

Thanks for your reply. Can you PM me your email id? I just want to discuss something.

I am using TF/IDF to calculate similarity. For example, if I have the following two documents:

Doc A => cat dog

Doc B => dog sparrow

I would normally expect the similarity to be 50%, but when I calculate TF/IDF I get the following:

Tf values for Doc A

dog tf = 0.5

cat tf = 0.5

Tf values for Doc B

dog tf = 0.5

sparrow tf = 0.5

IDF values for Doc A

dog idf = -0.4055

cat idf = 0

IDF values for Doc B

dog idf = -0.4055 ( without +1 formula 0.6931)

sparrow idf = 0

TF/IDF value for Doc A

0.5 x (-0.4055) + 0.5 x 0 = -0.20275

TF/IDF value for Doc B

0.5 x (-0.4055) + 0.5 x 0 = -0.20275

Now it looks like the similarity is -0.20275. Is it?

Or am I missing something?

Or is there some next step too? Please tell me so I can calculate that as well.

- 12-31-2009, 11:16 PM #13
- Member
- Join Date
- Dec 2009
- Posts
- 9

- Rep Power
- 0

By the way, I am using the natural log.

I couldn't find any log2 function in Java.
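For reference, `java.lang.Math` offers `Math.log` (natural log) and `Math.log10`, and a base-2 logarithm can be derived from either with the change-of-base rule. A minimal sketch, where the class name `Log2` and the helper `log2` are made up for illustration:

```java
class Log2 {

    // Base-2 logarithm via change of base: log2(x) = ln(x) / ln(2).
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // Floating-point division may leave a tiny rounding error in the result.
        System.out.println(log2(8.0));
        System.out.println(log2(0.5));
    }
}
```

Changing the base only multiplies every IDF value by the same constant, so it should not change which documents come out as most similar under a cosine measure.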

- 01-02-2010, 12:27 PM #14
I think your IDF calculations are wrong:

idf1_i = log ( (number of documents) / (number of documents containing term i) )

or, to avoid division by zero,

idf2_i = log ( (number of documents) / (1 + number of documents containing term i) )

With this in mind:

idf1_dog = log (2 / 2) = 0

or

idf2_dog = log (2 / 3) ≈ -0.18 (!) in base 10, or about -0.41 with the natural log. With the +1 variant, negative values can only occur when the term is present in all documents, so we actually use max(idf2, 0), which again gives 0.

The tf-idf value is simply the product of the two:

tf-idf_(i,j) = tf_(i,j) * idf_i

There you go ;-)
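The formulas above could be implemented along these lines. This is only a sketch under assumptions: documents are taken to be already split into lowercase tokens, the natural log is used, the idf1 variant (no +1) is chosen, and the names `TfIdf`, `tf`, `idf`, and `weights` are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class TfIdf {

    // tf: occurrences of each term divided by the number of tokens in the document.
    static Map<String, Double> tf(List<String> doc) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1.0, Double::sum);
        }
        tf.replaceAll((term, count) -> count / doc.size());
        return tf;
    }

    // idf1: log(N / df), where df counts the documents containing the term.
    // Each document counts once, however often the term occurs in it.
    static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    // tf-idf weight per term of one document; terms unknown to idf get weight 0.
    static Map<String, Double> weights(List<String> doc, Map<String, Double> idf) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Double> e : tf(doc).entrySet()) {
            w.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        }
        return w;
    }

    public static void main(String[] args) {
        List<String> a = List.of("cat", "dog");
        List<String> b = List.of("dog", "sparrow");
        Map<String, Double> idf = idf(List.of(a, b));
        System.out.println(weights(a, idf));
        System.out.println(weights(b, idf));
    }
}
```

The result is one weight per term and per document, that is, a vector for each document rather than a single number.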

In my opinion your example is too small, which is why you get strange values. To measure similarity you also need a vector space model to represent the documents, with the cosine of the angle between two document vectors as the similarity measure. The tf-idf values are only the weights of each vector component (term); they are not a similarity measure by themselves.
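A cosine measure over such term-weight vectors might look like the sketch below. It assumes each document is represented as a map from term to weight, with missing terms treated as weight 0; the names `Cosine`, `dot`, `norm`, and `cosine` are hypothetical.

```java
import java.util.Map;

class Cosine {

    // Dot product over the terms of a; terms missing from b contribute 0.
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            sum += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    // Euclidean length of a vector.
    static double norm(Map<String, Double> v) {
        return Math.sqrt(dot(v, v));
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); defined here as 0 for a zero vector.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double denom = norm(a) * norm(b);
        return denom == 0.0 ? 0.0 : dot(a, b) / denom;
    }

    public static void main(String[] args) {
        // Raw term counts for "cat dog" and "dog sparrow".
        Map<String, Double> a = Map.of("cat", 1.0, "dog", 1.0);
        Map<String, Double> b = Map.of("dog", 1.0, "sparrow", 1.0);
        System.out.println(cosine(a, b));
    }
}
```

Because the vectors are indexed by term, the two documents do not need to contain the same number of words. Note also that in a corpus of only two documents any shared term has idf1 = log(2/2) = 0, so the tf-idf vectors share no non-zero component and the cosine comes out as 0, which is one way a tiny example produces strange values.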

If you have questions, you can use the private messaging here on the forum.

- 01-04-2010, 07:42 AM #15
- Member
- Join Date
- Dec 2009
- Posts
- 9

- Rep Power
- 0

Hi AndreB,

Thank you for helping people like me :)

I just need a small hint.

I calculated the tf/idf values of two documents. Here they are:

1.txt

0.0

0.5

2.txt

0.0

0.5

The documents look like this:

1.txt => dog cat

2.txt => cat elephant

Now that I have the tf/idf values, can anybody tell me how to use them to calculate the cosine similarity?

I already read Wikipedia and all the other tutorials, which say I should calculate the dot product, then find the distance, then divide the dot product by the distance. I am not good at math, which is why I couldn't understand what they are doing with X and Y :)

If you can just show me how to calculate it using my values, I will understand and implement it.

One more question: is it important that both documents have the same number of words?

I tried to send you a PM, but it said I don't have enough points to send one :(

Can you PM me your instant messenger id?

Thanks !

- 01-04-2010, 11:07 AM #16
Introduction to Information Retrieval: please read chapter 6. It answers all your questions.
