Thread: tf idf in java

  1. #1
    agazerboy is offline Member
    Join Date
    Dec 2009
    Posts
    9
    Rep Power
    0

    Default tf idf in java

    Hi All,
    I am looking for a simple Java class that can compute tf-idf. I want to do a similarity test on 2 documents. I found many big APIs that include a tf-idf class, but I do not want to pull in a big jar file just to do my simple test. Please help!
    Or at least, if someone can tell me how to find TF and IDF, I will calculate the results myself :)
    OR
    If you can point me to a good Java tutorial for this.
    Please do not tell me to search Google; I already did for 3 days and couldn't find anything :(
    Please do not refer me to Lucene.

  2. #2
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,560
    Blog Entries
    7
    Rep Power
    21

    Default

    For a simple explanation of all this, look here (the wiki page on tf-idf).

    kind regards,

    Jos

  3. #3
    agazerboy is offline Member

    Thanks JosAH for your reply, but I am sure you didn't read my post. I have spent 3 days searching for this. Do you think I am so dumb that I don't even know about the wiki? I already saw it.
    Anyway, thanks.

    Anybody else?

  4. #4
    JosAH is offline Moderator

    Yes I did read your post and I did read that wiki page; that's why I recommended it. It has a fine explanation of the (simple) math for the subject. What is wrong with it? Did you expect spoonfeeding code? The math isn't rocket science and can be easily implemented by yourself.

    kind regards,

    Jos
    Last edited by JosAH; 12-26-2009 at 08:57 AM.

  5. #5
    agazerboy is offline Member

    Hi JosAH,
    Thank you for your quick reply. No, I do not need spoonfeeding. I am a beginner, so I want to learn things and how to implement them too. OK, let me explain.
    I know what TF and IDF are and how to calculate them :). But I was looking for a tip on how to implement them in Java. For example, I have two folders, A and B, and I want to read the files from A and compare them with the files in folder B to see how similar they are. Let's say I somehow calculated it and got some TF/IDF values. It would be something like a 10000-entry TF/IDF matrix. How would I know which file is most similar to which? (Sorry if I am not clear, please ask me.) Should I use clustering techniques or what?

  6. #6
    JosAH is offline Moderator

    Note that the entity 'statistical similarity' between two documents is defined for a single word/sentence/phrase/term or whatever. If a document writes about, say, cats and dogs, while another document writes solely about cats and mice both documents would be quite similar w.r.t. the word 'cat' but will have no similarity at all when you search for 'dog' or 'mouse'.

    If you want to compare two sets of documents for one 'item' you indeed end up with a matrix where a document from one set represents a row of the matrix and a document of the other set represents a column of the matrix.

    A Map<String, Integer> could handle the mapping from a document name to an index value. You have to compute the similarity for all document combinations AxB to fill in the matrix values. And all that for one single 'item'.
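
    The bookkeeping described above could be sketched roughly as follows. All names here (SimilarityMatrixSketch, indexByName, fill, the Similarity interface) are hypothetical, and the similarity measure itself is left as a placeholder to be supplied by the caller.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names and structure are illustrative, not from the thread.
class SimilarityMatrixSketch {

    // Placeholder for whatever per-pair measure is chosen.
    interface Similarity {
        double between(String docA, String docB);
    }

    // Assign each distinct document name a stable index: 0, 1, 2, ...
    static Map<String, Integer> indexByName(List<String> names) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String name : names) {
            index.putIfAbsent(name, index.size());
        }
        return index;
    }

    // Rows are documents of set A, columns are documents of set B.
    static double[][] fill(List<String> setA, List<String> setB, Similarity sim) {
        Map<String, Integer> rows = indexByName(setA);
        Map<String, Integer> cols = indexByName(setB);
        double[][] matrix = new double[rows.size()][cols.size()];
        for (String a : rows.keySet()) {
            for (String b : cols.keySet()) {
                matrix[rows.get(a)][cols.get(b)] = sim.between(a, b);
            }
        }
        return matrix;
    }
}
```

    The most similar B document for a given A document is then the column with the largest value in that document's row.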

    kind regards,

    Jos

  7. #7
    agazerboy is offline Member

    Hi JosAH,
    Thank you for your reply. It helped me alot to make tf-idf function in java. I made tf but I have one question. As on wiki they wrote IDF can be calculated that how many documents have the term. But I am confused.

    For example, Here is the string "JosAH is great. JoshAH rocks" so the TF would be 2/5 and for IDF there are 2 documents and each documents contain JoshAH term. So
    Will we just see if that term occur in other documents or we will see how many times it occurs in other documents?

    ALSO !

    Lets say some how we calculated TF/IDF and the term is "JosAH" and its
    tf/idf = 0.232
    but we want to see the full document similarity with 2nd document so i have to calculate TF/IDF for each term? then sum it to get actual tf/idf ??? if i am wrong then please correct me
    Thanks
    Last edited by agazerboy; 12-28-2009 at 03:52 AM.
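
    On the question above of whether IDF looks at occurrences or at documents: under the usual definition, the document frequency counts each document at most once, no matter how often the term appears in it, while TF counts occurrences inside a single document. A minimal sketch, assuming a deliberately naive tokenizer; the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch; the class name, method names and tokenizer are assumptions.
class TermCountsSketch {

    // Naive tokenizer: lowercase, then split on anything that is not a-z or 0-9.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Term frequency: occurrences of the term divided by the document length in tokens.
    static double tf(String term, List<String> docTokens) {
        if (docTokens.isEmpty()) {
            return 0.0;
        }
        long count = docTokens.stream().filter(term::equals).count();
        return (double) count / docTokens.size();
    }

    // Document frequency: number of documents that contain the term at least once.
    static int df(String term, List<List<String>> corpus) {
        int n = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                n++;
            }
        }
        return n;
    }
}
```

    With this tokenizer, tf("josah", tokenize("JosAH is great. JosAH rocks")) is 2/5, matching the hand calculation in the post.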

  8. #8
    agazerboy is offline Member

    hellpppppppppppp

  9. #9
    AndreB is offline Senior Member
    Join Date
    Dec 2009
    Location
    Stuttgart, Germany
    Posts
    114
    Rep Power
    0

    Default

    hint: distance matrix + clustering algorithms
    (am i right?)
    "There is no foolproof thing; fools are too smart."
    "Why can't you solve my Problem ?"

  10. #10
    agazerboy is offline Member

    Thanks AndreB for your reply. Well, I understand why we calculate the distance, but why should we use a clustering algorithm to measure similarity?

  11. #11
    AndreB is offline Senior Member

    No, the similarity measure is already given by the distance.
    Clustering is what groups similar documents together. How you cluster also depends heavily on what conclusion you want to reach.
    Introduction to Information Retrieval

  12. #12
    agazerboy is offline Member

    Thanks for your reply. Can you PM me your email ID? I just want to discuss something about similarity.

    I am using TF/IDF to calculate similarity. For example, suppose I have the following two docs:

    Doc A => cat dog
    Doc B => dog sparrow

    Naturally their similarity should be 50%, but when I calculate TF/IDF I get the following:

    Tf values for Doc A
    dog tf = 0.5
    cat tf = 0.5

    Tf values for Doc B

    dog tf = 0.5
    sparrow tf = 0.5

    IDF values for Doc A
    dog idf = -0.4055
    cat idf = 0

    IDF values for Doc B
    dog idf = -0.4055 ( without +1 formula 0.6931)
    sparrow idf = 0

    TF/IDF value for Doc A
    0.5 x (-0.4055) + 0.5 x 0 = -0.20275

    TF/IDF value for Doc B
    0.5 x (-0.4055) + 0.5 x 0 = -0.20275

    Now it looks like the similarity is -0.20275. Is that right?
    Or am I missing something?
    Or is there a next step too? Please tell me so I can calculate that as well.

  13. #13
    agazerboy is offline Member

    By the way, I am using the natural log.
    I couldn't find any log2 function in Java.
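
    java.lang.Math offers log (natural logarithm) and log10, and a base-2 logarithm can be derived from either with the change-of-base identity. A small sketch; Log2Sketch and log2 are made-up names:

```java
// Illustrative helper; the class and method names are assumptions.
class Log2Sketch {

    // log2(x) = ln(x) / ln(2), by the change-of-base identity.
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }
}
```

    Changing the base only multiplies every idf value by the same constant, so it does not change which terms or documents rank higher.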

  14. #14
    AndreB is offline Senior Member

    I think your idf calculations are wrong:

    idf1_i = log ( (number of documents) / (number of documents containing term i) )
    or, to avoid division by zero,
    idf2_i = log ( (number of documents) / (1 + number of documents containing term i) )

    With this in mind:
    idf1_dog = log (2 / 2) = 0
    or
    idf2_dog = log (2 / 3) = -0.18 with a base-10 log (-0.41 with the natural log) (!). Since negative values can only occur when the term is present in all documents, we actually have to take max(idf2, 0), which also gives 0.

    The tf-idf value is simply the product of both:
    tf-idf_(i,j) = tf_(i,j) * idf_i
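
    The formulas above translate almost line for line into Java. This is a sketch using the natural logarithm; the class name IdfSketch and the method names are invented for illustration.

```java
// Illustrative sketch of the idf variants above; all names are assumptions.
class IdfSketch {

    // idf1 = log(N / df). Evaluates to Infinity when docsWithTerm is 0.
    static double idf1(int numDocs, int docsWithTerm) {
        return Math.log((double) numDocs / docsWithTerm);
    }

    // idf2 = log(N / (1 + df)). The +1 avoids dividing by zero.
    static double idf2(int numDocs, int docsWithTerm) {
        return Math.log((double) numDocs / (1 + docsWithTerm));
    }

    // max(idf2, 0): clamps the negative value that occurs when every document has the term.
    static double idf2Clamped(int numDocs, int docsWithTerm) {
        return Math.max(idf2(numDocs, docsWithTerm), 0.0);
    }

    // tf-idf weight of one term in one document.
    static double tfIdf(double tf, double idf) {
        return tf * idf;
    }
}
```

    For the two-document example, idf1(2, 2) is 0, idf2(2, 2) is about -0.405, and idf2Clamped(2, 2) is 0.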

    there you go ;-)

    In my opinion your example is too small, and therefore you get strange values. To measure similarity you also need a vector space model to represent the documents, with cosine similarity as the similarity measure. The tf-idf values are only the weights of each vector component (term), not a similarity measure by themselves.
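
    Cosine similarity is the dot product of the two weight vectors divided by the product of their lengths. A minimal sketch, assuming both documents are already expressed over the same term order; CosineSketch is a made-up name, and returning 0 for an all-zero vector is a convention chosen here, not something from the thread.

```java
// Illustrative sketch; class name and zero-vector convention are assumptions.
class CosineSketch {

    // cos(a, b) = (a . b) / (|a| * |b|), with both vectors over the same term order.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumed convention: nothing to compare against
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

    Because the weights are normalised by vector length, a long and a short document with the same term proportions still score close to 1.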

    If you have questions, you can use the private messaging here on the forum.

  15. #15
    agazerboy is offline Member

    Hi AndreB,
    Thank you for helping people like me :)
    I just need a small hint...

    I calculated the tf/idf values of two documents. Here they are:
    1.txt
    0.0
    0.5
    2.txt
    0.0
    0.5

    The documents are:
    1.txt => dog cat
    2.txt => cat elephant

    Now that I have the tf/idf values, can anybody tell me how to use them to calculate
    cosine similarity?

    I already read Wikipedia and other tutorials, which say I should calculate the dot product, then find the distance, then divide the dot product by the distance. I am not good at math, which is why I couldn't understand what they are doing with X and Y :)

    If you can just show me how to calculate it using my values, I will understand and implement it.

    One more question: is it important that both documents have the same number of words?

    I tried to send you a PM, but it said I don't have enough points to send one :(
    Can you PM me your instant messenger ID?
    Thanks!
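
    On the last question: the documents do not need the same number of words, because each one becomes a vector over the combined vocabulary and a term missing from a document simply has weight 0. The sketch below works on term-to-weight maps. It assumes that the 0.5 values in the post belong to the unshared terms (dog, elephant) and the 0.0 values to the shared term (cat); the post does not say which value belongs to which term. SparseCosineSketch is a made-up name.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch; names and the term-to-value assignment are assumptions.
class SparseCosineSketch {

    // Euclidean length of a weight vector stored as term -> weight.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Dot product over shared terms divided by the product of the two lengths.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey()); // null means the term is absent, i.e. weight 0
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        if (na == 0.0 || nb == 0.0) {
            return 0.0; // assumed convention for an all-zero vector
        }
        return dot / (na * nb);
    }

    public static void main(String[] args) {
        Map<String, Double> doc1 = new HashMap<>();
        doc1.put("dog", 0.5); // assumed: unshared term carries the 0.5
        doc1.put("cat", 0.0); // assumed: shared term carries the 0.0
        Map<String, Double> doc2 = new HashMap<>();
        doc2.put("cat", 0.0);
        doc2.put("elephant", 0.5);
        System.out.println(cosine(doc1, doc2)); // prints 0.0
    }
}
```

    Under that assumption the similarity comes out as 0: the only shared term has weight 0, which is the kind of strange result a two-document corpus produces.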

  16. #16
    AndreB is offline Senior Member

    Please read chapter 6 of Introduction to Information Retrieval. It answers all your questions.
