Results 1 to 17 of 17
Like Tree1Likes
  • 1 Post By pbrockway2

Thread: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

  1. #1
    Suzanne M is offline Member
    Join Date
    Dec 2011
    Posts
    7
    Rep Power
    0

    Default Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    Hi, I am supposed to implement the KMeans clustering algo for numerical and textual data. This is what I have done for data that's been read from a binary file. This code uses packages, I want to know a better way to code KMeans without involvong packages. I would also like some help on how to convert textual data to data that can have KMeans applied directly on them. Finding working codes online is a struggle. Right now m at my wits end.

    Java Code:
    package t_kmeans;
     
    import java.io.*;
    import java.util.*;
    import java.text.DecimalFormat;
    import java.util.List;
     
    public class BasicKMeans {
        static private String fname= "palsy.dat";
        static List attributes = null;
        static int maxNum = 0;
        static DecimalFormat d2 = new DecimalFormat("##.####");
        /**
         * @param args the command line arguments
         */
        public static void main(String[] args) {
     
            // TODO code application logic here
            // Get the data file name
     
            // Read SVMlight format data (*.dat)
            attributes =  readSVMLightFile(fname);
            if(attributes!=null){
                System.out.println("Num of attributes : " + maxNum);
                System.out.println("Num of instances : " + attributes.size());
                Kmeans_cluster(attributes);
            }
        }// end of mains
     
     
          // Implementing K-means
      static void Kmeans_cluster(List attributes)
      {
        // Algorithm
        // 1.Randomly generate k centroid points: C={C1,…,Ck}
        // 2.Partition objects into k nonempty subsets (S1,…,Sk) by assigning each object xito the nearest centroid point by calculating distance metrics d(i,j) between the object xiand the centroid points Cj.
        // 3.Stop if no more new assignment: output C, M(C), (S1,…,Sk)
        // 4.Update k centroid points C1,…,Ck using the k subsets (S1,…,Sk)
        // 5.Go back to Step 2
     
        int k = 2;
        double[][] centroids = new double[k][maxNum];
        double[][] temporycentroids = new double[k][maxNum];
        Vector[] cluster_Array = new Vector[k]; // creating k no of cluster
        boolean flag = false;
     
        // implementation
        if (k > attributes.size())
            System.exit(0);
     
        System.out.println("The Total cluster number: " + k);
     
        //1.Randomly generate k centroid points
     
        centroids = randomCentroids(k);
     
     
            // Step 2 and 3
            // Calculate distance matric between each data point with centroids
        do{
     
             cluster_Array = cluster_Array(centroids,k);
     
            //testing the number of k clusters
            for(int idx = 0; idx < k; idx++)
            {
                System.out.println(cluster_Array[idx].size()+ " record "  + " in Cluster "+ idx );
            } // end of step 2 and 3
     
     
            // searching new k number centroids points
            int attribute_id = 1;
            for (int idx = 0; idx < k; idx++)
            {
                for (int idx1 = 0; idx1 < maxNum; idx1++)
                {
                    temporycentroids [idx][idx1] = calculate_centroids(cluster_Array[idx],attribute_id);
                    attribute_id++;
                }
                attribute_id = 1;
            }
     
            for (int idx = 0; idx < k; idx++)
            {
                System.out.println("centroid point " + idx);
                for (int idx1 = 0; idx1 < maxNum; idx1++)
                {
                    System.out.print(idx1+". ");
                    System.out.println(temporycentroids[idx][idx1]);
                    System.out.print(" ");
                }
                System.out.println();
            }    // end of step 4
     
            // Step 5
            if (compare_centroids(centroids,temporycentroids,k,maxNum) == true)
            {
                flag = true;
                System.out.println("Centroid points are same.");
                for(int idx = 0; idx < k; idx++)
                    System.out.println(cluster_Array[idx].size()+ " record "  + " in Cluster "+ idx );
     
                for (int idx = 0; idx < k; idx++)
                {
                    System.out.println("centroid point " + idx);
                    for (int idx1 = 0; idx1 < maxNum; idx1++)
                    {
                        System.out.print(idx1+". ");
                        System.out.println(temporycentroids[idx][idx1]);
                        System.out.print(" ");
     
                    }
                    System.out.println();
                }
            }
            else    // update centroids
            {
                    System.out.println("centroid points are different.");
                    centroids = temporycentroids;
                        for (int idx = 0; idx < k; idx++)
                        {
                            System.out.println("updated centroid point " + idx);
                            for (int idx1 = 0; idx1 < maxNum; idx1++)
                            {
                                System.out.print(idx1+". ");
                                System.out.println(centroids[idx][idx1]);
                                System.out.print(" ");
     
                            }
                            System.out.println();
                        }
            }
    }while(flag == false);
      }
    	// KmeanCluster
     
     
     
     
    	// calculate_centroids start
     
      static double calculate_centroids(List attributes, int attribute_id){
         int N = attributes.size();
         double centroid = 0.00;
         if(N > 0)
         {
              for(int idx = 0; idx < N; idx++)
              centroid = centroid + Double.valueOf(d2.format(getAttributeValue(attributes,idx,attribute_id)));
              centroid = Double.valueOf(d2.format(centroid / N));
          }
           return centroid;
      }  // end of calculate_centroids function
     
     
    	// start randomCentroids
       static double[][] randomCentroids(int k){
            int[] all = new int[k];
            for (int idx=0; idx<k; idx++){
                all[idx] = 0;  // initialization
            }
            Random rd = new Random();
            int i = 0;
            while (i < k){
                int selectedNumber = rd.nextInt(attributes.size());
                if (adding(all,i,i+1))
                    i++;
            }
     
            System.out.println("The data point which is done random : " );
            for( int idx1=0; idx1<k ; idx1++)
                System.out.println(all[idx1]+1 + " ");
     
     
             double [][] temp = new double [k][maxNum];
     
             for(int row = 0;  row< k; row++)
             {   int attrib_id = 1;
                 for(int col = 0; col < maxNum; col++)
                 {
                     temp[row][col] = Double.valueOf(d2.format(getAttributeValue(attributes,all[row],attrib_id)));
                      attrib_id++;
                  }
             }
             for(int row = 0;  row< k; row++)
             {
                System.out.print("centroid " + row + ": ");
                for(int col = 0; col < maxNum; col++)
                {
                     System.out.println(temp[row][col] + " ");
                 }
                System.out.println();
            }
            return temp;
        }
        // end of randomCentroids function
     
     
    // add function start
       static boolean adding(int[] list, int size, int value){
     
           for ( int idx=0; idx<size; idx++){
               if (list[idx] == value) //Does random value exist or not in list[]?
                       return false; //if same
           }
           list[size] = value;
           return true;
        }
       /*
        * end of add function
        */
     
     
    // start cluster_array function
        static Vector[] cluster_Array(double [][] centroids, int k){
            //create given number of cluster
            Vector[] cluster_Array = new Vector[k];  // create k number of cluster array
            int closet_cluster = 0;       // to keep nearest centroid
            double closet_distance = 0.0;  // to keep nearest distance
            for(int idx=0; idx < k; idx++)
                cluster_Array[idx] = new Vector(); // insert each List into each cluster
     
            for (int idx = 0; idx < attributes.size(); idx++)
            {
                 List fv1 = (List)attributes.get(idx);
                 for (int idx1 = 0; idx1 < k; idx1++)
                 {
                    double[] fv2 = centroids[idx1];
                    double distance = Double.valueOf(d2.format(distanceMetric(fv1, fv2))); // round to 2 decimal double value
                    if (idx1==0)
                    {
                        closet_distance = distance;
                        closet_cluster = idx1;
                    }
                    else if (idx1 > 0)
                    {
                       if(distance <= closet_distance)
                        {
                            closet_distance = distance;
                            closet_cluster = idx1;
                        }
                    }
     
                 }
                 cluster_Array[closet_cluster].add((List)attributes.get(idx));
            }
     
            return cluster_Array;
     }
     
     
        /*
         * Comparing clusters ... old cluster and new one
         * implemented by Thu Thu Tun
         */
     
         static boolean compare_centroids(double [][] c1, double[][] c2, int k, int maxNum){
         for (int idx = 0; idx < k; idx++)
             for (int idx1 = 0; idx1 < maxNum; idx1++)
                if (c1[idx][idx1] != c2[idx][idx1])
                    return false;
     
           System.out.println("temp centroid");
           for (int idx = 0; idx < k; idx++)
            {
                System.out.println("temp centroid point " + idx);
                for (int idx1 = 0; idx1 < maxNum; idx1++)
                {
                    System.out.print(idx1+". ");
                    System.out.println(c1[idx][idx1]);
                    System.out.print(" ");
     
                }
                System.out.println();
            }
     
         return true;
     }
     
     /**
       * Read SVMLight data file into Vector data structure
       *
       * @param fname
       * @return List = a list of feature vectors = [ [ label, [id, value], [id, value],...], .... ]
       *     a feature vector = [ label, [id, value], [id, value],...]
       *     label = int
       *     id = double
       *     value = double
       */
        private static List readSVMLightFile(String fname) {
    		try{
    			List featureV = new Vector();
    			// Reading input by lines:
    			BufferedReader in = new BufferedReader( new FileReader(fname) );
    			String str = new String("");
    			while ((str = in.readLine()) != null){
    			str = str.trim();
    			if(str.length() > 0){
    				String line[];
    				String fs[];
    				line = str.split("\\s",2);
    				String clabel = line[0].trim().replace("+", "");
    				fs = line[1].trim().split("\\s");   // feature list #:v ...
    				List fl = new Vector();
    				fl.add(Integer.parseInt(clabel));
    			for(int idx = 0; idx < fs.length; idx++)
    			{
    				double f[] = {0,0};
    				String attr[] = fs[idx].split(":");
    				int aid = Integer.valueOf(attr[0].trim());
    				if( aid > maxNum) maxNum = aid;
    				f[0] = aid;                             // attribute id
    				f[1] = Double.valueOf(attr[1].trim());    // attribute value
    				fl.add(f);
    			}
    			featureV.add(fl);
            }
          }
          in.close();
          return featureV;
        }
        catch(IOException e){
            System.out.println("Error reading file: " + e);
            return null;
        }
        }//end of read file
     
       /**
       * Calculate Euclidean distance metric between two sparse feature vectors
       * @param i   feature vector = [ label, [id, value], [id, value],...]
       * @param j   feature vector = [ label, [id, value], [id, value],...]
       * modified by Thu Thu Tun
       * @return
       */
      static double distanceMetric(List fv1, double[] fv2){
    		double[] fv = new double[maxNum];
    		for(int idx = 0; idx < maxNum; idx++){
    			fv[idx] = 0;
    		}
    		Iterator itr = fv1.iterator();
    		int k = 0;
    		while(itr.hasNext())
    		{
    		if( k > 0){
    			double f[] = (double[])itr.next();
    			fv[(int)f[0]-1] = f[1];
    		}
    		else
    			itr.next();
    		k++;
        }
     
     
        for (int idx = 0; idx < maxNum; idx++)
        {
          fv[idx] -= fv2[idx];
        }
        double distance = 0;
        for(int idx = 0; idx < maxNum; idx++)
        {
          distance += Double.valueOf(d2.format(fv[idx]*fv[idx]));
        }
     
        return Double.valueOf(d2.format(Math.sqrt(distance)));
        }
     
     
       /**
       * Retrive the attribute value
       * @param attributes
       * @param data_index   0 based: 0,...,dataSize-1
       * @param attribute_id 1 based: 1,...,dimension
       * @return
       */
      static double getAttributeValue(List attributes, int data_index, int attribute_id)
      {
        List fv = (List)attributes.get(data_index);  // feature vector
     
        Iterator itr = fv.iterator();
        int k = 0;
        while(itr.hasNext())
        {
          if( k > 0)
          {
            double f[] = (double[])itr.next();
            if( attribute_id == (int)f[0] )
            {
              return f[1];
            }
          }
          else
            itr.next();
          k++;
        }
     
        return 0;
      }
     
       /**
       * Retrive the class label
       * @param attributes
       * @param data_index   0 based: 0,...,dataSize-1
       * @return
       */
      static int getClassLabel(List attributes, int data_index)
      {
        List fv = (List)attributes.get(data_index);  // feature vector
        return (Integer)fv.get(0);
      }
    }

  2. #2
    Norm's Avatar
    Norm is online now Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,328
    Rep Power
    25

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    This code uses packages, I want to know a better way to code KMeans without involving packages.
    What are you referring to? The package statement or the import statements?

    how to convert textual data to data
    Can you show an example of what conversion you are talking about?
    The wrapper classes: Integer and Double have parse methods that will convert String data to numeric primitive data like int and double.

  3. #3
    2by4 is offline Banned
    Join Date
    Dec 2011
    Posts
    143
    Rep Power
    0

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    I think the first thing you need to sort out is your data structure.

    Instead of using a Vector of Vectors with heterogeneous elements, have you considered using HashMap<>?

    HashMap has key value pairs and you can search it by key, iterate through its values and check if a key or value is in the map.

    You can use HashMap<>() for featureV with clabel as the key.

    Then the corresponding value in featureV would be another HashMap, with key: aid (attribute id) and value: attribute value.

    That way, you can sort, iterate and query your data cleanly without having to keep track of heterogeneous data.

    You can find the specification for HashMap here.

    It can help make your data structure properly reflect the semantics of the data.
    Last edited by 2by4; 12-18-2011 at 02:29 PM.

  4. #4
    Suzanne M is offline Member
    Join Date
    Dec 2011
    Posts
    7
    Rep Power
    0

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    @ Norm: I am referring to the package statement not the import statements.

    @2by4: Can HashMap be used for KMeans? I have never seen it being used. KMeans is a clustering algorithm. I have adapted an online example to create my KMeans algo but I want to adapt it further thus I need help or auggestions on having the code revamped.

    Both: For clustering textual data, I am referring to clustering a list of words in a data set where the words get converted to values which define their cosine similarity that can be used to place the words in clusters. But first most important for me is revamping the code above please suggest how I can proceed with it. Any coding examples would be helpful.

  5. #5
    2by4 is offline Banned
    Join Date
    Dec 2011
    Posts
    143
    Rep Power
    0

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    Quote Originally Posted by Suzanne M View Post
    Java Code:
    /**
       * Retrive the attribute value
       * @param attributes
       * @param data_index   0 based: 0,...,dataSize-1
       * @param attribute_id 1 based: 1,...,dimension
       * @return
       */
      static double getAttributeValue(List attributes, int data_index, int attribute_id)
      {
        List fv = (List)attributes.get(data_index);  // feature vector
     
        Iterator itr = fv.iterator();
        int k = 0;
        while(itr.hasNext())
        {
          if( k > 0)
          {
            double f[] = (double[])itr.next();
            if( attribute_id == (int)f[0] )
            {
              return f[1];
            }
          }
          else
            itr.next();
          k++;
        }
     
        return 0;
      }
    I don't know KMeans. I am only going on what I understand of your data structure.

    fv appears to be a list whose first item is a name (which gets skipped in the above method), followed by a series of 2 element arrays. This is very inefficient.

    It would be better to declare fv as

    Map<Integer, Double> fv = new HashMap<>();

    It would be populated with attribute_id, attribute_value pairs, using the fv.put(attribute_id, attribute_value) method.

    Then fv.get(attribute_id) can replace that method quoted above. And it would be much faster because HashMap uses a hashing table. And it is just one line of code.

    fv.remove(attribute_id) and fv.put(attribute_id, attribute_value) can be used to move words from one cluster to another, etc.

    attributes variable would be declared as follows

    Map<String, Map<Integer, Double>> attributes = new HashMap<>();

    It would be populated with cname, fv pairs.

    attributes.get(cname) would get you any cluster by cname.

    for(String c : attributes.keySet()) would iterate over the cnames

    for(Map<Integer, Double> m : attributes.values()) would iterate over the fv

    (Next step would be to move the fv map into an object which implements the interface Comparable and has a method calculateMetric())
    Last edited by 2by4; 12-18-2011 at 09:46 PM.

  6. #6
    doWhile is offline Moderator
    Join Date
    Jul 2010
    Location
    California
    Posts
    1,642
    Rep Power
    7

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED


  7. #7
    pbrockway2 is offline Moderator
    Join Date
    Feb 2009
    Location
    New Zealand
    Posts
    4,565
    Rep Power
    12

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    doWhile likes this.

  8. #8
    Suzanne M is offline Member
    Join Date
    Dec 2011
    Posts
    7
    Rep Power
    0

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    @2by4: Thanks a lot for all your suggestions, that's what I was looking for. Will implement them and see what output the code gives.

    @doWhile & pbrockway2: I apologize for cross-posting but I was running out of time and wanted to get as many ideas as possible so that I can resolve my problem fast.

  9. #9
    Norm's Avatar
    Norm is online now Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,328
    Rep Power
    25

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    how to convert list to map.
    Can you explain why you need to use a Map?
    How would you get/generate a key value for the items in the list when you put them into a map?
    Given that key, would it be useful for getting the desired value out of the map?

  10. #10
    Suzanne M is offline Member
    Join Date
    Dec 2011
    Posts
    7
    Rep Power
    0

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    The accuracy of the outcome when using maps is greater. I was using lists till now but I realized lists were takine me nowhere, then this friend said she implemented maps & got a very accurate outcome. Can someone explain to me how to convert Lists to maps? The steps please for my given code. I am confused.

  11. #11
    Norm's Avatar
    Norm is online now Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,328
    Rep Power
    25

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    How would you get/generate a key value for the items in the list when you put them into a map?
    Given that key, would it be useful for getting the desired value out of the map?

  12. #12
    Suzanne M is offline Member
    Join Date
    Dec 2011
    Posts
    7
    Rep Power
    0

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    As of now, I don't know how to implement in maps, whatever I have done for lists hence am seeking someone's help to tell me how to do the conversion. If I myself don't know how to do the conversion, how can I discuss key value generation? Please someone tell me how to do the conversion, its an urgent need. Thanks.

  13. #13
    JosAH's Avatar
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,453
    Blog Entries
    7
    Rep Power
    20

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    I shouldn't know how to convert a list to a map because a list stores objects and uses indexes to find them, while maps store tuples (key, value), e.g. how should a list {A, B, C} be converted to a map? As { (A, 0), (B, 1), (C, 3) } or as { (0, A), (1, B), (2, C) } or something completely different?

    kind regards,

    Jos
    cenosillicaphobia: the fear for an empty beer glass

  14. #14
    Norm's Avatar
    Norm is online now Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,328
    Rep Power
    25

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    Why do you want your data to be in a Map? How will you access the data once it is in a Map?

  15. #15
    Suzanne M is offline Member
    Join Date
    Dec 2011
    Posts
    7
    Rep Power
    0

    Default Re: How to convert lists to maps???

    I want to put the data in a map bcose someone else implemented the code in maps and got very accurate results. I as of now don't know how to access the key value. Can someone show me how to do the conversion to maps first? I feel that I am going in circles here. Anyone with knowledge pls help.

  16. #16
    JosAH's Avatar
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,453
    Blog Entries
    7
    Rep Power
    20

    Default Re: How to convert lists to maps???

    Quote Originally Posted by Suzanne M View Post
    I want to put the data in a map bcose someone else implemented the code in maps and got very accurate results. I as of now don't know how to access the key value. Can someone show me how to do the conversion to maps first? I feel that I am going in circles here. Anyone with knowledge pls help.
    Have you read my previous reply #13? Nobody knows what the key and/or value objects are supposed to be.

    kind regards,

    Jos
    cenosillicaphobia: the fear for an empty beer glass

  17. #17
    Norm's Avatar
    Norm is online now Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,328
    Rep Power
    25

    Default Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

    someone else implemented the code in maps and got very accurate results
    Ask the "someone" what they did.

    how to do the conversion to maps
    One very simple way would be to use an index number for the key. Completely useless, but that would put the data into a Map;

    Define theIndex as:
    int theIndex = <STARTINGVALUE>;

    theMap.put(new Integer(theIndex), theData); // Put the data into the Map with theIndex as key

    theIndex++; // increment theIndex to generate a unique new key

Similar Threads

  1. Path solving algorithm/method help needed!
    By Zee Best in forum Advanced Java
    Replies: 3
    Last Post: 10-18-2011, 01:32 AM
  2. help i'm desperate
    By AniMaind in forum New To Java
    Replies: 17
    Last Post: 01-08-2011, 01:05 AM
  3. Replies: 0
    Last Post: 08-20-2010, 12:24 PM
  4. k-means algorithm for document clustering
    By Ramya Selvaraj in forum Forum Guides
    Replies: 0
    Last Post: 02-06-2009, 06:11 AM
  5. Replies: 1
    Last Post: 10-14-2008, 08:52 AM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •