# Thread: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

1. Member Join Date
Dec 2011
Posts
7
Rep Power
0

## Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

Hi, I am supposed to implement the KMeans clustering algo for numerical and textual data. This is what I have done for data that's been read from a binary file. This code uses packages, I want to know a better way to code KMeans without involvong packages. I would also like some help on how to convert textual data to data that can have KMeans applied directly on them. Finding working codes online is a struggle. Right now m at my wits end.

Java Code:
```package t_kmeans;

import java.io.*;
import java.util.*;
import java.text.DecimalFormat;
import java.util.List;

public class BasicKMeans {
static private String fname= "palsy.dat";
static List attributes = null;
static int maxNum = 0;
static DecimalFormat d2 = new DecimalFormat("##.####");
/**
* @param args the command line arguments
*/
public static void main(String[] args) {

// TODO code application logic here
// Get the data file name

// Read SVMlight format data (*.dat)
attributes =  readSVMLightFile(fname);
if(attributes!=null){
System.out.println("Num of attributes : " + maxNum);
System.out.println("Num of instances : " + attributes.size());
Kmeans_cluster(attributes);
}
}// end of mains

// Implementing K-means
static void Kmeans_cluster(List attributes)
{
// Algorithm
// 1.Randomly generate k centroid points: C={C1,â¦,Ck}
// 2.Partition objects into k nonempty subsets (S1,â¦,Sk) by assigning each object xito the nearest centroid point by calculating distance metrics d(i,j) between the object xiand the centroid points Cj.
// 3.Stop if no more new assignment: output C, M(C), (S1,â¦,Sk)
// 4.Update k centroid points C1,â¦,Ck using the k subsets (S1,â¦,Sk)
// 5.Go back to Step 2

int k = 2;
double[][] centroids = new double[k][maxNum];
double[][] temporycentroids = new double[k][maxNum];
Vector[] cluster_Array = new Vector[k]; // creating k no of cluster
boolean flag = false;

// implementation
if (k > attributes.size())
System.exit(0);

System.out.println("The Total cluster number: " + k);

//1.Randomly generate k centroid points

centroids = randomCentroids(k);

// Step 2 and 3
// Calculate distance matric between each data point with centroids
do{

cluster_Array = cluster_Array(centroids,k);

//testing the number of k clusters
for(int idx = 0; idx < k; idx++)
{
System.out.println(cluster_Array[idx].size()+ " record "  + " in Cluster "+ idx );
} // end of step 2 and 3

// searching new k number centroids points
int attribute_id = 1;
for (int idx = 0; idx < k; idx++)
{
for (int idx1 = 0; idx1 < maxNum; idx1++)
{
temporycentroids [idx][idx1] = calculate_centroids(cluster_Array[idx],attribute_id);
attribute_id++;
}
attribute_id = 1;
}

for (int idx = 0; idx < k; idx++)
{
System.out.println("centroid point " + idx);
for (int idx1 = 0; idx1 < maxNum; idx1++)
{
System.out.print(idx1+". ");
System.out.println(temporycentroids[idx][idx1]);
System.out.print(" ");
}
System.out.println();
}    // end of step 4

// Step 5
if (compare_centroids(centroids,temporycentroids,k,maxNum) == true)
{
flag = true;
System.out.println("Centroid points are same.");
for(int idx = 0; idx < k; idx++)
System.out.println(cluster_Array[idx].size()+ " record "  + " in Cluster "+ idx );

for (int idx = 0; idx < k; idx++)
{
System.out.println("centroid point " + idx);
for (int idx1 = 0; idx1 < maxNum; idx1++)
{
System.out.print(idx1+". ");
System.out.println(temporycentroids[idx][idx1]);
System.out.print(" ");

}
System.out.println();
}
}
else    // update centroids
{
System.out.println("centroid points are different.");
centroids = temporycentroids;
for (int idx = 0; idx < k; idx++)
{
System.out.println("updated centroid point " + idx);
for (int idx1 = 0; idx1 < maxNum; idx1++)
{
System.out.print(idx1+". ");
System.out.println(centroids[idx][idx1]);
System.out.print(" ");

}
System.out.println();
}
}
}while(flag == false);
}
// KmeanCluster

// calculate_centroids start

static double calculate_centroids(List attributes, int attribute_id){
int N = attributes.size();
double centroid = 0.00;
if(N > 0)
{
for(int idx = 0; idx < N; idx++)
centroid = centroid + Double.valueOf(d2.format(getAttributeValue(attributes,idx,attribute_id)));
centroid = Double.valueOf(d2.format(centroid / N));
}
return centroid;
}  // end of calculate_centroids function

// start randomCentroids
static double[][] randomCentroids(int k){
int[] all = new int[k];
for (int idx=0; idx<k; idx++){
all[idx] = 0;  // initialization
}
Random rd = new Random();
int i = 0;
while (i < k){
int selectedNumber = rd.nextInt(attributes.size());
if (adding(all,i,i+1))
i++;
}

System.out.println("The data point which is done random : " );
for( int idx1=0; idx1<k ; idx1++)
System.out.println(all[idx1]+1 + " ");

double [][] temp = new double [k][maxNum];

for(int row = 0;  row< k; row++)
{   int attrib_id = 1;
for(int col = 0; col < maxNum; col++)
{
temp[row][col] = Double.valueOf(d2.format(getAttributeValue(attributes,all[row],attrib_id)));
attrib_id++;
}
}
for(int row = 0;  row< k; row++)
{
System.out.print("centroid " + row + ": ");
for(int col = 0; col < maxNum; col++)
{
System.out.println(temp[row][col] + " ");
}
System.out.println();
}
return temp;
}
// end of randomCentroids function

// add function start
static boolean adding(int[] list, int size, int value){

for ( int idx=0; idx<size; idx++){
if (list[idx] == value) //Does random value exist or not in list[]?
return false; //if same
}
list[size] = value;
return true;
}
/*
* end of add function
*/

// start cluster_array function
static Vector[] cluster_Array(double [][] centroids, int k){
//create given number of cluster
Vector[] cluster_Array = new Vector[k];  // create k number of cluster array
int closet_cluster = 0;       // to keep nearest centroid
double closet_distance = 0.0;  // to keep nearest distance
for(int idx=0; idx < k; idx++)
cluster_Array[idx] = new Vector(); // insert each List into each cluster

for (int idx = 0; idx < attributes.size(); idx++)
{
List fv1 = (List)attributes.get(idx);
for (int idx1 = 0; idx1 < k; idx1++)
{
double[] fv2 = centroids[idx1];
double distance = Double.valueOf(d2.format(distanceMetric(fv1, fv2))); // round to 2 decimal double value
if (idx1==0)
{
closet_distance = distance;
closet_cluster = idx1;
}
else if (idx1 > 0)
{
if(distance <= closet_distance)
{
closet_distance = distance;
closet_cluster = idx1;
}
}

}
cluster_Array[closet_cluster].add((List)attributes.get(idx));
}

return cluster_Array;
}

/*
* Comparing clusters ... old cluster and new one
* implemented by Thu Thu Tun
*/

static boolean compare_centroids(double [][] c1, double[][] c2, int k, int maxNum){
for (int idx = 0; idx < k; idx++)
for (int idx1 = 0; idx1 < maxNum; idx1++)
if (c1[idx][idx1] != c2[idx][idx1])
return false;

System.out.println("temp centroid");
for (int idx = 0; idx < k; idx++)
{
System.out.println("temp centroid point " + idx);
for (int idx1 = 0; idx1 < maxNum; idx1++)
{
System.out.print(idx1+". ");
System.out.println(c1[idx][idx1]);
System.out.print(" ");

}
System.out.println();
}

return true;
}

/**
* Read SVMLight data file into Vector data structure
*
* @param fname
* @return List = a list of feature vectors = [ [ label, [id, value], [id, value],...], .... ]
*     a feature vector = [ label, [id, value], [id, value],...]
*     label = int
*     id = double
*     value = double
*/
private static List readSVMLightFile(String fname) {
try{
List featureV = new Vector();
// Reading input by lines:
BufferedReader in = new BufferedReader( new FileReader(fname) );
String str = new String("");
while ((str = in.readLine()) != null){
str = str.trim();
if(str.length() > 0){
String line[];
String fs[];
line = str.split("\\s",2);
String clabel = line.trim().replace("+", "");
fs = line.trim().split("\\s");   // feature list #:v ...
List fl = new Vector();
fl.add(Integer.parseInt(clabel));
for(int idx = 0; idx < fs.length; idx++)
{
double f[] = {0,0};
String attr[] = fs[idx].split(":");
int aid = Integer.valueOf(attr.trim());
if( aid > maxNum) maxNum = aid;
f = aid;                             // attribute id
f = Double.valueOf(attr.trim());    // attribute value
fl.add(f);
}
featureV.add(fl);
}
}
in.close();
return featureV;
}
catch(IOException e){
System.out.println("Error reading file: " + e);
return null;
}
}//end of read file

/**
* Calculate Euclidean distance metric between two sparse feature vectors
* @param i   feature vector = [ label, [id, value], [id, value],...]
* @param j   feature vector = [ label, [id, value], [id, value],...]
* modified by Thu Thu Tun
* @return
*/
static double distanceMetric(List fv1, double[] fv2){
double[] fv = new double[maxNum];
for(int idx = 0; idx < maxNum; idx++){
fv[idx] = 0;
}
Iterator itr = fv1.iterator();
int k = 0;
while(itr.hasNext())
{
if( k > 0){
double f[] = (double[])itr.next();
fv[(int)f-1] = f;
}
else
itr.next();
k++;
}

for (int idx = 0; idx < maxNum; idx++)
{
fv[idx] -= fv2[idx];
}
double distance = 0;
for(int idx = 0; idx < maxNum; idx++)
{
distance += Double.valueOf(d2.format(fv[idx]*fv[idx]));
}

return Double.valueOf(d2.format(Math.sqrt(distance)));
}

/**
* Retrive the attribute value
* @param attributes
* @param data_index   0 based: 0,...,dataSize-1
* @param attribute_id 1 based: 1,...,dimension
* @return
*/
static double getAttributeValue(List attributes, int data_index, int attribute_id)
{
List fv = (List)attributes.get(data_index);  // feature vector

Iterator itr = fv.iterator();
int k = 0;
while(itr.hasNext())
{
if( k > 0)
{
double f[] = (double[])itr.next();
if( attribute_id == (int)f )
{
return f;
}
}
else
itr.next();
k++;
}

return 0;
}

/**
* Retrive the class label
* @param attributes
* @param data_index   0 based: 0,...,dataSize-1
* @return
*/
static int getClassLabel(List attributes, int data_index)
{
List fv = (List)attributes.get(data_index);  // feature vector
return (Integer)fv.get(0);
}
}```  Reply With Quote

2. ## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

This code uses packages, I want to know a better way to code KMeans without involving packages.
What are you referring to? The package statement or the import statements?

how to convert textual data to data
Can you show an example of what conversion you are talking about?
The wrapper classes: Integer and Double have parse methods that will convert String data to numeric primitive data like int and double.  Reply With Quote

3. Banned Join Date
Dec 2011
Posts
143
Rep Power
0

## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

I think the first thing you need to sort out is your data structure.

Instead of using a Vector of Vectors with heterogeneous elements, have you considered using HashMap<>?

HashMap has key value pairs and you can search it by key, iterate through its values and check if a key or value is in the map.

You can use HashMap<>() for featureV with clabel as the key.

Then the corresponding value in featureV would be another HashMap, with key: aid (attribute id) and value: attribute value.

That way, you can sort, iterate and query your data cleanly without having to keep track of heterogeneous data.

You can find the specification for HashMap here.

It can help make your data structure properly reflect the semantics of the data.
Last edited by 2by4; 12-18-2011 at 02:29 PM.  Reply With Quote

4. Member Join Date
Dec 2011
Posts
7
Rep Power
0

## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

@ Norm: I am referring to the package statement not the import statements.

@2by4: Can HashMap be used for KMeans? I have never seen it being used. KMeans is a clustering algorithm. I have adapted an online example to create my KMeans algo but I want to adapt it further thus I need help or auggestions on having the code revamped.

Both: For clustering textual data, I am referring to clustering a list of words in a data set where the words get converted to values which define their cosine similarity that can be used to place the words in clusters. But first most important for me is revamping the code above please suggest how I can proceed with it. Any coding examples would be helpful.  Reply With Quote

5. Banned Join Date
Dec 2011
Posts
143
Rep Power
0

## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED Originally Posted by Suzanne M Java Code:
```/**
* Retrive the attribute value
* @param attributes
* @param data_index   0 based: 0,...,dataSize-1
* @param attribute_id 1 based: 1,...,dimension
* @return
*/
static double getAttributeValue(List attributes, int data_index, int attribute_id)
{
List fv = (List)attributes.get(data_index);  // feature vector

Iterator itr = fv.iterator();
int k = 0;
while(itr.hasNext())
{
if( k > 0)
{
double f[] = (double[])itr.next();
if( attribute_id == (int)f )
{
return f;
}
}
else
itr.next();
k++;
}

return 0;
}```
I don't know KMeans. I am only going on what I understand of your data structure.

fv appears to be a list whose first item is a name (which gets skipped in the above method), followed by a series of 2 element arrays. This is very inefficient.

It would be better to declare fv as

Map<Integer, Double> fv = new HashMap<>();

It would be populated with attribute_id, attribute_value pairs, using the fv.put(attribute_id, attribute_value) method.

Then fv.get(attribute_id) can replace that method quoted above. And it would be much faster because HashMap uses a hashing table. And it is just one line of code.

fv.remove(attribute_id) and fv.put(attribute_id, attribute_value) can be used to move words from one cluster to another, etc.

attributes variable would be declared as follows

Map<String, Map<Integer, Double>> attributes = new HashMap<>();

It would be populated with cname, fv pairs.

attributes.get(cname) would get you any cluster by cname.

for(String c : attributes.keySet()) would iterate over the cnames

for(Map<Integer, Double> m : attributes.values()) would iterate over the fv

(Next step would be to move the fv map into an object which implements the interface Comparable and has a method calculateMetric())
Last edited by 2by4; 12-18-2011 at 09:46 PM.  Reply With Quote

6. Moderator  Join Date
Jul 2010
Location
California
Posts
1,641
Rep Power
12

## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED  Reply With Quote

7. Moderator   Join Date
Feb 2009
Location
New Zealand
Posts
4,717
Rep Power
17

## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED  Reply With Quote

8. Member Join Date
Dec 2011
Posts
7
Rep Power
0

## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

@2by4: Thanks a lot for all your suggestions, that's what I was looking for. Will implement them and see what output the code gives.

@doWhile & pbrockway2: I apologize for cross-posting but I was running out of time and wanted to get as many ideas as possible so that I can resolve my problem fast.  Reply With Quote

9. ## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

how to convert list to map.
Can you explain why you need to use a Map?
How would you get/generate a key value for the items in the list when you put them into a map?
Given that key, would it be useful for getting the desired value out of the map?  Reply With Quote

10. Member Join Date
Dec 2011
Posts
7
Rep Power
0

## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

The accuracy of the outcome when using maps is greater. I was using lists till now but I realized lists were takine me nowhere, then this friend said she implemented maps & got a very accurate outcome. Can someone explain to me how to convert Lists to maps? The steps please for my given code. I am confused.  Reply With Quote

11. ## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

How would you get/generate a key value for the items in the list when you put them into a map?
Given that key, would it be useful for getting the desired value out of the map?  Reply With Quote

12. Member Join Date
Dec 2011
Posts
7
Rep Power
0

## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

As of now, I don't know how to implement in maps, whatever I have done for lists hence am seeking someone's help to tell me how to do the conversion. If I myself don't know how to do the conversion, how can I discuss key value generation? Please someone tell me how to do the conversion, its an urgent need. Thanks.  Reply With Quote

13. ## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

I shouldn't know how to convert a list to a map because a list stores objects and uses indexes to find them, while maps store tuples (key, value), e.g. how should a list {A, B, C} be converted to a map? As { (A, 0), (B, 1), (C, 3) } or as { (0, A), (1, B), (2, C) } or something completely different?

kind regards,

Jos  Reply With Quote

14. ## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

Why do you want your data to be in a Map? How will you access the data once it is in a Map?  Reply With Quote

15. Member Join Date
Dec 2011
Posts
7
Rep Power
0

## Re: How to convert lists to maps???

I want to put the data in a map bcose someone else implemented the code in maps and got very accurate results. I as of now don't know how to access the key value. Can someone show me how to do the conversion to maps first? I feel that I am going in circles here. Anyone with knowledge pls help.  Reply With Quote

16. ## Re: How to convert lists to maps??? Originally Posted by Suzanne M I want to put the data in a map bcose someone else implemented the code in maps and got very accurate results. I as of now don't know how to access the key value. Can someone show me how to do the conversion to maps first? I feel that I am going in circles here. Anyone with knowledge pls help.
Have you read my previous reply #13? Nobody knows what the key and/or value objects are supposed to be.

kind regards,

Jos  Reply With Quote

17. ## Re: Stuck with KMeans Clustering Algorithm DESPERATE HELP NEEDED

someone else implemented the code in maps and got very accurate results
Ask the "someone" what they did.

how to do the conversion to maps
One very simple way would be to use an index number for the key. Completely useless, but that would put the data into a Map;

Define theIndex as:
int theIndex = <STARTINGVALUE>;

theMap.put(new Integer(theIndex), theData); // Put the data into the Map with theIndex as key

theIndex++; // increment theIndex to generate a unique new key  Reply With Quote

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•