How to automatically classify data in a large database
I am supposed to take data from a Wikipedia dump, a Freebase dump, or DBpedia.
I am then supposed to write code that outputs what every datum in that database is, e.g. the name of a person or a business, an address, and so on. It does not matter what language I write the code in, but I'm only familiar with C, C++, Java and Python. Java is my preferred language.
Those databases contain all types of data: titles, person names, addresses, social security numbers, phone numbers...
I have three questions:
1) Since I have used machine learning a lot, I have decided to use a machine learning approach.
I have started looking into WEKA, a Java machine learning toolbox. However, it is only available under the GPL license. Is there another toolbox that I can use in a commercial product?
2) The problem I am facing with a machine learning approach is that I don't know what features to use. All I can think of right now is: the length of the datum, the number of alphabetic characters it has, and the number of digit characters it has.
That is very few features given all the types of data those databases contain. Regular expressions do not seem to be a solution for this type of project.
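To make the three features above concrete, here is a minimal sketch of how I would compute them. The class name DatumFeatures and the method extract are placeholders I made up for illustration, and I am assuming each datum arrives as a plain Java String:

```java
import java.util.Arrays;

public class DatumFeatures {

    /**
     * Returns {length, letterCount, digitCount} for one datum.
     * Hypothetical helper: the feature set is only the three
     * features listed in the question, nothing more.
     */
    public static int[] extract(String datum) {
        int letters = 0;
        int digits = 0;
        for (int i = 0; i < datum.length(); i++) {
            char c = datum.charAt(i);
            if (Character.isLetter(c)) {
                letters++;
            } else if (Character.isDigit(c)) {
                digits++;
            }
        }
        return new int[] { datum.length(), letters, digits };
    }

    public static void main(String[] args) {
        // Made-up sample values, not taken from any of the dumps.
        String[] samples = { "John Smith", "555-123-4567", "12 Main Street" };
        for (String s : samples) {
            System.out.println(s + " -> " + Arrays.toString(extract(s)));
        }
    }
}
```

Each datum becomes a vector of three integers, which is what I would feed to a classifier. A street address and a short title can easily produce near-identical vectors, which is why I think these features alone are not enough.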
3) Is there another approach I can use? I mean, is machine learning the only approach?
Thank you for your help.