Thread: "Similarity" algorithm
- 10-14-2010, 06:34 AM #1
Member
- Join Date
- Oct 2010
- Location
- New Delhi
- Posts
- 2
- Rep Power
- 0
"Similarity" algorithm
Hola!
I've been working on a commuting application called jCommute for the past few days. The aim of our project is to display the various bus/metro routes between the source and destination entered by the user.
I've started the programming. So far it's working alright but for this one niggle.
I don't have a well-defined database of the various bus routes and the intermediate bus stops. I'm using an Excel spreadsheet that I stumbled across on the transport corporation's site. It has all of the above, but the data is not well defined or consistent.
For instance, consider a place called "Connought Place" (I'm in Delhi, India). In some bus routes it's mentioned as "Cnt. Place", in others as "Conought Place", and in some others simply as "C.P". I've made two JComboBoxes for the source and destination, and a friend of mine helped me extract all the possible locations on Delhi's bus routes from the Excel file using VBScript and populate the combo boxes with them.
However, this has been the major roadblock of the project. At the time of selection, the user doesn't know whether to enter "C.P", "Cnt. Place" or "Connought Place" as the source or destination. If he enters, say, "C.P" as the source and, say, "New delhi railway station" (which itself has 15 different spellings available) as the destination, he's unable to find the bus route because the desired bus route has "Connought Place" as the source station in the Excel file.
Is there any algorithm that disregards these minor spelling mistakes and treats all these various forms of the same place as equal?
If there isn't, what other alternatives do I have?
At present, I'm using the "contains()" method to match the source/destination entered by the user against the entries for these places in my Excel spreadsheet. I've used the JExcel API to read the spreadsheet.
Please help!!
Last edited by maverickv; 10-14-2010 at 06:40 AM.
- 10-14-2010, 07:03 AM #2
No, there's no artificial intelligence algorithm that can compensate for human stupidity ;) Minor spelling mistakes (like your Connought Place for Connaught Place) may be detectable using a spell check algorithm, but that would also be likely to treat Greater Kailash I and Greater Kailash II as the same.
My suggestion is to extract the spreadsheet data into a relational database (You can use JavaDB which comes free with the JDK) and manually normalize the data.
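One way to picture the result of that manual normalization is an alias table: every spelling found in the spreadsheet points to a single canonical name. The sketch below keeps the table in memory instead of in JavaDB so that it stays self-contained. The class `StopAliases` and its methods are hypothetical names made up for illustration, and the aliases are the ones quoted in the first post.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps every reviewed spelling to one canonical stop name.
final class StopAliases {
    private final Map<String, String> canonicalByAlias = new HashMap<>();

    // Register a canonical name together with the variants identified by hand.
    void register(String canonical, String... variants) {
        canonicalByAlias.put(key(canonical), canonical);
        for (String variant : variants) {
            canonicalByAlias.put(key(variant), canonical);
        }
    }

    // Returns the canonical name, or null when the spelling has not been reviewed yet.
    String canonicalOf(String raw) {
        return canonicalByAlias.get(key(raw));
    }

    // Lookup key: lower-case, with everything except letters and digits removed,
    // so "C.P", "c.p." and "CP" all collapse to "cp".
    private static String key(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    public static void main(String[] args) {
        StopAliases aliases = new StopAliases();
        aliases.register("Connaught Place",
                "Connought Place", "Conought Place", "Cnt. Place", "C.P");
        System.out.println(aliases.canonicalOf("c.p."));              // Connaught Place
        System.out.println(aliases.canonicalOf("Greater Kailash I")); // null (not registered)
    }
}
```

In a relational database the same idea would be a two-column table (alias, canonical name), and the combo boxes would list only the canonical names.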
Using the String#contains(...) method has the same drawback in respect of GK-I and GK-II and other places whose names already contain a sequence of characters that forms a valid name of another place.
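A small demonstration of that drawback, using the two place names mentioned above. The helper `sameStop` is a hypothetical name; it assumes the names have already been normalized, so that an exact, case-insensitive comparison is enough.

```java
final class ContainsPitfall {
    // Exact comparison after trimming and ignoring case.
    // Assumes both names were normalized beforehand.
    static boolean sameStop(String a, String b) {
        return a.trim().equalsIgnoreCase(b.trim());
    }

    public static void main(String[] args) {
        String entry = "Greater Kailash II"; // name stored in the spreadsheet
        String query = "Greater Kailash I";  // name chosen by the user

        // contains() reports a match because the query is a prefix of the entry.
        System.out.println(entry.contains(query));  // true

        // An exact comparison keeps the two places apart.
        System.out.println(sameStop(entry, query)); // false
    }
}
```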
It's rather obvious that this is an academic exercise, so you probably shouldn't be bound to real-world data that was never intended to be used in the way you're using it. In that context, I would just reduce the data to a smaller subset and normalize it in situ.
db
- 10-14-2010, 07:23 AM #3
Member
- Join Date
- Oct 2010
- Location
- New Delhi
- Posts
- 2
- Rep Power
- 0
- 10-14-2010, 08:29 AM #4
Moderator
- Join Date
- Feb 2009
- Location
- New Zealand
- Posts
- 4,544
- Rep Power
- 11
You know - I'm just thinking aloud because I have no clue about how it would be done - we recognise "Connaught" vs "Connought" as a possible error because of the lexical similarity. It is relatively easy to quantify this (number of letter differences or whatever) and on that basis at least flag it as a possible error.
(Then we can defer to the knowledgeable and careful likes of db.)
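One common way to count "letter differences" is the Levenshtein edit distance, the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. This is only one possible measure, picked here as an assumption, and the class name `EditDistance` is made up for the sketch.

```java
final class EditDistance {
    // Levenshtein distance, computed row by row with two arrays.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Connaught", "Connought"));                  // 1
        System.out.println(levenshtein("Greater Kailash I", "Greater Kailash II")); // 1
    }
}
```

Both pairs come out at distance 1, so a threshold on edit distance alone would flag the genuine misspelling and the two distinct Greater Kailash entries alike. It can suggest candidates for a human to review, but it cannot decide on its own.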
But, as Darryl points out, this is going to flag a lot of things as possible equivalents when they aren't: the GK effect. However, things might be better if more information were used. I'm thinking particularly of topological characteristics of a given entry. There is little chance of confusing two "Main St"s, despite the identity of their lexical form, because they join (or, for places, are adjacent to) quite different and altogether distinct places.
So I'm wondering if you could quantify these topological differences somehow, and on that basis more accurately identify possible equivalents.
(I've just realised that a recent YouTube video I saw might have suggested this: it was about people taking maps and performing automatic registration - i.e. map to geolocation - based on the topology of street intersections.)
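A very rough sketch of how such a topological measure might look, purely as an assumption: treat each route as an ordered list of stops, collect for every name the set of stops that appear directly before or after it, and compare two names by the overlap of those sets (Jaccard similarity, intersection size divided by union size). The class `NeighbourSimilarity` is hypothetical, and the routes in `main` are invented placeholders, not real Delhi data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NeighbourSimilarity {
    // For every stop name, collect the names that directly precede or follow it on any route.
    static Map<String, Set<String>> neighbours(List<List<String>> routes) {
        Map<String, Set<String>> result = new HashMap<>();
        for (List<String> route : routes) {
            for (int i = 0; i + 1 < route.size(); i++) {
                String a = route.get(i);
                String b = route.get(i + 1);
                result.computeIfAbsent(a, k -> new HashSet<>()).add(b);
                result.computeIfAbsent(b, k -> new HashSet<>()).add(a);
            }
        }
        return result;
    }

    // Jaccard similarity of two sets; defined here as 0.0 when both are empty.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Invented routes for illustration only.
        List<List<String>> routes = Arrays.asList(
                Arrays.asList("Stop A", "Connaught Place", "Stop B"),
                Arrays.asList("Stop A", "Connought Place", "Stop B"),
                Arrays.asList("Stop C", "Greater Kailash I", "Stop D"),
                Arrays.asList("Stop E", "Greater Kailash II", "Stop F"));
        Map<String, Set<String>> n = neighbours(routes);
        System.out.println(jaccard(n.get("Connaught Place"), n.get("Connought Place")));      // 1.0
        System.out.println(jaccard(n.get("Greater Kailash I"), n.get("Greater Kailash II"))); // 0.0
    }
}
```

In this toy data the misspelled pair shares all its neighbours while the two Greater Kailash entries share none. Real data would be messier: the neighbouring names are themselves inconsistently spelled, and places that really are next to each other may share neighbours, so a score like this could at best rank candidates for manual review alongside a lexical measure.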