I've been working on a commuting application called jCommute for the past few days. The aim of our project is to display the various bus/metro routes between the source and destination entered by the user.
I've started the programming. So far it's working alright but for this one niggle.
I don't have a well-defined database of the various bus routes and the intermediate bus stops. I'm using an Excel spreadsheet which I stumbled across on the transport corporation's site. It has all of the aforesaid, but the data in it is neither well defined nor consistent.
For instance, consider a place called "Connought Place" (I'm in Delhi, India). In some bus routes it's mentioned as "Cnt. Place", in others it's "Conought Place", and in some others it's simply "C.P". I've made two JComboBoxes for the source and destination, and a friend of mine helped me extract all the possible locations on Delhi's bus routes from the Excel file using VBScript and populate the combo boxes with them. However, this has been the major roadblock of the project. At the time of selection, the user doesn't know whether he has to enter "C.P" or "Cnt. Place" or "Connought Place" as the source or destination. If he enters, say, "C.P" as the source and, say, "New delhi railway station" (which itself has 15 different spellings available) as the destination, he's unable to find the bus route because the desired bus route has "Connought Place" as the source station in the Excel file.
Is there any algorithm which disregards these minor spelling mistakes and treats all these various forms of the same place as equal?
If there isn't, what other alternatives do I have?
At present, I'm using the "contains()" function to match the source/destination entered by the user with the entries for these places in my Excel spreadsheet. I've used the JExcel API to read the spreadsheet.
No, there's no artificial intelligence algorithm that can compensate for human stupidity ;) Minor spelling mistakes (like your Connought Place for Connaught Place) may be detectable using a spell check algorithm, but that would also be likely to treat Greater Kailash I and Greater Kailash II as the same.
My suggestion is to extract the spreadsheet data into a relational database (You can use JavaDB which comes free with the JDK) and manually normalize the data.
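To make the manual normalization idea concrete, here is a minimal sketch, assuming the curated data boils down to an alias table that maps every spelling found in the spreadsheet to one canonical name. The class `StopNormalizer` and its methods are hypothetical names made up for illustration, and an in-memory map stands in for the database table.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps every known spelling of a stop to one canonical name.
// In a real setup the pairs would be loaded from a hand-curated alias table.
final class StopNormalizer {
    private final Map<String, String> aliasToCanonical = new HashMap<>();

    // Register a canonical name together with the variants seen in the spreadsheet.
    void register(String canonical, String... variants) {
        aliasToCanonical.put(key(canonical), canonical);
        for (String variant : variants) {
            aliasToCanonical.put(key(variant), canonical);
        }
    }

    // Returns the canonical name, or null when the spelling has not been curated yet.
    String canonicalOf(String raw) {
        return aliasToCanonical.get(key(raw));
    }

    // Lower-case and keep only letters and digits, so "C.P" and "c p" share one key.
    private static String key(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    public static void main(String[] args) {
        StopNormalizer n = new StopNormalizer();
        n.register("Connaught Place", "Cnt. Place", "Conought Place", "C.P");
        System.out.println(n.canonicalOf("c.p."));
    }
}
```

The combo boxes would then list only canonical names, and route lookups would run on canonical names as well. Spellings that return null are the ones still waiting for a human decision.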
Using the String#contains(...) method has the same drawback in respect of GK-I and GK-II and other places whose names already contain a sequence of characters that forms a valid name of another place.
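A small illustration of that drawback, using only `String#contains` and `String#equalsIgnoreCase` from the standard library. The class and method names are made up for the example.

```java
// Illustrates why substring matching confuses stops whose names nest inside each other.
final class ContainsPitfall {
    // Naive matching as described in the question: substring test on the stop name.
    static boolean naiveMatch(String stopName, String userInput) {
        return stopName.contains(userInput);
    }

    // Stricter alternative: whole-name match, ignoring case and surrounding whitespace.
    static boolean exactMatch(String stopName, String userInput) {
        return stopName.trim().equalsIgnoreCase(userInput.trim());
    }

    public static void main(String[] args) {
        // The user asked for GK-I, but the GK-II entry also "contains" that text.
        System.out.println(naiveMatch("Greater Kailash II", "Greater Kailash I")); // true
        System.out.println(exactMatch("Greater Kailash II", "Greater Kailash I")); // false
    }
}
```

Exact matching only helps once the names themselves have been normalized, which is why the data clean-up has to come first.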
It's rather obvious that this is an academic exercise, so you probably shouldn't be bound to real-world data that was never intended to be used in the way you're using it. In that context, I would just reduce the data to a smaller subset and normalize it in situ.
Thanks a lot. Inter alia, one thing which I've certainly learned is that it's Connaught Place and not Connought Place :D
Originally Posted by Darryl.Burke
You know - I'm just thinking aloud because I have no clue about how it would be done - we recognise "Connaught" vs "Connought" as a possible error because of the lexical similarity. It is relatively easy to quantify this (number of letter differences or whatever) and on that basis at least flag it as a possible error.
(Then we can defer to the knowledgeable and careful likes of db.)
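One common way to count "letter differences" is the Levenshtein edit distance, i.e. the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. Nobody in this thread named a specific measure, so treat the choice as an assumption. A minimal sketch with a made-up class name:

```java
// Levenshtein edit distance, computed with two rolling rows of the DP table.
final class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Connaught", "Connought"));                  // 1
        System.out.println(levenshtein("Greater Kailash I", "Greater Kailash II")); // 1
    }
}
```

Both pairs come out at distance 1, so a threshold on this number alone can only flag candidates for a human to review. It cannot tell a typo from a genuinely different place.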
But as Darryl points out this is going to flag a lot of things as possible equivalents when they aren't: the GK effect. However things might be better if more information were used. I'm thinking particularly of topological characteristics of a given entry. There is little chance of confusing two "Main St"s despite identity of their lexical form because they join (or for places are adjacent to) quite different and altogether distinct places.
So I'm wondering if you could quantify these topological differences somehow. And on that basis more accurately identify possible equivalents.
(I've just realised that a recent youtube video I saw might have suggested this: it was about people taking maps and performing automatic registration - ie map to geolocation - based on the topology of street intersections.)
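One way the topological idea might be quantified, purely as a sketch: for each spelling, collect the set of stops that appear next to it on any route, then compare two spellings by the overlap of those neighbour sets (Jaccard similarity, intersection size divided by union size). The measure, the class name and the placeholder stop names are all my assumptions, not something taken from the spreadsheet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Compares two stop spellings by how many adjacent stops they share.
final class NeighbourSimilarity {
    // Jaccard similarity: |A intersect B| / |A union B|, defined here as 0.0 for two empty sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Placeholder neighbour sets; real ones would be built from the route data.
        Set<String> spellingOne = new HashSet<>(Arrays.asList("Stop A", "Stop B", "Stop C"));
        Set<String> spellingTwo = new HashSet<>(Arrays.asList("Stop A", "Stop B", "Stop D"));
        System.out.println(jaccard(spellingOne, spellingTwo)); // 0.5
    }
}
```

Two spellings with a small letter difference and a high neighbour overlap are likely the same place. Note that stops which really are next to each other, as GK-I and GK-II presumably are, may still share many neighbours, so this would narrow the candidates rather than settle them. The neighbour names would also need the same clean-up before the comparison means much.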