For any given page, use an HTML parser to parse the document and process it however you see fit. This lets you retrieve all the links on the page, among other things. Apache also has some really nice libraries for the HTTP side in its HttpComponents subproject.
HTML Parser - HTML Parser
HttpComponents - HttpComponents Overview
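As a rough illustration of retrieving links with a parser, here is a minimal sketch. It assumes the HTML parser that ships with the JDK (`javax.swing.text.html`) rather than the HTML Parser library linked above, only so that it runs without extra dependencies. The class name `LinkExtractor` and the method `extractLinks` are made up for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; not part of any library mentioned above.
class LinkExtractor {

    // Returns the href value of every <a> start tag, in document order.
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();

        // The parser calls back for each start tag it encounters.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample page, for illustration only.
        String page = "<html><body>"
                + "<a href=\"http://example.com/a\">A</a>"
                + "<p>text <a href=\"/b\">B</a></p>"
                + "</body></html>";
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```

In a real crawler you would pass in the body of a fetched page (for example, one downloaded with HttpComponents) instead of a hard-coded string, and resolve relative links such as `/b` against the page's URL.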
Also, if you decide not to use Java for the task, I would suggest Python.