My Java Tips

Fetching HTML content of a Web Page

Posted 11-10-2011 at 04:46 PM
Sometimes you need to fetch and store data from web pages. When there are many pages to process, doing this by hand is not practical. Java's standard library lets you download the text of a web page from your own program.


The approach is simple. You fetch the complete HTML of a web page and then write your own parser to extract the information you need. For example, you might be asked to store only the text in the table data (td) tags of a table with the caption Hobbies. In that case you store the full HTML of the page in a buffer and then parse the buffer for the Hobbies table.
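As a rough sketch of such a parser, assuming the page HTML is already held in a String, the td text of a table captioned Hobbies could be pulled out with regular expressions. The class name HobbiesExtractor, the method extractCells and the sample HTML are made up for illustration. Regular expressions are fragile on real-world HTML (nested tables, for instance, are not handled), so treat this as a starting point rather than a robust parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and sample data are illustrative assumptions.
class HobbiesExtractor {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    // Non-greedy match of one <table>...</table> block (nested tables are not handled).
    private static final Pattern TABLE = Pattern.compile("<table\\b[^>]*>(.*?)</table>", FLAGS);
    private static final Pattern CAPTION = Pattern.compile("<caption\\b[^>]*>(.*?)</caption>", FLAGS);
    private static final Pattern CELL = Pattern.compile("<td\\b[^>]*>(.*?)</td>", FLAGS);

    // Returns the text of every <td> inside tables whose caption equals the given one.
    static List<String> extractCells(String html, String caption) {
        List<String> cells = new ArrayList<>();
        Matcher table = TABLE.matcher(html);
        while (table.find()) {
            String body = table.group(1);
            Matcher cap = CAPTION.matcher(body);
            if (cap.find() && stripTags(cap.group(1)).equalsIgnoreCase(caption)) {
                Matcher cell = CELL.matcher(body);
                while (cell.find()) {
                    cells.add(stripTags(cell.group(1)));
                }
            }
        }
        return cells;
    }

    // Removes any markup left inside a cell or caption and trims whitespace.
    private static String stripTags(String s) {
        return s.replaceAll("<[^>]*>", "").trim();
    }

    public static void main(String[] args) {
        String html = "<table><caption>Hobbies</caption>"
                + "<tr><td>Chess</td><td> <b>Hiking</b> </td></tr></table>"
                + "<table><caption>Pets</caption><tr><td>Cat</td></tr></table>";
        System.out.println(extractCells(html, "Hobbies")); // prints [Chess, Hiking]
    }
}
```

Only the first table is picked up because its caption matches; the Pets table is skipped.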

Now let's see how to get the HTML contents of a web page. The java.net package provides the classes we need. We will use the following two:
- URLConnection
- URL

First, create a URL object with the address of the page whose HTML you want. Then call the openConnection method of the URL object to obtain a URLConnection object. The input stream returned by the connection's getInputStream method is wrapped in a DataInputStream, which is in turn wrapped in an InputStreamReader and a BufferedReader. Finally, the contents are read line by line with the readLine method of the BufferedReader. The DataInputStream layer is optional, because InputStreamReader accepts any InputStream directly.

Java Code:
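import java.io.*;
import java.net.*;
public class FetchHtml {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.java-forums.org/faq.php");
        URLConnection conn = url.openConnection();
        DataInputStream in = new DataInputStream(conn.getInputStream());
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader d = new BufferedReader(new InputStreamReader(in))) {
            String line;
            // readLine() returns null at end of stream; ready() can return false before the whole page has arrived
            while ((line = d.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}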
The output is the HTML code of faq.php. Although the page is written in PHP, the PHP code runs on the server, so our application never sees it. We can only request and receive the HTML that the server generates.
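If you want to keep the page in a buffer for later parsing instead of printing it, something along these lines may help. It is only a sketch: the class name PageFetcher, the methods readAll and fetch, the 5-second timeouts and the choice of UTF-8 are all assumptions made for the example, and a real page may use a different encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: names, timeouts and charset are illustrative assumptions.
class PageFetcher {
    // Reads the whole stream into one String, decoding bytes with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder buffer = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    // Opens the address with timeouts so a slow server cannot block forever,
    // then returns the page source decoded as UTF-8 (an assumption).
    static String fetch(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        conn.setConnectTimeout(5000); // milliseconds, assumed value
        conn.setReadTimeout(5000);    // milliseconds, assumed value
        return readAll(conn.getInputStream(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Offline demonstration of readAll using an in-memory stream.
        byte[] bytes = "<html>\n<body>Hi</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
        // To fetch a real page, call fetch("http://www.java-forums.org/faq.php") and handle IOException.
    }
}
```

The String returned by fetch can then be handed to whatever parser you write for the data you are after.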
