how to obtain html source of this page
Hello,
This simple program correctly fetches every UTF-8 web page I have tried except this one.
When I enter its URL in the View HTTP Request and Response Header tool, the reported charset is utf-8, but my program doesn't display the characters correctly.
Can anybody help me?
Code:
package simpleapp;

public class Main {
    public static void main(String[] args) {
        net myNET = new net();
        StringBuffer str = myNET.get_content("http://old.tsetmc.com/Loader.aspx");
    }
}

package simpleapp;

import java.net.URL;
import java.net.URLConnection;
import java.net.MalformedURLException;
import java.io.InputStream;
import java.io.IOException;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class net {
    public StringBuffer get_content(String url){
        StringBuffer str = new StringBuffer();
        try{
            URL myURL = new URL(url);
            URLConnection myConnection = myURL.openConnection();
            InputStream in = myConnection.getInputStream();
            BufferedReader myStream = new BufferedReader(new InputStreamReader(in,"utf-8"));
            int ch;
            while((ch = myStream.read()) != -1){
                str.append((char)ch);
            }
            System.out.print(str);
        }
        catch(MalformedURLException e){
            e.printStackTrace();
        }
        catch(IOException e){
            e.printStackTrace();
        }
        return str;
    }
}
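Since the header tool reports the charset, the program could also read it from the Content-Type value instead of hardcoding "utf-8". The following is only a sketch: the class name ContentTypeCharset and the helper charsetFrom are made up for illustration, and the header string would come from something like myConnection.getContentType() in the program above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCharset {

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // value such as "text/html; charset=utf-8". Returns the fallback when the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetFrom("text/html; charset=utf-8", StandardCharsets.ISO_8859_1);
        System.out.println(cs.name());
    }
}
```

The returned Charset can be passed straight to the InputStreamReader constructor in place of the "utf-8" string.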
Re: how to obtain html source of this page
Where are you printing out to?
What character sets does that environment support?
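One way to check the output side is to print the JVM's default charset and then write a known non-ASCII string through a writer that is explicitly UTF-8. This is a rough sketch, not a diagnosis: the class name ConsoleEncodingCheck and the helper printUtf8 are invented here, and whether the characters show up correctly still depends on what the console itself expects.

```java
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingCheck {

    // Hypothetical helper: writes text to the given stream encoded as UTF-8,
    // regardless of the platform default charset.
    static void printUtf8(String text, OutputStream out) {
        PrintWriter writer = new PrintWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8), true);
        writer.print(text);
        writer.flush(); // print() does not auto-flush, so flush explicitly
    }

    public static void main(String[] args) {
        // System.out encodes with a platform-dependent charset by default.
        System.out.println("Default charset: " + Charset.defaultCharset());
        // Persian "salam" written with Unicode escapes so the source file
        // encoding does not matter.
        printUtf8("\u0633\u0644\u0627\u0645\n", System.out);
    }
}
```

If the second line comes out garbled while the fetched string compares equal to what is expected, the problem is in how the output is displayed rather than in how the page is read.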
Re: how to obtain html source of this page
Quote:
Originally Posted by Tolls
Where are you printing out to?
What character sets does that environment support?
I print the result in NetBeans.
Re: how to obtain html source of this page
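Actually, looking at the read loop:
Code:
while((ch = myStream.read()) != -1){
    str.append((char)ch);
}
That reads a single char at a time, not a single byte.
UTF-8 characters can be up to 4 bytes long, but the InputStreamReader has already decoded them before read() returns, so this loop on its own should not muck up non-ASCII characters.
It is still simpler to use the readLine() method of the BufferedReader instead.
If the text still looks wrong, the likely suspects are the charset the server actually sends and the encoding of the NetBeans output window.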
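To compare the two ways of reading, here is a small sketch that decodes the same UTF-8 bytes once char by char and once with readLine(). It uses an in-memory stream instead of the real URL, and the class name Utf8ReadDemo and both method names are made up for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8ReadDemo {

    // Reads one decoded char at a time, like the loop quoted above.
    static String readCharByChar(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Reads line by line; readLine() drops the line terminator,
    // so a '\n' is appended after every line.
    static String readLineByLine(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 2-byte (Arabic script), 3-byte (euro sign) and 4-byte (emoji) sequences.
        String sample = "\u0633\u0644\u0627\u0645 \u20AC \uD83D\uDE00\n";
        byte[] bytes = sample.getBytes(StandardCharsets.UTF_8);
        System.out.println(readCharByChar(new ByteArrayInputStream(bytes)).equals(sample));
        System.out.println(readLineByLine(new ByteArrayInputStream(bytes)).equals(sample));
    }
}
```

Both comparisons print true, so either loop gives the same string once the bytes go through an InputStreamReader with the right charset.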