HTML scraping: site loads wrong with Jsoup (Java)
I'm trying to run a script that pulls information from a site, but the HTML my program prints is not the same as what the actual website shows.
Some examples of what is missing are the opening !doctype and the companies' info under "Manufacturing Companies in Minnesota (MN)". (If you view source in a browser, it shows what the HTML should look like, versus what the code below actually returns.)
I'm not sure whether JavaScript is part of the issue. I tried turning it off in my browser and the page still worked, but I also noticed there is a lot of JavaScript in it. No login is required for the website. Maybe cookies? (I don't know much about cookies.)
Code:
// Requires: import org.jsoup.Jsoup; import org.jsoup.nodes.Document;
String keyword = "http://www.manta.com/mb_43_E7_24/manufacturing/minnesota.php";
Document doc = Jsoup.connect(keyword)
        .referrer("http://www.google.com")
        .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
        .get();
System.out.println(doc.toString());
Above is the code I'm using.
Any ideas why it fails to load the page the way my browser does? At one point I had it working, but I accidentally broke it.
And if this approach is not a reasonable way to pull information from a website, can you recommend an alternative?
I put some more work into it and found that it works for http://www.manta.com/ but not if I add the suffix /mb_43_E7_24/manufacturing/minnesota.php
Is the suffix involved in any way?
Or might it be the site temporarily banning me for too many requests?
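One way to test the banning theory is to look at the HTTP status code the server sends back for each URL. The sketch below uses only the JDK. `StatusCheck`, `describe` and `fetchStatus` are hypothetical names made up for illustration, and reading 403 or 429 as "blocked" is a common convention rather than anything confirmed for this particular site.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper for illustration; not part of Jsoup or the original post.
class StatusCheck {

    // Rough, illustrative interpretation of an HTTP status code.
    static String describe(int status) {
        if (status >= 200 && status < 300) {
            return "success: the server returned the page";
        } else if (status >= 300 && status < 400) {
            return "redirect: the page lives at another URL";
        } else if (status == 403 || status == 429) {
            // Assumption: many sites use these codes to refuse or rate limit scrapers.
            return "blocked: the server may be refusing or rate limiting this client";
        } else if (status >= 400 && status < 500) {
            return "client error: check the URL and request headers";
        } else if (status >= 500 && status < 600) {
            return "server error: the site itself failed";
        }
        return "unexpected status";
    }

    // Sends a GET request with the given User-Agent and returns the status code.
    static int fetchStatus(String address, String userAgent) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(address).openConnection();
        con.setRequestProperty("User-Agent", userAgent);
        con.setInstanceFollowRedirects(false); // report redirects instead of following them
        try {
            return con.getResponseCode();
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java StatusCheck.java <url>");
            return;
        }
        int status = fetchStatus(args[0], "Mozilla/5.0");
        System.out.println(status + " -> " + describe(status));
    }
}
```

Running it once against http://www.manta.com/ and once against the full URL with the suffix would show whether the two requests are treated differently by the server.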
Re: HTML scraping: site loads wrong with Jsoup (Java)
Try fetching the page content with a plain URLConnection and BufferedReader. See whether you get all of the page's contents that way, then compare it with what you get from Jsoup.
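// Requires: java.io.BufferedReader, java.io.InputStreamReader, java.net.HttpURLConnection, java.net.URL
HttpURLConnection urlcon = (HttpURLConnection) new URL("http://apache.org").openConnection();
// Read the raw response line by line and print it unchanged
try (BufferedReader in = new BufferedReader(new InputStreamReader(urlcon.getInputStream()))) {
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
}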
If you get all of the expected contents this way, then the issue is most likely with Jsoup, and you can try asking your question on the Jsoup forum.
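To make the comparison less tedious, something like the following could report the first line where the two outputs diverge. This is only a sketch: `PageDiff` and `firstDifferentLine` are made-up names, and the sample strings in `main` are invented stand-ins for the raw response and the Jsoup output. Keep in mind that Jsoup re-serializes the parsed document, so some formatting differences are expected even when nothing is missing. The helper ignores leading and trailing whitespace for that reason.

```java
// Hypothetical helper for illustration; not part of Jsoup or the JDK.
class PageDiff {

    // Returns the 1-based number of the first line where the two texts differ,
    // ignoring leading/trailing whitespace on each line, or -1 if none differ.
    static int firstDifferentLine(String a, String b) {
        String[] la = a.split("\\R", -1);
        String[] lb = b.split("\\R", -1);
        int common = Math.min(la.length, lb.length);
        for (int i = 0; i < common; i++) {
            if (!la[i].trim().equals(lb[i].trim())) {
                return i + 1;
            }
        }
        // All shared lines match; any difference is in the extra lines of the longer text.
        return la.length == lb.length ? -1 : common + 1;
    }

    public static void main(String[] args) {
        // Invented sample data standing in for the two fetched pages.
        String raw = "<html>\n<body>\n<p>Companies</p>\n</body>\n</html>";
        String parsed = "<html>\n<body>\n</body>\n</html>";
        System.out.println(firstDifferentLine(raw, parsed)); // prints 3
    }
}
```

In practice the two arguments would be the text collected from the BufferedReader loop and the string returned by doc.toString().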