  1. #1
    kevinn205

    Html scraping Site Loads Wrong Jsoup Java

    I'm trying to run a script to pull information from a site. However, when I compare the actual website to the page my program prints, they are not the same.

    Some examples of what is missing: the opening <!DOCTYPE> declaration and the companies' info on the Manufacturing Companies in Minnesota (MN) page. (If you view source in the browser, it will show you what the markup should look like, compared with what the code below prints.)

    I'm not sure whether JavaScript is part of the issue. I tried turning it off in my browser and the page still worked, but I also noticed there is a lot of JavaScript in it. No login is required for the website. Maybe cookies? (I don't know much about cookies.)

    Java Code:
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    String keyword = "http://www.manta.com/mb_43_E7_24/manufacturing/minnesota.php";
    Document doc = Jsoup.connect(keyword)
            .referrer("http://www.google.com")
            .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
            .get();
    System.out.println(doc.toString());
    Above is the code I'm using.

    Any ideas why it is failing to load my page the way my browser does? At one point I had it working, but I accidentally broke it.

    Also, if this approach isn't a reasonable way to pull information from a website, can you recommend a better one?

    I put some more work into it and found that it works for http://www.manta.com/ but not if I add the suffix /mb_43_E7_24/manufacturing/minnesota.php

    Is the suffix involved in any way?

    Or might it be the site temporarily banning me for too many requests?
    Last edited by kevinn205; 08-26-2012 at 03:19 AM.

  2. #2
    farrukh

    Re: Html scraping Site Loads Wrong Jsoup Java

    Try to get the page content with a plain HttpURLConnection and a BufferedReader. See whether you get all the contents of the page that way, then compare it with what you get from Jsoup.

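    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    HttpURLConnection urlcon = (HttpURLConnection) new URL("http://apache.org").openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(urlcon.getInputStream()));
    String line = null;
    // Print the raw response body line by line.
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
    in.close();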

    If you can get all the expected content this way, then the issue is likely on the Jsoup side, and you can try asking your question on the Jsoup forum.
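To make the comparison less of an eyeball job, here is a rough sketch of one way to do it. Jsoup re-indents and normalizes the markup it prints, so a line-by-line diff would be noisy. Instead, this checks whether a few pieces of text you expect (for example the doctype and the page heading) show up in each version. The class name MarkerCheck, the method missingMarkers, and the two sample HTML strings are made up for illustration; in practice you would paste in the text collected from the URLConnection loop and from doc.toString().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class MarkerCheck {

    /** Returns the markers that do NOT occur in the given HTML (case-insensitive). */
    static List<String> missingMarkers(String html, List<String> markers) {
        String haystack = html.toLowerCase(Locale.ROOT);
        List<String> missing = new ArrayList<>();
        for (String marker : markers) {
            if (!haystack.contains(marker.toLowerCase(Locale.ROOT))) {
                missing.add(marker);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the two downloads being compared.
        String rawHtml = "<!DOCTYPE html><html><body>"
                + "<h1>Manufacturing Companies in Minnesota (MN)</h1></body></html>";
        String jsoupHtml = "<html><head></head><body></body></html>";

        // Text that should be present if the full page was delivered.
        List<String> markers = Arrays.asList(
                "<!doctype", "Manufacturing Companies in Minnesota");

        System.out.println("Missing from raw fetch:    " + missingMarkers(rawHtml, markers));
        System.out.println("Missing from Jsoup output: " + missingMarkers(jsoupHtml, markers));
    }
}
```

If the markers are missing from both versions, the server is probably sending your program a different page than it sends the browser, and the parser is not the place to look. If they are missing only from the Jsoup output, the difference is happening somewhere in the Jsoup request or parsing step.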
