  1. #1
    FunkyProg is offline Member
    Join Date
    Jan 2011
    Posts
    9
    Rep Power
    0

    Page source not the same: getInputStream() vs Ctrl+U

    Hi all!

    I'm using a program to access the page source of a website. I've noticed that the content is not the same as when I do it manually using Ctrl+U in the browser. Some important data are missing in the page source when I use the program, which relies on the getInputStream() method. Do you know why, and how can I fix it?

    Java Code:
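    // Needs: java.io.PrintWriter, java.net.URL, java.net.URLConnection, java.util.Scanner
    String site = "http://magasin.iga.net/Browse/All.aspx?os=1";

    try {
        URL url = new URL(site);
        URLConnection connection = url.openConnection();
        connection.connect();

        // Copy the response to a file, one line at a time
        // ("page.html" is only a placeholder output name)
        try (Scanner in = new Scanner(connection.getInputStream());
             PrintWriter out = new PrintWriter("page.html")) {
            while (in.hasNextLine()) {
                String line = in.nextLine();
                out.println(line);
            }
        }
    } catch (Exception e) {
        System.out.println(e);
    }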
    Many thanks for helping me!

    Funky

  2. #2
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,785
    Blog Entries
    7
    Rep Power
    21


    Quote Originally Posted by FunkyProg View Post
    Hi all!

    I'm using a program to access the page source of a website. I've noticed that the content is not the same as when I do it manually using Ctrl+U in the browser. Some important data are missing in the page source when I use the program, which relies on the getInputStream() method. Do you know why, and how can I fix it?
    Maybe (part of) what you see in your browser is generated by Javascript or whatever; that would certainly not be part of the source (what is received by the browser or your program).

    kind regards,

    Jos
    cenosillicaphobia: the fear for an empty beer glass

  3. #3
    pbrockway2 is offline Moderator
    Join Date
    Feb 2009
    Location
    New Zealand
    Posts
    4,585
    Rep Power
    12


    Do you know why, and how can I fix it?
    Ctrl+U is "View Source", right?

    It's hard to say without seeing what's missing but you should bear in mind that what the web server returns when you hit the url depends on information that is sent with the request. Web pages commonly use the "referer" (the page you were on when you clicked the link), the operating system and browser being used to alter the page content.

    These things can be spoofed when you create the URL connection, but it's not really polite to do so, as there may be good reasons why page content is changed in this way, including a wish by the people who create the site's data to make it available to humans using browsers but not to automated software such as you are using.
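
    As a rough illustration of the point above, request headers such as User-Agent and Referer can be set on the connection object before it connects. This is only a minimal sketch: the class name, the user-agent string and the referer value are placeholders chosen for illustration, not values known to change what this particular site returns.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class BrowserHeadersSketch {

    // Creates the connection object and sets browser-like request headers on it.
    // url.openConnection() only creates the object; nothing is sent until
    // connect() or getInputStream() is called, so headers can still be set here.
    static URLConnection openWithHeaders(String site, String userAgent, String referer)
            throws IOException {
        URLConnection connection = new URL(site).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        connection.setRequestProperty("Referer", referer);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        URLConnection connection = openWithHeaders(
                "http://magasin.iga.net/Browse/All.aspx?os=1",
                "Mozilla/5.0 (placeholder user agent)", // assumption: any browser-like string
                "http://magasin.iga.net/");             // assumption: illustrative referer
        // Still not connected, so the configured values can be read back.
        System.out.println(connection.getRequestProperty("User-Agent"));
        System.out.println(connection.getRequestProperty("Referer"));
    }
}
```

    Whether sending such headers is appropriate for a given site is a separate question, as noted above.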

  4. #4
    FunkyProg is offline Member
    Join Date
    Jan 2011
    Posts
    9
    Rep Power
    0

    Missing content & reason to access it

    Thank you guys for your replies!

    Maybe (part of) what you see in your browser is generated by Javascript or whatever; that would certainly not be part of the source (what is received by the browser or your program).
    I guess there is some Javascript involved, as I can see this word in the page source viewed with Ctrl+U. If some content is not in the "real" source, where is it?

    what the web server returns when you hit the url depends on information that is sent with the request. Web pages commonly use the "referer" (the page you were on when you clicked the link), the operating system and browser being used to alter the page content.
    Interesting answer, but I'm wondering if that's possible here, since I access the URL directly from a blank page when I open Firefox. Plus, I clear all cookies every time I close the browser (no referer?).

    It's hard to say without seeing what's missing
    If you access this URL :
    XML Code:
    http://magasin.iga.net/Browse/All.aspx?os=1
    (This is a French grocery store's website)

    you'll see that the information about products is in the "Ctrl+U" page source (e.g. LADY SPEED STICK), but it is missing from the source my program receives.

    a wish by the people who create the site's data to make it available to humans using browsers but not automated software such as you are using.
    Don't be afraid! I just want to know about the weekly rebates. The only difference between reading the flyers manually and using automated software is the loss of my precious time. Okay, you could argue that they want me to go through all the flyers so that I see more products and buy more, but... let's forget the ethical question.

    This discussion is becoming quite interesting... I'm looking forward to hearing your replies.

    Funky

  5. #5
    cultclassic is offline Member
    Join Date
    Mar 2011
    Posts
    64
    Rep Power
    0


    Quote Originally Posted by FunkyProg View Post
    I guess there is some Javascript involved, as I can see this word in the page source viewed with Ctrl+U. If some content is not in the "real" source, where is it?
    Funky
    Like JosAH said, some of the content may be rendered by JavaScript. A lot of modern web pages do that. They send a template HTML page and a script first; the script runs in the browser, communicates back to the server (AJAX), brings in more data and fills in the DOM of the template page. This lets the site update content dynamically.

    I went to your URL with JavaScript turned off. The server redirected me to this page:
    Témoins requis ("Cookies required")
    So apparently they are using JavaScript to test whether cookies are enabled. In fact they set two cookies called
    Commerece_TestPersistentCookie
    and
    Commerece_TestSessionCookie
    both with the value TestCookie, plus several more.

    Here's what I suggest:

    Go to your browser and note all the cookies set by iga.net.
    Use URLConnection.setRequestProperty to set the same cookie values, make the connection and see what happens. This worked for me when I had to do some screen scraping.

    If, on the other hand, the content is brought in by AJAX, you'll have to look at the AJAX JavaScript calls. AJAX usually communicates in some structured data format, like XML or JSON. So if that is the case, you may get lucky and see the item names and prices in a parseable format.
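
    One way to tell which of the two cases applies is to search the raw response for a product name that is visible in the browser. The sketch below is only an illustration under stated assumptions: the class and method names are made up, UTF-8 is an assumed encoding, and a hard-coded template page stands in for connection.getInputStream() so that it runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class RawSourceCheckSketch {

    // Reads the whole stream as text and reports whether it contains the marker.
    // Assumption: the response is UTF-8; the real site may use another encoding.
    static boolean containsMarker(InputStream body, String marker) {
        try (Scanner in = new Scanner(body, "UTF-8")) {
            in.useDelimiter("\\A"); // \A matches only at the start, so next() returns everything
            String text = in.hasNext() ? in.next() : "";
            return text.contains(marker);
        }
    }

    public static void main(String[] args) {
        // Stand-in for connection.getInputStream(): a template page with no product data.
        String template = "<html><body><div id=\"products\"></div>"
                + "<script src=\"app.js\"></script></body></html>";
        InputStream body = new ByteArrayInputStream(template.getBytes(StandardCharsets.UTF_8));
        System.out.println(containsMarker(body, "LADY SPEED STICK")); // prints false
    }
}
```

    If the marker is missing from the real response, the data is either being withheld by the server (for example because of the cookie check) or loaded afterwards by a script; repeating the check after setting the cookies helps to separate the two.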

  6. #6
    FunkyProg is offline Member
    Join Date
    Jan 2011
    Posts
    9
    Rep Power
    0

    Already connected

    Use URLConnection.setRequestProperty to set the same cookie values, make the connection and see what happens.
    I couldn't understand anything from what I've read about AJAX, XML and DOM, so I decided to set cookies as you suggested. Thankfully, I found a website that explains how to do it in detail:

    XML Code:
    http://www.hccp.org/java-net-cookie-how-to.html
    So I added these lines into my code :

    Java Code:
    connection.setRequestProperty("Cookie", "Commerce_TestPersistentCookie=TestCookie");
    connection.setRequestProperty("Cookie", "Commerce_TestSessionCookie=TestCookie");
    Then I get this error : "java.lang.IllegalStateException: Already connected"

    In the website cited above, it says I must use setRequestProperty() before connecting. Since the "URLConnection connection" must be initialized before using "connection.setRequestProperty", the only way not to be connected would be to disconnect :

    Java Code:
    String site = "http://magasin.iga.net/Browse/All.aspx?os=1";
    try {
        URL url = new URL(site);
        URLConnection connection = url.openConnection();
        // (disconnect here)
        connection.setRequestProperty("Cookie", "Commerce_TestPersistentCookie=TestCookie");
        connection.setRequestProperty("Cookie", "Commerce_TestSessionCookie=TestCookie");
    } catch (Exception e) {
        System.out.println(e);
    }
    Connecting, then disconnecting and connecting again seems weird to me. Plus, I don't see this on the above-mentioned website. If you read the sample code provided, you'll see that a connection is opened prior to calling setRequestProperty(). BTW, I don't want to run a program that I don't understand, so I can't tell whether this sample code works as claimed.

  7. #7
    FunkyProg is offline Member
    Join Date
    Jan 2011
    Posts
    9
    Rep Power
    0


    I forgot to mention that I did try to disconnect prior to calling connection.setRequestProperty(), even though it wasn't an elegant solution. However, the connection.disconnect() function didn't seem to exist.

    So my implicit question was : How do I set the cookie values?

    Cultclassic, I really want to try your suggestion, but I need a little help! :)


    Thanks

    Funky
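
    A possible way to set the cookie values, offered as a sketch rather than a tested fix for this site: url.openConnection() only creates the connection object, and the request is not sent until connect() or getInputStream() is called, so setRequestProperty() can be called in between without any disconnecting. The "Already connected" error is what setRequestProperty() throws once the connection has been made, which suggests the two lines were placed after connection.connect(). Note also that setRequestProperty() replaces any earlier value for the same key, so both cookies have to go into a single Cookie header separated by "; ". The class and method names below are made up, and the cookie names are taken from post #6; they should be checked against what the browser actually shows.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

class CookieHeaderSketch {

    // Joins name/value pairs into one Cookie header value, e.g. "a=1; b=2".
    static String buildCookieHeader(Map<String, String> cookies) {
        StringBuilder header = new StringBuilder();
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            if (header.length() > 0) {
                header.append("; ");
            }
            header.append(cookie.getKey()).append('=').append(cookie.getValue());
        }
        return header.toString();
    }

    // Creates the connection object and sets the Cookie header before connecting.
    static URLConnection openWithCookies(String site, Map<String, String> cookies)
            throws IOException {
        URLConnection connection = new URL(site).openConnection(); // not connected yet
        connection.setRequestProperty("Cookie", buildCookieHeader(cookies));
        return connection; // the caller then calls connect() or getInputStream()
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> cookies = new LinkedHashMap<>();
        // Names as typed in post #6; verify them in the browser's cookie list.
        cookies.put("Commerce_TestPersistentCookie", "TestCookie");
        cookies.put("Commerce_TestSessionCookie", "TestCookie");

        // Prints: Commerce_TestPersistentCookie=TestCookie; Commerce_TestSessionCookie=TestCookie
        System.out.println(buildCookieHeader(cookies));

        URLConnection connection =
                openWithCookies("http://magasin.iga.net/Browse/All.aspx?os=1", cookies);
        // connection.getInputStream() would now send the request with both cookies.
    }
}
```

    As for disconnect(): it is declared on HttpURLConnection, not on URLConnection, which is why it did not seem to exist. It is not needed here.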
