- 03-30-2011, 06:59 AM #1
Member
Page source not the same : getInputStream() vs Ctrl+U
Hi all!
I'm using a program to access the page source of a website. I've noticed that the content is not the same as when I do it manually using Ctrl+U in the browser. Some important data are missing in the page source when I use the program, which relies on the getInputStream() method. Do you know why, and how can I fix it?
Many thanks for helping me!
Java Code:
    String site = "http://magasin.iga.net/Browse/All.aspx?os=1";
    try {
        URL url = new URL(site);
        URLConnection connection = url.openConnection();
        connection.connect();
        Scanner in = new Scanner(connection.getInputStream());
        while (in.hasNextLine()) {
            String line = in.nextLine();
            // (print line to a file)
        }
    } catch (Exception e) {
        System.out.println(e);
    }
Funky
- 03-30-2011, 07:11 AM #2
When people rob a bank they get a penalty; when banks rob people they get a bonus.
- 03-30-2011, 07:15 AM #3
Moderator
Ctrl+U is "View Source", right?
Quote: Do you know why, and how can I fix it?
It's hard to say without seeing what's missing but you should bear in mind that what the web server returns when you hit the url depends on information that is sent with the request. Web pages commonly use the "referer" (the page you were on when you clicked the link), the operating system and browser being used to alter the page content.
These things can be spoofed when you create the URL connection, but it's not really polite to do so, as there may be good reasons why page content is changed in this way (including a wish by the people who create the site's data to make it available to humans using browsers but not automated software such as you are using).
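For reference, here is a minimal sketch of how such request headers could be attached to a URLConnection before it connects. The class name, the method name, the User-Agent string, and the Referer value are all placeholders invented for illustration; nothing in this thread confirms which headers (if any) the site actually checks.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

public class BrowserHeadersSketch {

    /**
     * Opens (but does not yet connect) a URLConnection and attaches
     * browser-like request headers to it.
     */
    public static URLConnection openWithHeaders(String site, String userAgent, String referer)
            throws IOException {
        // openConnection() only creates the connection object; nothing is sent yet,
        // so request properties can still be set.
        URLConnection connection = URI.create(site).toURL().openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        connection.setRequestProperty("Referer", referer);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        // Both header values below are assumptions: copy the real User-Agent
        // from your own browser, and pick a Referer that matches how you
        // normally reach the page.
        URLConnection connection = openWithHeaders(
                "http://magasin.iga.net/Browse/All.aspx?os=1",
                "Mozilla/5.0 (placeholder)",
                "http://magasin.iga.net/");
        System.out.println("User-Agent to send: " + connection.getRequestProperty("User-Agent"));
    }
}
```

The headers are only transmitted once the connection is actually made, for example by calling getInputStream() on the returned object.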
- 03-30-2011, 05:16 PM #4
Member
Missing content & reason to access it
Thank you guys for your replies!
Quote: Maybe (part of) what you see in your browser is generated by Javascript or whatever; that would certainly not be part of the source (what is received by the browser or your program).
I guess there is some JavaScript involved, as I can see this word in the page source viewed with Ctrl+U. If some content is not in the "real" source, where is it?
Quote: what the web server returns when you hit the url depends on information that is sent with the request. Web pages commonly use the "referer" (the page you were on when you clicked the link), the operating system and browser being used to alter the page content.
Interesting answer, but I wonder whether it applies here, since I access the URL directly from a blank page when I open Firefox. Plus, I clear all cookies every time I close the browser (no referer?).
Quote: It's hard to say without seeing what's missing
If you access this URL (a French grocery store's website):
XML Code: http://magasin.iga.net/Browse/All.aspx?os=1
you'll see that the product information is in the "Ctrl+U" page source (e.g. LADY SPEED STICK), but it is missing from the "real" source.
Quote: a wish by the people who create the site's data to make it available to humans using browsers but not automated software such as you are using.
Don't be afraid! I just want to know about the weekly rebates. The only difference between reading the flyers manually and using automated software is the loss of my precious time. Okay, you could argue that they want me to go through all the flyers so that I see more products and buy more, but... let's forget the ethical question.
This discussion is becoming quite interesting... I'm looking forward to hearing your replies.
Funky
- 03-30-2011, 06:13 PM #5
Member
Like JosAH said, some of the content may be rendered by JavaScript. A lot of modern web pages do that. They send a template HTML page and a script first; the script runs in the browser, communicates back to the server (AJAX), brings in more data, and fills in the DOM of the template page. This lets the site update content dynamically.
I went to your URL with JavaScript turned off. The server redirected me to this page:
Témoins requis (French for "Cookies required")
So apparently they are using JavaScript to test if your cookies are enabled. In fact they have two cookies called
Commerece_TestPersistentCookie
and
Commerece_TestSessionCookie
with values TestCookie,
and several more.
Here's what I suggest:
Go to your browser and note all the cookies set by iga.net
Use URLConnection.setRequestProperty to set the same cookie values, make the connection and see what happens. This worked for me when I had to do some screen scraping.
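The two steps above could be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe for this site: the class and helper names are made up, and the cookie names are copied as spelled in this post, so check the exact names and values in your browser. All cookies go into a single Cookie header, separated by "; ", because a second setRequestProperty call with the same key replaces the earlier value.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class CookieHeaderSketch {

    /** Joins name=value pairs into one Cookie header value, e.g. "a=1; b=2". */
    public static String buildCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            joiner.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) throws IOException {
        // Names and values as listed in this post; verify them in your browser.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("Commerece_TestPersistentCookie", "TestCookie");
        cookies.put("Commerece_TestSessionCookie", "TestCookie");

        // The header is set on the freshly opened, not yet connected, object.
        URLConnection connection =
                URI.create("http://magasin.iga.net/Browse/All.aspx?os=1").toURL().openConnection();
        connection.setRequestProperty("Cookie", buildCookieHeader(cookies));

        System.out.println("Cookie header to send: " + buildCookieHeader(cookies));
        // Reading connection.getInputStream() from here on would send the request.
    }
}
```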
If, on the other hand, the content is brought in by AJAX, you'll have to look at the AJAX JavaScript calls. AJAX usually communicates in some structured data format, like XML or JSON. So if that is the case, you may get lucky and see the item names and prices in a parseable format....
- 03-31-2011, 06:08 AM #6
Member
Already connected
Quote: Use URLConnection.setRequestProperty to set the the same cookie values, make the connection and see what happens.
I couldn't understand anything from what I've read about AJAX, XML and DOM, so I decided to set cookies as you suggested. Thankfully, I found a website that explains how to do it in detail:
XML Code: http://www.hccp.org/java-net-cookie-how-to.html
So I added these lines to my code:
Java Code:
    connection.setRequestProperty("Cookie", "Commerce_TestPersistentCookie=TestCookie");
    connection.setRequestProperty("Cookie", "Commerce_TestSessionCookie=TestCookie");
Then I get this error: "java.lang.IllegalStateException: Already connected"
The website cited above says I must call setRequestProperty() before connecting. Since the "URLConnection connection" must be initialized before using "connection.setRequestProperty", the only way not to be connected would be to disconnect:
Java Code:
    String site = "http://magasin.iga.net/Browse/All.aspx?os=1";
    try {
        URL url = new URL(site);
        URLConnection connection = url.openConnection();
        // (disconnect here)
        connection.setRequestProperty("Cookie", "Commerce_TestPersistentCookie=TestCookie");
        connection.setRequestProperty("Cookie", "Commerce_TestSessionCookie=TestCookie");
    }
Connecting, then disconnecting and connecting again seems weird to me. Plus, I don't see this on the above-mentioned website. If you read the sample code provided there, you'll see that a connection is opened prior to using the setRequestProperty() function. BTW, I don't want to run a program that I don't understand, so I can't tell whether this sample code works as claimed.
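A note on the ordering, with a small sketch. In the standard library, openConnection() only creates the connection object; the network connection is made later, by connect() or implicitly by getInputStream(). Request properties can therefore be set right after openConnection() with no disconnecting, and "Already connected" appears when they are set after one of those calls. The example below demonstrates this against a throwaway local server so it runs without touching the real site; the class and method names are invented for illustration, and the cookie string is a dummy value.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class ConnectOrderSketch {

    /** Starts a local test server that replies with the Cookie header it received. */
    public static HttpServer startEchoServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String cookie = String.valueOf(exchange.getRequestHeaders().getFirst("Cookie"));
            byte[] body = cookie.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    /** Builds the URL of the local test server. */
    public static URL urlOf(HttpServer server) throws IOException {
        return URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/").toURL();
    }

    /** Sets the header before any connection is made, then reads the reply. */
    public static String fetchWithCookie(URL url, String cookieHeader) throws IOException {
        URLConnection connection = url.openConnection();        // object only, no socket yet
        connection.setRequestProperty("Cookie", cookieHeader);  // allowed: not connected
        try (InputStream in = connection.getInputStream()) {    // connects implicitly
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    /** Returns true if setting a header after connect() throws IllegalStateException. */
    public static boolean rejectsHeaderAfterConnect(URL url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.connect();
        try {
            connection.setRequestProperty("Cookie", "late=value");
            return false;
        } catch (IllegalStateException alreadyConnected) {
            return true;
        } finally {
            // disconnect() is declared on HttpURLConnection, not on URLConnection.
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startEchoServer();
        try {
            URL url = urlOf(server);
            System.out.println("Server saw: " + fetchWithCookie(url, "a=1; b=2"));
            System.out.println("Late header rejected: " + rejectsHeaderAfterConnect(url));
        } finally {
            server.stop(0);
        }
    }
}
```

Applied to the original program, this would mean moving the setRequestProperty() call to the line directly after url.openConnection() and before connection.connect().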
- 04-05-2011, 01:50 AM #7
Member
I forgot to mention that I did try to disconnect prior to calling connection.setRequestProperty(), even though it wasn't an elegant solution. However, the connection.disconnect() function didn't seem to exist.
So my implicit question was : How do I set the cookie values?
Cultclassic, I really want to try your suggestion, but I need a little help! :)
Thanks
Funky