Page source not the same: getInputStream() vs Ctrl+U
I'm using a program to read the page source of a website, and I've noticed the content is not the same as what I get manually with Ctrl+U in the browser. Some important data is missing from the source my program retrieves, which relies on the getInputStream() method. Do you know why, and how I can fix it?
Many thanks for helping me!
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

String site = "http://magasin.iga.net/Browse/All.aspx?os=1";
URL url = new URL(site);
URLConnection connection = url.openConnection();
Scanner in = new Scanner(connection.getInputStream());
while (in.hasNextLine()) {
    String line = in.nextLine();
    // print line to a file
}
Missing content & reason to access it
Thank you guys for your replies!
Interesting answer, but I wonder if that's possible, since I access the URL directly from a blank page when I open Firefox. Plus, I clear all cookies every time I close the browser (so no referer?).
What the web server returns when you hit the URL depends on information sent along with the request. Web servers commonly use the "Referer" header (the page you were on when you clicked the link), the operating system, and the browser being used to alter the page content.
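If the server really does vary its output based on request headers, you can make the program's request look more like the browser's by setting those headers on the URLConnection before reading the stream. A sketch along those lines, reusing the URL from the question; the User-Agent and Referer values are placeholders you would copy from your own browser, not necessarily what the site expects:

import java.net.URL;
import java.net.URLConnection;

public class BrowserHeaders {
    // Build a connection that sends browser-like headers. The values below are
    // examples; copy the real ones your Firefox sends (visible in its dev tools).
    static URLConnection browserLike(String site) throws Exception {
        URLConnection connection = new URL(site).openConnection();
        connection.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0");
        connection.setRequestProperty("Referer", "http://magasin.iga.net/");
        return connection;
    }

    public static void main(String[] args) throws Exception {
        URLConnection c = browserLike("http://magasin.iga.net/Browse/All.aspx?os=1");
        System.out.println(c.getRequestProperty("User-Agent"));
        // then read c.getInputStream() with a Scanner, as in the original code
    }
}

Note that setRequestProperty must be called before getInputStream(), because the request is sent when the stream is first opened.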
It's hard to say without seeing what's missing.

If you access this URL (it's a French grocery store's website), you'll see that the information about products (e.g. LADY SPEED STICK) is in the "Ctrl+U" page source, but it is missing from the source my program retrieves.
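One way to pin down exactly what's missing is to dump what getInputStream() returns to a file and diff it against the page source saved from the browser (Ctrl+U, then save). A small diagnostic sketch; the file names are just examples:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.util.Scanner;

public class DumpSource {
    // Copy a stream to a file line by line; returns the number of lines written.
    static int dump(InputStream in, String path) throws IOException {
        int count = 0;
        try (Scanner sc = new Scanner(in);
             PrintWriter out = new PrintWriter(path)) {
            while (sc.hasNextLine()) {
                out.println(sc.nextLine());
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Real use: dump(connection.getInputStream(), "program-source.html");
        // then diff program-source.html against the Ctrl+U source saved from Firefox.
        int lines = dump(new ByteArrayInputStream("<html>\n</html>\n".getBytes()),
                "demo.html");
        System.out.println(lines); // prints 2
    }
}

If the product names never appear in the dumped file, the server is sending different HTML to the program; if they appear but in a different form, the content may be loaded or rewritten by JavaScript, which getInputStream() does not execute.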
Don't be afraid! I just want to know about the weekly rebates. The only difference between reading the flyers manually and using automated software is the precious time I'd lose doing it by hand. Okay, you could argue that they want me to browse all the flyers so that I see more products and buy more, but... let's set the ethical question aside.
...a wish by the people who created the site's data to make it available to humans using browsers, but not to automated software such as you are using.
This discussion is becoming quite interesting... I'm looking forward to hearing your replies.