  1. #1
    nechalus (Member)

    Extract code of a web page

    Hello everyone,
    I am trying to extract the source code of a web page in Java.
    My class takes the link (http://...) of the page as an argument and creates an output text file containing the source code.
    The program works very well except for Google.
    When I pass a link to a Google results page, the output file does not contain the true code for that page.

    To see what I mean by the true source of a Google results page, compare the code shown by Firefox (View -> Source) with the code shown by Google Chrome (Developer options -> View source).
    By a Google results page I mean the page that appears when you run a search.

    My Java code is as follows:


    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLConnection;

    public class test {

        // Downloads the raw response body of the given address into source.txt.
        public static void getIpFrom(String adresse) {
            try {
                URL url = new URL(adresse);
                URLConnection uc = url.openConnection();

                InputStream in = uc.getInputStream();
                FileOutputStream fos = new FileOutputStream(new File("source.txt"));

                // Copy the response byte by byte until end of stream.
                int n;
                while ((n = in.read()) >= 0) {
                    fos.write(n);
                }

                in.close();
                fos.close();
            } catch (MalformedURLException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            getIpFrom("web link");
        }
    }
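
    One variant worth trying, on the assumption (not established in this thread) that part of the difference comes from Google treating the default "Java/1.x" User-Agent differently from a real browser: URLConnection lets you send a browser-like User-Agent with setRequestProperty before opening the stream. The URL and User-Agent string below are illustrative only, a minimal sketch rather than a verified fix.

    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;

    public class FetchWithUserAgent {
        public static void main(String[] args) throws Exception {
            // Illustrative URL; substitute the results page you are fetching.
            URL url = new URL("https://www.google.com/search?q=java");
            URLConnection uc = url.openConnection();
            // Assumption: the default "Java/..." agent may be served different
            // or blocked content, so present a browser-like User-Agent instead.
            uc.setRequestProperty("User-Agent", "Mozilla/5.0");
            InputStream in = uc.getInputStream();
            FileOutputStream fos = new FileOutputStream("source.txt");
            int n;
            while ((n = in.read()) >= 0) {
                fos.write(n);
            }
            in.close();
            fos.close();
        }
    }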

  2. #2
    travishein (Senior Member)

    It is likely that Google, and some other web sites, do not serve HTML content as it appears in Firefox -> View Source, but instead use JavaScript to manipulate the DOM. In that case, what we see in Firefox's View Source is the current document object, but this does not reflect how it was created. If possible, use a plugin such as "Web Developer" to disable all JavaScript, then have Firefox load the page.
    The result you see would likely be some HTML that loads a few JavaScript includes, and those scripts contain the code that performs the transformations on the DOM to render the final result.
    Relatedly, even if you were to add a JavaScript interpreter to your document-fetching utility, I am not sure how effective that would be unless you also had a DOM in which to read the effective 'page source' after the JavaScript transformations have occurred. I would guess that starting from a full Java-based HTTP browser, such as the lobobrowser, would be a better option than evolving the entire stack of features required in a functional browser.
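
    As a concrete illustration of that suggestion, here is a minimal sketch using HtmlUnit, a headless Java browser that executes JavaScript. HtmlUnit is my substitution for the Lobo browser named above, and the package name and options API shown here vary between HtmlUnit versions, so treat them as assumptions to verify against the version you use:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class RenderedSource {
        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();
            // Let the client execute the page's JavaScript before we read the DOM.
            webClient.getOptions().setJavaScriptEnabled(true);
            // Illustrative URL; any script-heavy page works the same way.
            HtmlPage page = webClient.getPage("https://www.google.com/search?q=java");
            // asXml() serializes the DOM *after* JavaScript has modified it,
            // which is closer to what Chrome's developer view shows.
            System.out.println(page.asXml());
            webClient.close();
        }
    }

    The trade-off matches the point above: rather than bolting a JavaScript interpreter onto a raw URLConnection fetch, this reuses an existing browser stack (HTTP, DOM, script engine) that already performs the transformations.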
