  1. #1 africanhacker (Senior Member)

    Crawl site, find new articles and then email HTML

    Hi guys, I need some advice from the senior members.

    I want to write a program that will crawl about 7 websites, search for new articles, copy that content, and then send it to an email address.

    I'm still at the stage of trying to form a mental picture of the application and to work out how things will be done.

    When a new article is found, pulling out the data will be quite straightforward. The question I have, though, is this: how can I use Java to find out whether a new article has been posted? These are rather simple sites where a new link pops up when an article is posted.

    Secondly, I would appreciate some suggestions on how to implement the whole thing. I would like my program to run these checks every 30 minutes.

  2. #2 Solarsonic (Senior Member)

    Well, what you're going to do is use a URLConnection and a BufferedReader. Use substring() and other String methods to work with the site's HTML, finding and copying what you want. The 30-minute interval is pretty straightforward: use Thread.sleep(1800000) inside a loop.
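    A minimal sketch of that approach; the site address and class name are placeholders, and the HTML-matching step is left as a comment:

    Java Code:
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    public class SiteChecker {

        // Placeholder address; substitute the sites you want to crawl.
        private static final String SITE = "http://example.com/articles";

        public static void main(String[] args) throws Exception {
            while (true) {
                String html = fetch(SITE);
                // ... scan 'html' with substring()/indexOf() for new links ...
                Thread.sleep(1800000); // wait 30 minutes between checks
            }
        }

        // Downloads a page over a URLConnection and returns it as one string.
        private static String fetch(String address) throws Exception {
            URLConnection conn = new URL(address).openConnection();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            in.close();
            return sb.toString();
        }
    }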
    Use the javax.mail package to email.
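    For the e-mail step, a minimal javax.mail sketch; the SMTP host and addresses are placeholders, and a real server will usually also want authentication:

    Java Code:
    import java.util.Properties;
    import javax.mail.Message;
    import javax.mail.Session;
    import javax.mail.Transport;
    import javax.mail.internet.InternetAddress;
    import javax.mail.internet.MimeMessage;

    public class Mailer {

        // Sends an HTML body to one recipient. Host and addresses are
        // placeholders; fill in your own SMTP server and accounts.
        public static void sendHtml(String subject, String html) throws Exception {
            Properties props = new Properties();
            props.put("mail.smtp.host", "smtp.example.com"); // assumed host
            Session session = Session.getInstance(props);

            Message msg = new MimeMessage(session);
            msg.setFrom(new InternetAddress("crawler@example.com"));
            msg.setRecipient(Message.RecipientType.TO,
                    new InternetAddress("you@example.com"));
            msg.setSubject(subject);
            msg.setContent(html, "text/html"); // send the article as HTML
            Transport.send(msg);
        }
    }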

    If you have any specific questions or problems feel free to post them.

  3. #3 imorio (Senior Member)

    I would look for a specific place where the latest article can be found. Then check whether that article is the last one sent; if not, send it.

  4. #4 africanhacker (Senior Member)

    Guys, thanks so much for your contributions here. Let's assume I get to a page and find a div or table where new articles appear:

    HTML Code:
    <div id="newArticles">
        Link to article 1.
        Link to article 2.
        Link to article 3.
    </div>
    My biggest challenge so far is getting Java to 'click' on a URL and then go to that page.

    Also, how would my program know that it has already read a given link? Would I have to store a list of links somewhere and have the program read it back in and compare in some way?
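    As the following replies point out, there is no clicking involved: you read the href out of the markup and open a new connection to that address. A naive sketch that pulls hrefs out of markup like the snippet above with a regex (the class name is made up; a real HTML parser would be more robust):

    Java Code:
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LinkExtractor {

        // Very naive href scan; workable for the simple, stable pages
        // described here, but fragile against messy real-world HTML.
        private static final Pattern HREF =
                Pattern.compile("<a\\s+[^>]*href=\"([^\"]+)\"");

        public static List<String> extractLinks(String html) {
            List<String> links = new ArrayList<String>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                links.add(m.group(1)); // the captured URL
            }
            return links;
        }
    }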

  5. #5 africanhacker (Senior Member)

    @imorio, clever, clever, clever. I would not have thought of that. So all I would need to do is store the last article sent and use that as my benchmark. Nice.

  6. #6 imorio (Senior Member)

    Again, if the links on the site are in chronological order, you only need to store the last link you visited. Then go down the list until you find that link (and make sure to handle the case where so many articles have appeared that the link isn't in the list anymore) and visit any links above it, if there are any. I haven't got much experience with implementing things like this, but my first try would be to lift the link out of the source code as a string; then you have the address to visit. Then load the article out of the source code of that page.
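    A minimal sketch of that comparison, assuming the scraped links arrive newest-first in a List (the class and method names are illustrative):

    Java Code:
    import java.util.ArrayList;
    import java.util.List;

    public class NewLinkFinder {

        // Given the links scraped from the page (newest first) and the last
        // link already sent, return only the links above it. If the stored
        // link has scrolled off the page entirely, everything counts as new.
        public static List<String> findNew(List<String> scraped, String lastSeen) {
            List<String> fresh = new ArrayList<String>();
            for (String link : scraped) {
                if (link.equals(lastSeen)) {
                    break; // reached the newest article we already handled
                }
                fresh.add(link);
            }
            return fresh;
        }
    }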

    If you want the articles sent to you as soon as possible after they appear, it would make an interesting addition to check each site in a separate thread. Then make the time between checks vary between sites and depend on the number of links retrieved. If a check finds 0 new articles, the time between checks should increase, and if it finds multiple articles in one go, the time between checks should decrease.
    This makes the program check sites that update frequently more often than sites that update less frequently. The downside is that it will mean one e-mail per article (if the interval-determining algorithm works well), whereas if you check all sites at regular intervals, you can send one batch with all the new articles from that interval.
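    A sketch of such a per-site watcher thread; the doubling/halving policy, the bounds, and the checkSite stub are illustrative choices rather than tested tuning:

    Java Code:
    // One watcher thread per site, with an adaptive check interval.
    public class SiteWatcher extends Thread {

        private final String site;
        private long interval = 30 * 60 * 1000; // start at 30 minutes

        public SiteWatcher(String site) {
            this.site = site;
        }

        @Override
        public void run() {
            while (true) {
                try {
                    int newArticles = checkSite(site);
                    if (newArticles == 0) {
                        // quiet site: back off, capped at 2 hours
                        interval = Math.min(interval * 2, 2 * 60 * 60 * 1000L);
                    } else if (newArticles > 1) {
                        // busy site: check more often, floor of 5 minutes
                        interval = Math.max(interval / 2, 5 * 60 * 1000L);
                    }
                    Thread.sleep(interval);
                } catch (InterruptedException e) {
                    return; // allow the thread to be shut down
                }
            }
        }

        private int checkSite(String site) {
            // placeholder: download the page, extract links, diff against
            // the last link seen, mail anything new, return the count
            return 0;
        }
    }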

    This also raises the question of why use e-mail at all. Why not display the articles in a list in a GUI and view them in the program itself? It would require less networking and fewer e-mails, but more GUI work.

    All in all, enough to do.

  7. #7 africanhacker (Senior Member)

    The more I read your answers, the more stupid I feel. It makes perfect sense to read the URL address, take it, and then visit it. Why am I thinking like a human being? Java does not have to click anything :(

  8. #8
    ozzyman's Avatar
    ozzyman is offline Senior Member
    Join Date
    Mar 2011
    Location
    London, UK
    Posts
    797
    Blog Entries
    2
    Rep Power
    4

    Default

    download the page
    parse the HTML
    - parser finds link
    - add link to list
    finished parsing
    return content
    follow list
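    One way to realize that outline without hand-rolled string chopping is the HTML parser that ships with the JDK (javax.swing.text.html); a sketch, with the surrounding class name made up:

    Java Code:
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    public class PageCrawler {

        // Downloads a page and collects every <a href="..."> it finds:
        // download the page, parse the HTML, parser finds link,
        // add link to list, return content.
        public static List<String> collectLinks(String address) throws Exception {
            final List<String> links = new ArrayList<String>();
            Reader reader = new InputStreamReader(new URL(address).openStream());
            new ParserDelegator().parse(reader, new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleStartTag(HTML.Tag tag,
                        MutableAttributeSet attrs, int pos) {
                    if (tag == HTML.Tag.A) {
                        Object href = attrs.getAttribute(HTML.Attribute.HREF);
                        if (href != null) {
                            links.add(href.toString()); // add link to list
                        }
                    }
                }
            }, true);
            reader.close();
            return links; // follow each entry with a new connection
        }
    }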

