  1. #1
    Pierced1 is offline Member

    Crawler doesn't start threads

    Hello, I have written a forum crawler for downloading wallpaper images. It works fine with a single thread, but when I try to make it use several threads (to speed it up) it doesn't work.

    I have also tried "(new Thread(new Crawler())).start();", but that only starts new Crawler instances that all process the same pages and save the same pictures over and over again. As the code stands now, it doesn't even start the new threads.

    Here is my code (any other suggestions for improvement are welcome). The methods related to the problem are main() and run(). I would like one thread to take care of one web page (searching, downloading, etc.) while the next thread takes care of the next web page at the same time.

    (It saves images to c:/crawler. Since I can't link directly to a forum because I have too few posts, you have to google "forum devian wallpaper" when it asks for the web address.)

    Java Code:
    /*
     * To change this template, choose Tools | Templates
     * and open the template in the editor.
     */
    
    package crudecrawler;
    
    import com.sun.image.codec.jpeg.JPEGCodec;
    import java.awt.Image;
    import java.awt.image.BufferedImage;
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import javax.imageio.ImageIO;
    import javax.swing.JOptionPane;
    
    
    /**
     *
     * @author Pierced1
     */
    public class Crawler implements Runnable {
        Crawler crawl = new Crawler();
        
        static int nrThreads =0; //number of active threads
        ArrayList<String> thisPage = new ArrayList<String>(); //holds all image-links from one webpage
        static String page; //page to be searched
        static int pageFirst; //startpage
        static int pageLast; //endpage
        int counter=1; //counter of downloaded images, used for naming pics
        static int i; //pagecounter, current webpage
    
        /**
         * (Should) create a new thread on every loop iteration as long as nrThreads is less than or equal to 10; will wait if too many threads are active.
         * @param args
         * @throws IOException
         */
        public static void main(String[] args) throws IOException{
            try {
                page = JOptionPane.showInputDialog("webadress");
                pageFirst = Integer.parseInt(JOptionPane.showInputDialog("Start page"));
                pageLast = Integer.parseInt(JOptionPane.showInputDialog("End page"));
            for (i = 0; pageFirst < pageLast; i++) {
                if(nrThreads<11){
                    Thread t = new Thread();
                    t.start();
                    nrThreads++;
                    System.out.println("thread created"+t);
                }else
                    i--;
            }
            
            } catch (Exception e) {
                System.out.println("thread "+e);
            }
        }
    
        /**
         *starts webpage method and sends with it the current webpage (i).
         */
        public void run() {
            try {
                System.out.println("running "+i);
                webpage(page + i+"/");
                thisPage.clear();
                System.out.flush();
                JOptionPane.showMessageDialog(null, pageFirst + "-" + pageLast + " OK");
            } catch (IOException ex) {
                Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
                System.out.println("run "+page+i );
                nrThreads--;
            }
        }
    
        /**
         * adds image-links to array
         * @param adr
         * @throws IOException
         */
        public void webpage(String adr) throws IOException{
                try {
                URL url = new URL (adr);
                BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
                  String line=null;
                while((line = in.readLine())!=null){
                    if(line.contains("No Thread specified.")){
                        System.out.println("breaking page");
                        break;
                    }
                    if(line.contains(".jpg")){
                        String s = line;
                        int a,b,c;
                        while(s.contains(".jpg")){
                            a = s.indexOf("<a href=\"");
                            c = s.indexOf("<img src=\"");
                            if(a<0&&c<0)
                                break;
                            if(a<0)
                                a=100;
                            if(c<0)
                                c=100;
                            int d = java.lang.Math.min(a,c);
                            b = s.indexOf("\"", d+15);
                            String p = s.substring(d+9, b).replace("\"", "");
                            s = s.substring(b+1, s.length());
                            if(!thisPage.contains(p)&&p.contains(".jpg")&&p.contains("http")&&!p.contains("thumb"))
                                thisPage.add(p);
                        }
                    }
                }
                in.close();
                System.out.flush();
                System.out.println(thisPage);
                    downloadPics();
                } catch (Exception e) {
                    System.out.println(e+" adress="+adr);
                    nrThreads--;
                    
                }
    }
    
        /**
         * Downloads the image to folder
         */
        private void downloadPics() {
            BufferedImage buf=null;
            Image img=null;
            for (String wp : thisPage) {
                try {
                    if(wp.contains(".php")){
                        String s = locatePic(wp);
                        if(s!=null&&!thisPage.contains(s)){
                            buf = ImageIO.read(new URL(s));
                            if(buf.getWidth()>300&&buf.getHeight()>300&&buf.getWidth()<2000&&buf.getHeight()<2000){
                                    File file = new File("C:\\crawler\\"+"g"+i+"_"+counter+".jpg");
                                    JPEGCodec.createJPEGEncoder(new FileOutputStream(file)).encode(buf);
                                    buf.flush();
                                    System.out.println("image saved "+file);
                                    counter++;
                            }else
                                buf.flush();
                        }
                    }else{
                        buf = ImageIO.read(new URL(wp));
                        if(buf.getWidth()>300&&buf.getHeight()>300&&buf.getWidth()<2500&&buf.getHeight()<2500){
                            File file = new File("C:\\crawler\\"+"g"+i+"_"+counter+".jpg");
                            JPEGCodec.createJPEGEncoder(new FileOutputStream(file)).encode(buf);
                            buf.flush();
                            System.out.println("image saved "+file);
                            counter++;
                        }else{
                            buf.flush();
                        }
    
                    }
                } catch (IOException e) {
                                System.out.println("image "+e);
                }
    
            }
        }
    /**
     * locate image in webpage
     * @param string
     * @return
     */
        private String locatePic(String string) {
            String newString = "https://mail.google.com/mail/help/images/logo2.gif"; //if nothing is found
            int x,y,z,c=1,hit=0;
            try {
                //gets first part of address
                x = string.indexOf(".php?");
                String search = string.substring(x-1,x);
                while(!search.contains("/")){
                    search = string.substring(x-c,x);
                    c++;
                    hit=c;
                }
                if(hit>0)
                    hit=hit-1;
                x=x-(hit);
                String first = string.substring(0,x);
    
                //to add source for pic
                URL url = new URL (string);
                BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
                  String line;
                while((line = in.readLine())!=null){
                    if(line.contains(".jpg")&&line.contains("SRC=\"")){
                            y = line.indexOf("SRC=\"");
                            z = line.indexOf("\"", y+10);
                            String last = line.substring(y+5, z);
                            newString=first+"/"+last;
                        }
                    }
                in.close();
                System.out.flush();
                } catch (Exception e) {
                    System.out.println(e+" "+string);
                    nrThreads--;
                }
            return newString;
        }
    
    
    
    }

  2. #2
    mrmatt1111 is offline Senior Member

    When working with threads you must be very careful with shared variables and data. Any data that is shared must be protected from being accessed at the same time by multiple threads by using locks and synchronized methods.
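
    A minimal sketch of what "protected" can mean here, assuming the shared data is a plain counter such as nrThreads. SafeCounter and its run() helper are made-up names for illustration and are not part of the original crawler. Without synchronized, count++ from several threads can lose updates. The lambdas assume Java 8 or later.

```java
public class SafeCounter {
    private int count = 0;

    // synchronized: only one thread at a time may run these methods on the same object
    public synchronized void increment() {
        count++;
    }

    public synchronized int get() {
        return count;
    }

    /** Starts nThreads threads that each call increment() perThread times; returns the final count. */
    public static int run(int nThreads, int perThread) {
        SafeCounter c = new SafeCounter();
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    c.increment();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // wait until this worker has finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return c.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10000)); // prints 40000
    }
}
```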

    I didn't actually debug your code, but you seem to have a number of static variables that would worry me. It seems you need to find a better way to manage which pages are sent to which thread. You could either give each thread one page and start a new thread when it finishes, or you could use a blocking queue so that each thread can safely pull the next page off the queue to consume.
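
    The blocking-queue idea could look roughly like this. It is only a sketch: QueueCrawl, crawl() and the "done" list are hypothetical names, the page URLs are built as base + number + "/" the way the original run() does, and the actual downloading is replaced by a placeholder. It assumes Java 8 or later.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueCrawl {
    /** Fills a queue with page URLs and lets nThreads workers drain it; returns the URLs processed. */
    public static List<String> crawl(String base, int first, int last, int nThreads) {
        BlockingQueue<String> pages = new LinkedBlockingQueue<>();
        for (int p = first; p <= last; p++) {
            pages.add(base + p + "/");
        }
        List<String> done = new CopyOnWriteArrayList<>();
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                String url;
                // poll() hands each URL to exactly one thread and returns null once the queue is empty
                while ((url = pages.poll()) != null) {
                    done.add(url); // placeholder for the real fetch-and-download work
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return done;
    }
}
```

    Because all pages are queued before the workers start, a non-blocking poll() is enough here. If pages were added while the workers run, take() plus some end-of-work signal would be the usual choice.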

  3. #3
    deschutron is offline Member

    1. Why it doesn't run in more than one thread:
    Your code to start each thread is:
    > Thread t = new Thread();
    > t.start();
    The new threads need to be passed instances of your Crawler class. Otherwise they have no connection to your class, and start() runs an empty thread that does nothing.
    In other words, using "new Thread(new Crawler())" is an improvement.
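
    A small self-contained demonstration of point 1, assuming Java 8 or later for the lambda. ThreadTargetDemo and ranWithTarget() are made-up names. The task stands in for Crawler.run().

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ThreadTargetDemo {
    /** Returns true if the task's run() was executed by the started thread. */
    public static boolean ranWithTarget(boolean passTarget) {
        AtomicBoolean ran = new AtomicBoolean(false);
        Runnable task = () -> ran.set(true);
        // Without a target, Thread.run() does nothing, so the task is never called.
        Thread t = passTarget ? new Thread(task) : new Thread();
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ran.get();
    }

    public static void main(String[] args) {
        System.out.println(ranWithTarget(false)); // prints false
        System.out.println(ranWithTarget(true));  // prints true
    }
}
```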

    2. Why they work on the same pages:
    As far as I can tell, the Crawler objects do not communicate with each other, take in any parameters, or use any randomisation. Therefore they cannot help but do the same thing.
    One approach you can use is to create a constructor that takes an argument and sets the page to read,
    e.g. Crawler(String pageIn) { page = pageIn; }
    (for this to work, page must be a non-static field, so that each Crawler keeps its own page).
    Then main() can say

    String[] pages = new String[] {"url1", "url2", "url3"};
    for (int i = 0; i < pages.length; i++) {
        new Thread(new Crawler(pages[i])).start();
    }

    and three pages will be crawled.
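
    Spelled out as a compilable sketch, this might look as follows. PageCrawler, crawlAll() and the shared "visited" list are hypothetical names used only for illustration; run() just records the page where the original code would call webpage(page).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class PageCrawler implements Runnable {
    private final String page;          // instance field: each crawler has its own page
    private final List<String> visited; // thread-safe list, shared only to report results

    public PageCrawler(String page, List<String> visited) {
        this.page = page;
        this.visited = visited;
    }

    @Override
    public void run() {
        visited.add(page); // stand-in for webpage(page) from the original post
    }

    /** Starts one thread per page, waits for all of them, and returns the pages handled. */
    public static List<String> crawlAll(String[] pages) {
        List<String> visited = new CopyOnWriteArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (String p : pages) {
            Thread t = new Thread(new PageCrawler(p, visited));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return visited;
    }
}
```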

    3. Something else I noticed:
    >for (i = 0; pageFirst < pageLast; i++) {
    > if (nrThreads < 11) {
    > ...
    > } else
    > i--;
    >}
    This will loop forever whenever pageFirst is less than pageLast:
    pageFirst is never modified, so it can never reach pageLast. And when enough threads have been made, you decrement i. That is unnecessary now and might cause you problems later.

    EDIT:
    I'm sorry, I didn't notice that i is a class-level static variable and is not declared inside main(). Given that, decrementing i still looks wrong, but I can't say for sure that it is without reading all of your class again.
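
    One way to avoid both the endless loop and the manual thread counting is to loop over the page numbers with a local variable and hand the work to a fixed-size thread pool. This swaps in java.util.concurrent.ExecutorService, which the original code does not use. PooledCrawl and crawl() are made-up names, the page range is assumed to be inclusive, and the per-page work is a placeholder.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PooledCrawl {
    /** Processes pages pageFirst..pageLast (inclusive, an assumption) on at most maxThreads threads. */
    public static List<Integer> crawl(int pageFirst, int pageLast, int maxThreads) {
        List<Integer> done = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        // The loop variable is local and actually advances, so the loop terminates.
        for (int p = pageFirst; p <= pageLast; p++) {
            final int pageNo = p; // each task captures its own page number
            pool.execute(() -> done.add(pageNo)); // placeholder for crawling page pageNo
        }
        pool.shutdown(); // accept no new tasks; queued ones still run
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done;
    }
}
```

    The pool queues tasks beyond maxThreads by itself, so there is no need for a counter like nrThreads or for decrementing the loop variable.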
    Last edited by deschutron; 09-28-2009 at 08:07 PM. Reason: failed to notice true scope of i
