  1. #1
    JetLagFox is offline Member
    Join Date
    Jan 2017
    Posts
    12
    Rep Power
    0

    Web crawling with Jsoup doesn't scrape what I want

    Hi everyone,

    I have the following code to scrape the "href" attribute from all <a> elements on the PlayStation Store page:

    https://store.playstation.com/#!/es-...s-store%3Ahome

    Java Code:
    String url = "https://store.playstation.com/#!/es-es/ps4/cid=STORE-MSF75508-PS4CAT%7Cplatform~ps4%7Cname~asc/";
    String url2 = "?smcid=nav%3Aps-store%3Ahome";

    int juegos_totales = 0;
    ArrayList<String> all_links = new ArrayList<String>();
    int z = 0;

    for (int i = 1; i < 50; i++) {
        String urlPage = url + i + url2;

        System.out.println("Comprobando entrada: " + urlPage);

        if (getStatusConnectionCode(urlPage) == 200) {
            Document document = getHtmlDocument(urlPage);

            Elements entradas = document.select("div.gridViewportPaneWrapper li.cellGridGameStandard");

            // Walk through each entry and collect the href of every <a> inside it
            for (Element elem : entradas) {
                Elements links = elem.getElementsByTag("a");
                for (Element link : links) {
                    all_links.add(link.attr("href"));
                    juegos_totales++;
                }
                z++;
            }

            System.out.println("Hay un total de " + juegos_totales + " juegos");
        }
    }
    It scrapes nothing and I don't know why. If I try to scrape the title "PS4", it works. This code should scrape all the links on the page.

  2. #2
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    Eastern Florida
    Posts
    19,493
    Rep Power
    29


    Can you post a small, complete program that compiles, executes and shows the problem?
    If you don't understand my response, don't ignore it, ask a question.

  3. #3
    JetLagFox is offline Member
    Join Date
    Jan 2017
    Posts
    12
    Rep Power
    0


    Quote Originally Posted by Norm View Post
    Can you post a small, complete program that compiles, executes and shows the problem?
    Here you go:

    Java Code:
    public static void main (String args[]) throws IOException {
        	
        	
        	
        //<--------------- SCRAPING ALL THE LINKS TO THE XBOX ONE VIDEO GAMES
        	
    
        	
        	    String url = "https://store.playstation.com/#!/es-es/ps4/cid=STORE-MSF75508-PS4CAT%7Cplatform~ps4%7Cname~asc/";
        	    String url2 = "?smcid=nav%3Aps-store%3Ahome";
        		
            	int juegos_totales = 0;
         		
            	
            	ArrayList<String> all_links = new ArrayList<String>();
            	  	
            	int z=0;
            	
            	for (int i=1; i<50; i++) {
                    String urlPage = url+i+url2;
                    
                    System.out.println("Comprobando entrada: " + urlPage);
        			
                    if (getStatusConnectionCode(urlPage) == 200) {
                    	
                        Document document = getHtmlDocument(urlPage);
        				
                        Elements entradas = document.select("div.scenesRoot");
                        
                        for (Element elem : entradas) {
                        	
    
                        	
                    		String links = elem.getElementsByClass("ctnrTitle").text();
                            	all_links.add(links);
                            	juegos_totales++;
    
                    		z++;
    
                    	}
        				                    
                        // Walk through each entry
                        
                        
                        for (Element elem : entradas) {
                        	
    
                        	
                        		Elements links = elem.getElementsByTag("a");
                                for (Element link : links ) {
                                	all_links.add(link.attr("href"));
                                	juegos_totales++;
                                	
                                }
                        		z++;
    
                        	}
                        
                        System.out.println("Hay un total de " + juegos_totales + " juegos");
                        		
                    }
                    
            	}
            	
            // Go through the link of each base game to collect all the games that belong to it,
            // either because it is a bundle or because it is an add-on.
            	
            	int cantidad_juegos_base = all_links.size();
            	
            	for (int i=1; i< cantidad_juegos_base; i++) { 
            		
            	System.out.print(all_links.size());
            		
                String urlPage = "https://store.playstation.com/" + all_links.get(i);
                System.out.println("Comprobando entradas de: "+urlPage);
        		
                // Check whether the request returns a 200
                if (getStatusConnectionCode(urlPage) == 200) {
        			
                    // Load the page's HTML into a Document object
                    Document document = getHtmlDocument(urlPage);
        			
                    
                    Elements entradas = document.select("div#panelListingProductsListStatic-83");
        			
                    // Walk through each entry
                    for (Element elem : entradas) {
                        Elements links = elem.getElementsByTag("a");
                        for (Element link : links ) {
                        	
                        	int igual = 0;
                        	
                        	for (int j=0; j<all_links.size(); j++) {
                        		if (all_links.get(j).equalsIgnoreCase(link.attr("href"))) {
                        			igual++;
                        		}
                        	}
                        	
                        	if (igual == 0) {
                        		all_links.add(link.attr("href"));
                            	juegos_totales++;
                        	} else {
                        		igual = 0;
                        	}
                        	
                        }             
        				
                    }
                    
                    // Look for the add-ons here
                    
                    entradas = document.select("div#panelListingProductsListStatic-82");
        			
                    // Walk through each entry
                    for (Element elem : entradas) {
                        Elements links = elem.getElementsByTag("a");
                        for (Element link : links ) {
                        	
                            int igual = 0;
                        	
                        	for (int j=0; j<all_links.size(); j++) {
                        		if (all_links.get(j).equalsIgnoreCase(link.attr("href"))) {
                        			igual++;
                        		}
                        	}
                        	
                        	if (igual == 0) {
                        		all_links.add(link.attr("href"));
                            	juegos_totales++;
                        	} else {
                        		igual = 0;
                        	}
                        }
                        
        				
                    }
        	
                }
            }
        	
        	
        	
            //<--------------- END OF SCRAPING ALL THE LINKS TO THE XBOX ONE VIDEO GAMES
        	
        	
        	
            	System.out.println(all_links.size());
            	
            	String[][] listaEmpresaA = new String[all_links.size()][4];
        		
        		String [] paises = {"España"};
        		
        		int total_columnas = 2 + (paises.length*2);
        			
                 // Create the header
        		 
        		 listaEmpresaA[0][0] = "Juego";
        		 listaEmpresaA[0][1] = "URL";
        		 		 
        		 
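                 // One pair of header columns per country: base price and PSN price.
                 // (The original inner loop started at j=4 with total_columnas == 4, so it never ran.)
                 for (z = 0; z < paises.length; z++) {
                     listaEmpresaA[0][2 + 2 * z] = paises[z];
                     listaEmpresaA[0][3 + 2 * z] = paises[z] + " PSN";
                 }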
    		 
            int filas = 1; // Row counter

            // End of header creation.
    		
    		
            //<--------------- START OF SCRAPING TO GET THE PRICES FROM THE USA STORE
    		
    		int contador_juegos = 1;
    		
    		for	(String link : all_links) {
    				 
             String urlPage = "https://store.playstation.com/" + link;
             System.out.println(contador_juegos + ".- Comprobando entradas de: "+urlPage);
    			
             // Check whether the request returns a 200
             if (getStatusConnectionCode(urlPage) == 200) {
    				
                 // Load the page's HTML into a Document object
                 Document document = getHtmlDocument(urlPage);
    				
                 // Select the header block that contains the product title
                 Elements entradas = document.select("div.rightContent div.header");
    				
                 // Walk through each entry
                 for (Element elem : entradas) {
                     String titulo = elem.getElementsByClass("productTitle").text();
                
                     System.out.println(titulo+"\n");
                     listaEmpresaA[filas][0] = (titulo);
                     listaEmpresaA[filas][1] = (urlPage);
    					
                 }
                 
                 // Assuming priceCtnr is a CSS class: without the leading dot, Jsoup treats it as a tag name
                 entradas = document.select(".priceCtnr");
    				
                 // Walk through each entry
                 for (Element elem : entradas) {
                     String titulo = elem.getElementsByClass("strikePrice").text();
                     
                     if (titulo.length() == 0) {
                    	 titulo = "-";
                     }
                     
                     System.out.println("Precio base: " + titulo+"\n");
                     listaEmpresaA[filas][2] = (titulo);
                     
                     titulo = elem.getElementsByClass("buyPrice").text();
                     
                     if (titulo.length() == 0) {
                    	 titulo = "-";
                     }
                
                     System.out.println("Precio descuento: " + titulo+"\n");
                     listaEmpresaA[filas][3] = (titulo);
    					
                 }
    
    				
                 
    				
    
                 
                 filas++;
    		
             }
    		 
             contador_juegos++;
    		}
    		
             
             
             
            //<--------------- END OF SCRAPING TO GET THE PRICES FROM THE USA STORE
             
             
             for (int i=0; i<all_links.size() ; i++) {
            	 for (int j=0; j<total_columnas ; j++) {
            		 System.out.print(listaEmpresaA[i][j]);
            	 }
            	 
            	 System.out.println(" ");
             }
    	
    		
    		 try {
                 //create .xls and create a worksheet.
                 FileOutputStream fos = new FileOutputStream("D:\\todos_los_juegos.xls");
                 HSSFWorkbook workbook = new HSSFWorkbook();
                 HSSFSheet worksheet = workbook.createSheet("XboxOne");
                 
                 // Write all the information to the Excel sheet
                	int l=0;
                	
                     // Iterate over the rows
                     for (int f=0; f< all_links.size() ; f++) {
                     	HSSFRow fila = worksheet.createRow(f);
                     	
                         // Iterate over the columns
                     	for(int c=0;c<total_columnas;c++){
                     		   HSSFCell celda = fila.createCell(c);
                     		   celda.setCellValue(listaEmpresaA[f][c]);
                     		   l++;
                     		   
                     	}	   
                     }
                                   
                //Save the workbook in .xls file
                 workbook.write(fos);
                 fos.flush();
                 fos.close();
             } catch (FileNotFoundException e) {
                 e.printStackTrace();
             } catch (IOException e) {
                 e.printStackTrace();
             }
                
        }
    It's the same code I'm using for the Xbox One website, just adapted to the PlayStation one. The only error shown is on the line where I create element [0][0] of the two-dimensional array: listaEmpresaA[0][0] = "Juego"; but that is because the all_links list is empty, so the array has no rows and the element can't be created.

  4. #4
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    Eastern Florida
    Posts
    19,493
    Rep Power
    29


    Please post a complete program including import statements and the class statement.

    The only error
    Please copy the full text of the error message and paste it here. It has important info about the error.

  5. #5
    JetLagFox is offline Member
    Join Date
    Jan 2017
    Posts
    12
    Rep Power
    0


    Quote Originally Posted by Norm View Post
    Please post a complete program including import statements and the class statement.


    Please copy the full text of the error message and paste it here. It has important info about the error.
    The error is this:

    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
    at scraping.PS4.main(PS4.java:177)

    But that is because the ArrayList is empty.

  6. #6
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    Eastern Florida
    Posts
    19,493
    Rep Power
    29


    because the arraylist is empty.
    The code should test for the list being empty before trying to access it.
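    A minimal sketch of such a check, assuming a hypothetical helper named buildTable that is not part of the posted program. It fails early with a readable message when nothing was scraped, and it sizes the array with one extra row, since the posted code writes the header to row 0 and starts the data at row 1.

```java
import java.util.ArrayList;
import java.util.List;

class EmptyListGuard {

    // Hypothetical helper: builds the table (header row + one row per link).
    // Throws if the list is empty instead of failing later with an index error.
    static String[][] buildTable(List<String> allLinks) {
        if (allLinks.isEmpty()) {
            throw new IllegalStateException(
                    "No links were scraped; check the selector and the fetched HTML");
        }
        String[][] table = new String[allLinks.size() + 1][4]; // +1 for the header row
        table[0][0] = "Juego";
        table[0][1] = "URL";
        for (int i = 0; i < allLinks.size(); i++) {
            table[i + 1][1] = allLinks.get(i);
        }
        return table;
    }

    public static void main(String[] args) {
        try {
            buildTable(new ArrayList<>());
        } catch (IllegalStateException e) {
            System.out.println("Stopped early: " + e.getMessage());
        }
    }
}
```

    With a guard like this, the program reports the real problem (an empty list) rather than an ArrayIndexOutOfBoundsException further down.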

  7. #7
    Tolls is offline Moderator
    Join Date
    Apr 2009
    Posts
    13,130
    Rep Power
    23


    First off (and I was going to post this in the other thread)...you have far too much code in a single method there.
    Far too much.

    Break it down into methods of a few lines...20 at the most.
    Make sure they do something simple.
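    As a rough illustration of what such small methods could look like: the posted main builds the page URL inline and repeats the same duplicate check twice. The class and method names below (LinkCollector, buildPageUrl, addIfAbsent) are hypothetical and only sketch the idea.

```java
import java.util.ArrayList;
import java.util.List;

class LinkCollector {

    // Builds the URL for one listing page from the fixed prefix, page number and suffix.
    static String buildPageUrl(String prefix, int page, String suffix) {
        return prefix + page + suffix;
    }

    // Adds href only if no equal link (ignoring case) is already present.
    // Returns true when the link was added.
    static boolean addIfAbsent(List<String> links, String href) {
        for (String existing : links) {
            if (existing.equalsIgnoreCase(href)) {
                return false;
            }
        }
        links.add(href);
        return true;
    }

    public static void main(String[] args) {
        List<String> allLinks = new ArrayList<>();
        addIfAbsent(allLinks, "/game-one");
        addIfAbsent(allLinks, "/GAME-ONE"); // duplicate, ignored
        System.out.println(allLinks.size() + " unique link(s)");
        System.out.println(buildPageUrl("https://example.invalid/list/", 2, "?x=y"));
    }
}
```

    Each method does one thing and can be checked on its own, without fetching any page.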

    What does your listaEmpresaA represent?
    Why is it multi dimensional?

    This seems to cause you issues.
    How do you keep track of the size of the arrays?

    As for why it isn't scraping anything, you'll need to log what is going on in the code.
    Please do not ask for code as refusal often offends.

    ** This space for rent **

  8. #8
    JetLagFox is offline Member
    Join Date
    Jan 2017
    Posts
    12
    Rep Power
    0


    Quote Originally Posted by Tolls View Post
    First off (and I was going to post this in the other thread)...you have far too much code in a single method there.
    Far too much.

    Break it down into methods of a few lines...20 at the most.
    Make sure they do something simple.
    You are right, it's something I would like to change, but I find it difficult to understand how to link different classes. What I know is that they are linked through methods. I should try... let's see.


    What does your listaEmpresaA represent?
    Why is it multi dimensional?
    listaEmpresaA is a two-dimensional array because I need to store all the information for each game (price, release date, discounted price, name, genre...). That's why it is multi-dimensional.

    This seems to cause you issues.
    How do you keep track of the size of the arrays?
    The size of listaEmpresaA is fixed when it is created, and I track the rows in use with the variable filas, which means rows.

    all_links is a one-dimensional ArrayList.

    As for why it isn't scraping anything, you'll need to log what is going on in the code.
    Could be, but I don't know why I can scrape the Xbox.com site without logging in. It's difficult for me to understand. Is there any example of how to log in with Jsoup? I have read something about that and it is related to cookies.

    Thanks for your response.


    The code should test for the list being empty before trying to access it.
    You mean I should check whether the array is empty before trying to access it?

  9. #9
    Tolls is offline Moderator
    Join Date
    Apr 2009
    Posts
    13,130
    Rep Power
    23


    Quote Originally Posted by JetLagFox View Post
    You are right, it's something I would like to change, but I find it difficult to understand how to link different classes. What I know is that they are linked through methods. I should try... let's see.
    Even simply using methods would help, though that would be simply procedural coding.

    Quote Originally Posted by JetLagFox View Post
    listaEmpresaA is a two-dimensional array because I need to store all the information for each game (price, release date, discounted price, name, genre...). That's why it is multi-dimensional.
    So write a java class that represents a game.
    Give it fields that match that data.
    Then you'd only need a List<GameData>, which removes one (large) source of bugs: tracking the number of columns.
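    A minimal sketch of that idea. The field names (title, url, basePrice, discountPrice) are assumptions based on the four columns the posted code fills in; the real class would carry whatever data the scraper collects.

```java
import java.util.ArrayList;
import java.util.List;

class GameDataDemo {
    public static void main(String[] args) {
        // The list grows as games are found, so there is no row counter to maintain.
        List<GameData> games = new ArrayList<>();
        games.add(new GameData("Example Game", "https://example.invalid/game", "59.99", "-"));
        for (GameData game : games) {
            System.out.println(game);
        }
    }
}

// One object per game instead of one row in a String[][].
final class GameData {
    private final String title;
    private final String url;
    private final String basePrice;
    private final String discountPrice;

    GameData(String title, String url, String basePrice, String discountPrice) {
        this.title = title;
        this.url = url;
        this.basePrice = basePrice;
        this.discountPrice = discountPrice;
    }

    String getTitle() { return title; }
    String getUrl() { return url; }
    String getBasePrice() { return basePrice; }
    String getDiscountPrice() { return discountPrice; }

    @Override
    public String toString() {
        return title + " | " + url + " | " + basePrice + " | " + discountPrice;
    }
}
```

    Named fields replace column indexes such as [filas][2], so a misplaced index can no longer silently put a price in the wrong column.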


    Quote Originally Posted by JetLagFox View Post
    Could be, but I don't know why I can scrape the Xbox.com site without logging in. It's difficult for me to understand. Is there any example of how to log in with Jsoup? I have read something about that and it is related to cookies.
    No two sites are the same.
    To scrape a site you need to understand what it is doing, and to do that you need to see what calls it makes to the server.
    That's what the developer tools on all major browsers help you to do.
    A lot of sites put barriers in the way to try and stop this sort of thing.
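    One thing that may be worth checking here, offered as an assumption rather than a confirmed diagnosis: everything after the # in a URL is a fragment, and HTTP clients do not send the fragment to the server. The sketch below uses only java.net.URI to show which part of a URL the server would see; serverVisiblePart is a hypothetical helper.

```java
import java.net.URI;

class FragmentCheck {

    // Returns the URL without its fragment, i.e. the part an HTTP request carries.
    static String serverVisiblePart(String url) {
        URI uri = URI.create(url);
        String query = uri.getRawQuery();
        return uri.getScheme() + "://" + uri.getRawAuthority() + uri.getRawPath()
                + (query == null ? "" : "?" + query);
    }

    public static void main(String[] args) {
        String prefix = "https://store.playstation.com/#!/es-es/ps4/"
                + "cid=STORE-MSF75508-PS4CAT%7Cplatform~ps4%7Cname~asc/";
        String suffix = "?smcid=nav%3Aps-store%3Ahome";
        for (int page = 1; page <= 3; page++) {
            // Print what the server would receive for each page URL from the posted loop
            System.out.println(serverVisiblePart(prefix + page + suffix));
        }
    }
}
```

    If every page URL reduces to the same server-visible address, the network tab of the browser's developer tools is the place to look for the requests the page's JavaScript actually makes.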

    One thing you can do is simply see what JSoup has pulled up.
    At the moment you are flying blind, in that you have no logging in your code to show you what is happening.
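    A small sketch of the kind of logging meant here, using only the standard library. ScrapeLog, summarize and countOccurrences are hypothetical names. In the posted program the HTML string could come from document.outerHtml() and the match count from entradas.size().

```java
import java.util.logging.Logger;

class ScrapeLog {
    private static final Logger LOG = Logger.getLogger(ScrapeLog.class.getName());

    // One log line per fetched page: how much HTML came back and what the selector found.
    static String summarize(String url, String html, int selectorMatches) {
        return String.format("url=%s htmlLength=%d anchorTags=%d selectorMatches=%d",
                url, html.length(), countOccurrences(html, "<a "), selectorMatches);
    }

    // Counts non-overlapping occurrences of needle in text.
    static int countOccurrences(String text, String needle) {
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"/x\">x</a></body></html>";
        LOG.info(summarize("https://example.invalid/page", html, 0));
    }
}
```

    A page that returns plenty of HTML but zero selector matches points at the selector or at content that is built later by JavaScript; a very short page points at the request itself.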
