  1. #1
    sahil.ansari

    Search Engine, Web Crawler

    Hi,

    I am trying to build a search engine in Java.

    I have written a web crawler which, given an initial (starting) URL:

    1. Goes to that URL.
    2. Retrieves all the URLs present on that web page.
    3. Stores them in URL_Database_file.
    4. After finishing with that URL, retrieves the next URL automatically from URL_Database_file, then repeats steps 1-4.
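
    The four steps above can be sketched roughly as follows. This is only a minimal sketch under assumptions: an in-memory queue stands in for URL_Database_file, and linkSource is a hypothetical stand-in for "download the page and return the URLs found on it", so the loop can be tried without a network connection.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class CrawlLoopSketch {
    /**
     * Visits pages breadth-first, starting from startUrl.
     * linkSource is a hypothetical helper: given a URL, it returns the URLs on that page.
     * maxPages is an assumed safety limit so the loop always terminates.
     */
    static List<String> crawl(String startUrl,
                              Function<String, List<String>> linkSource,
                              int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(); // plays the role of URL_Database_file
        Set<String> seen = new HashSet<>();          // avoids queuing the same URL twice
        List<String> visited = new ArrayList<>();

        frontier.add(startUrl);
        seen.add(startUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();             // step 4: take the next stored URL
            visited.add(url);                         // step 1: "go to" that URL
            for (String link : linkSource.apply(url)) { // step 2: URLs on the page
                if (seen.add(link)) {
                    frontier.add(link);               // step 3: store new URLs
                }
            }
        }
        return visited;
    }
}
```

    The seen set is a design choice rather than something described above: without it, two pages that link to each other would keep the crawler going round in circles.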


    Now I want some guidance on how to retrieve the words from a web page's HTML code, e.g. <p>Sahil Ansari needs Help</p>, and then store them in a file.

  2. #2
    Jeremy

    You could read the HTML file in from some sort of InputStream and parse it from there.
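
    A minimal sketch of that idea, assuming the page is encoded as UTF-8 (real pages may declare a different charset). PageReader and readAll are names made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class PageReader {
    /** Reads the whole stream into one String, line breaks included. */
    static String readAll(InputStream in) {
        StringBuilder page = new StringBuilder();
        // Assumption: the bytes are UTF-8 text.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                page.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return page.toString();
    }
}
```

    For a live page the stream could come from java.net.URL.openStream(). Keeping the whole page in one String, rather than going line by line, makes the later parsing step simpler.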

  3. #3
    Nicholas Jordan

    Pattern :: Matcher

    This is a natch (naturally acclimatized) problem domain for regex. I need regex practice and will gladly help. For starters, try making a Pattern to look for > and <.

    You will get stuck right away on what are called escape sequences. Doing good regex patterns for HTML work is substantially harder than drinking coffee.
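
    As a rough illustration of "look for > and <", and of where the escape sequences bite, here is a minimal sketch. BetweenTags and words are hypothetical names, and the approach is deliberately naive: it only sees text that sits between a > and a <, and it does not know about scripts, comments or entities.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenTags {
    // A '>' followed by one or more characters that are not angle brackets, then a '<'.
    private static final Pattern TEXT = Pattern.compile(">([^<>]+)<");

    /** Returns the words found between a closing '>' and the next opening '<'. */
    static List<String> words(String html) {
        List<String> words = new ArrayList<>();
        Matcher matcher = TEXT.matcher(html);
        while (matcher.find()) {
            String chunk = matcher.group(1).trim();
            if (chunk.isEmpty()) {
                continue; // only whitespace between two tags
            }
            // Escape sequence at work: the regex \s+ must be written "\\s+" in Java
            // source, because the compiler consumes one level of backslashes.
            for (String word : chunk.split("\\s+")) {
                words.add(word);
            }
        }
        return words;
    }
}
```

    For example, BetweenTags.words("<p>Sahil Ansari needs Help</p>") gives the four words Sahil, Ansari, needs, Help.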
    Introduction to Programming Using Java.
    Cybercartography: A new theoretical construct proposed by D.R. Fraser Taylor

  4. #4
    sahil.ansari

    Hi Nicholas Jordan,

    Well, I have already done what you said (matching > and <). Even after doing this we still need to separate the words individually; it's not easy, but I have done that too.
    There will be a problem if a tag opens and closes on different lines:
    <p>.....
    .....</p>
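
    One way around the different-lines problem, sketched under the assumption that the whole page has been read into a single String instead of being processed line by line: compile the pattern with Pattern.DOTALL so that '.' also matches line breaks. MultiLineParagraphs is a made-up name, and the pattern only handles a bare <p> without attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiLineParagraphs {
    // DOTALL: '.' matches line terminators too, so <p> and </p> may be on different lines.
    // CASE_INSENSITIVE: also accepts <P>...</P>.
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the content of each <p>...</p> pair with runs of whitespace collapsed. */
    static List<String> paragraphs(String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = PARAGRAPH.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group(1).trim().replaceAll("\\s+", " "));
        }
        return found;
    }
}
```

    Without the DOTALL flag the same pattern finds nothing in "<p>a\nb</p>", which is the symptom described above.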

    I am looking for a more sophisticated tool or method, if possible.
    This was my final-year B.Tech (Computer Science & Engineering) project, submitted in May 2008.

    Now I am just trying to improve upon it.
    I have an idea; please tell me if it's possible: a web browser like Internet Explorer also reads HTML code and displays the content, so it has already done all the hard work of handling > and <.

    Now, instead of displaying it, if we can make it write to a file, we will get what we want.
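
    Something close to this idea may be possible without a browser: the JDK ships a lenient HTML parser in javax.swing.text.html that hands the text to a callback instead of displaying it. The following is only a sketch under assumptions. SwingTextExtractor is a made-up name, the pieces of text are simply joined with spaces, and how this parser treats scripts, styles and very broken markup has not been checked here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingTextExtractor {
    /** Collects the text the parser reports, joined with single spaces. */
    static String visibleText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called by the parser for each run of text it finds between tags.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // true = ignore any charset declared in the document (input is already a String).
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

    The returned String could then be written to a file, for instance with java.nio.file.Files.write, which is the "write to file instead of displaying" step.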
    Last edited by sahil.ansari; 07-21-2008 at 02:29 AM.

  5. #5
    sahil.ansari

    Hi,
    Parsing is not easy, because HTML code on the web does not follow a fixed syntax:

    New tags are added to HTML.
    Tags may not be closed.
    Opening and closing tags may be on different lines.
    Tags may be embedded within other tags up to N levels deep.

    Read my reply to Nicholas Jordan above.

  6. #6
    Nicholas Jordan

    first try

    This code here was created using Jeffrey E.F. Friedl's Mastering Regular Expressions (O'Reilly). This is a first scratch, and no attempt was made to compile it. Regexes are how it's done. Get over it.
    Java Code:
    import java.net.URL;
    import java.util.LinkedList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LinkGrabber
    {
        private int LinkCount = 0;
        // Declared as LinkedList (not List) so that addFirst() is available.
        private LinkedList<URL> EarlsURLs = new LinkedList<URL>();
        // Matches one HTML tag, allowing '>' inside quoted attribute values.
        private Pattern PickerPattern;

        public LinkGrabber(URL firstURL)
        {
            EarlsURLs.addFirst(firstURL);
            // Additional work later possibly.
            PickerPattern = Pattern.compile("<(\"[^\"]*\"|'[^']*'|[^'\">])*>");
        }

        public void parsePage()
        {
            // Not implemented yet.
        }
    }
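
    To show what a tag pattern of this kind can be used for, here is a minimal sketch that replaces every tag with a space and keeps the remaining text. It assumes the Friedl-style pattern with a double quote at the start of the first alternative; TagStripper and stripTags are hypothetical names, not part of LinkGrabber.

```java
import java.util.regex.Pattern;

class TagStripper {
    // One tag: '<', then any mix of "double-quoted", 'single-quoted' or plain
    // characters, then '>'. Quoted attribute values may contain '>' safely.
    private static final Pattern TAG =
            Pattern.compile("<(\"[^\"]*\"|'[^']*'|[^'\">])*>");

    /** Replaces each tag with a space, then collapses runs of whitespace. */
    static String stripTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }
}
```

    Because the character classes also match line breaks, a tag that is split over two lines is removed as well. Script and style content, comments and entities such as &amp; are left as they are, so this is a starting point rather than a full parser.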
