  1. #1
    Rawdogz is offline Member
    Join Date
    Mar 2013
    Posts
    11
    Rep Power
    0

    Default Removing Certain Punctuation Char

    Hi all, I need a little help figuring out how to remove certain characters from my input file.

    The following characters have to be removed:

    i. End of line characters . ? !

    ii. Separators: , : ; / \ |

    iii. Double quotes

    iv. Special characters: ^ * + = _

    v. Grouping characters ( ) { } [ ] < >

    vi. Single quotes (only if the LAST character of a token)

    The following characters must not be removed:
    i. Numbers

    ii. Apostrophes (for contractions or possession, like isn’t or Dave’s)

    iii. Dashes (for word conjunction, like side-by-side)

    iv. Special characters: @ # $ %


    From what I have researched, .replace is quicker than .replaceAll and easier to use. This is what I have come up with so far:

    String updated = s.replace("[,.?!]",""); // only did the first line as a tester
    String updated2 = updated.replace("."," ");

    I am able to remove the . and , but not the other characters. This is what the input text says:

    This is my
    file, yes my file

    My file.???! : ||\?



    any help would be appreciated

  2. #2
    jim829 is offline Senior Member
    Join Date
    Jan 2013
    Location
    Northern Virginia, United States
    Posts
    3,534
    Rep Power
    5

    Default Re: Removing Certain Punctuation Char

    Some characters need to be escaped because they are special to the regex engine. If you are uncertain which ones, just escape them all.

    Java Code:
    String a = "[a[b?c!d,e.f]";
    String b = a.replaceAll("[\\[\\]\\?\\!\\,\\.]+","");
    System.out.println(b + " " + b.length());
    Escaping them requires two backslashes. To escape, say, [ in a regex you need to write \[. But because the Java compiler consumes the first backslash as part of its own string-literal escape syntax, you need to write \\ in the source to pass a single \ to the regex engine. All the characters go between [ ] followed by a +, which means match one or more of any of the characters within the square brackets.
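A minimal sketch of why the original attempt removed nothing: String.replace treats its first argument as literal text, while String.replaceAll treats it as a regex. The class and method names here (ReplaceDemo, literal, regex) are made up for illustration and are not from the original post.

```java
final class ReplaceDemo {
    // String.replace: the first argument is plain text, so this looks for
    // the six-character sequence [,.?!] and finds nothing in normal prose.
    static String literal(String s) {
        return s.replace("[,.?!]", "");
    }

    // String.replaceAll: the first argument is a regex, so [,.?!] is a
    // character class that matches any one of the four characters.
    static String regex(String s) {
        return s.replaceAll("[,.?!]", "");
    }

    public static void main(String[] args) {
        String line = "My file.???! yes, mine";
        System.out.println(literal(line)); // My file.???! yes, mine
        System.out.println(regex(line));   // My file yes mine
    }
}
```

So for a set of characters like this one, replaceAll (or a chain of single-character replace calls) is the tool that fits.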

    Regards,
    Jim
    The Java™ Tutorial | SSCCE | Java Naming Conventions
    Poor planning on your part does not constitute an emergency on my part.

  3. #3
    Rawdogz is offline Member
    Join Date
    Mar 2013
    Posts
    11
    Rep Power
    0

    Default Re: Removing Certain Punctuation Char

    ahh got it, you are the person!!! :)

  4. #4
    Rawdogz is offline Member
    Join Date
    Mar 2013
    Posts
    11
    Rep Power
    0

    Default Re: Removing Certain Punctuation Char

    I ran into a little snag when I entered everything. This is how I did it:


    Java Code:
    String updated = s.replace(".,?!:;/\\|^*+=_(){}[]<>");
                 String updated2 = updated.replaceAll("[\\.\\,\\?\\!\\:\\;\\/\\\\\|\\^\\*\\+\\=\\_\\(\\)\\{\\}\\[\\]\\<\\>]+"," ");
    and this is what i got

    1 error found:
    Error: Invalid escape sequence (valid ones are \b \t \n \f \r \" \' \\ )

  5. #5
    Rawdogz is offline Member
    Join Date
    Mar 2013
    Posts
    11
    Rep Power
    0

    Default Re: Removing Certain Punctuation Char

    I got it now. The only one I am still having a problem with is \\"

  6. #6
    jim829 is offline Senior Member
    Join Date
    Jan 2013
    Location
    Northern Virginia, United States
    Posts
    3,534
    Rep Power
    5

    Default Re: Removing Certain Punctuation Char

    To match a backslash, the regex needs \\, and to write that in a Java string literal you have to type \\\\.
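A small sketch of the four-backslash rule, assuming the goal is simply to delete backslashes (and, in the second call, the | separator as well). BackslashDemo and stripBackslashes are illustrative names only.

```java
final class BackslashDemo {
    // The Java literal "\\\\" is a two-character string (two backslashes),
    // which the regex engine reads as "one literal backslash".
    static String stripBackslashes(String s) {
        return s.replaceAll("\\\\", "");
    }

    public static void main(String[] args) {
        String input = "a\\b|c"; // five characters: a, backslash, b, |, c
        System.out.println(input.length());          // 5
        System.out.println(stripBackslashes(input)); // ab|c
        // The same rule applies inside a character class; | needs no escape there.
        System.out.println(input.replaceAll("[\\\\|]", "")); // abc
    }
}
```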

    Regards,
    Jim

  7. #7
    kalata is offline Member
    Join Date
    Aug 2011
    Location
    Bulgaria
    Posts
    29
    Rep Power
    0

    Default Re: Removing Certain Punctuation Char

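    Actually, when using .replaceAll(...) the "+" at the end of the regex is not needed. Also, when the symbols are between square brackets, the only characters from your list that need to be escaped for the regex engine are [ , ] and \ (a " must still be escaped for the Java string literal; ^ is special only as the first character in the brackets, and - only between two characters).

    This snippet is a slightly modified version of the example that Jim gave: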
    Java Code:
    String a = "My test string.,?!:;/\\|^\"\\*+=_(){}[]<>.,?!:;/\\|^\"\\*+=_(){}[]<>";
    String b = a.replaceAll("[.,?!:;/\\\\`'\"|^*+=_(){}\\[\\]<>]","");
    System.out.println(b + " " + b.length());
    BR,
    Kalin

  8. #8
    jim829 is offline Senior Member
    Join Date
    Jan 2013
    Location
    Northern Virginia, United States
    Posts
    3,534
    Rep Power
    5

    Default Re: Removing Certain Punctuation Char

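    It's not needed, but it is more efficient because it replaces whole runs of matching characters instead of one character at a time. When replacing with the empty string, the result is the same. When replacing with a non-empty string, the results differ: with the + a run of punctuation becomes one replacement, without it every character gets its own.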
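To make the difference concrete, here is a short sketch comparing the two patterns on the same input. QuantifierDemo and its method names are hypothetical and only serve the illustration.

```java
final class QuantifierDemo {
    // With +, a whole run of punctuation is one match.
    static String withPlus(String s, String replacement) {
        return s.replaceAll("[.,?!]+", replacement);
    }

    // Without +, every punctuation character is its own match.
    static String withoutPlus(String s, String replacement) {
        return s.replaceAll("[.,?!]", replacement);
    }

    public static void main(String[] args) {
        String s = "file.???!yes";
        // Empty replacement: both give "fileyes".
        System.out.println(withPlus(s, "").equals(withoutPlus(s, ""))); // true
        // Space replacement: one space versus five spaces between the words.
        System.out.println("[" + withPlus(s, " ") + "]");    // [file yes]
        System.out.println("[" + withoutPlus(s, " ") + "]"); // [file     yes]
    }
}
```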

    Regards,
    Jim

  9. #9
    Rawdogz is offline Member
    Join Date
    Mar 2013
    Posts
    11
    Rep Power
    0

    Default Re: Removing Certain Punctuation Char

    I hit a wall again and need a little help. I can't figure out the last two steps in this problem or where to put them in my code.



    6. Each ArrayList entry will have two parts. The first part is the scrubbed token (word) and the second part is the count of how many times this word appears in the file
    for example: if the token ‘hello’ has been seen 4 times, the ArrayList entry would be:
    hello *4
    if the token ‘help’ is being newly inserted into the ArrayList, the entry should be:
    help *1

    a. This means that only the first part of each ArrayList entry will be used to determine the sort ordering

    b. When deciding where to insert a new token in the ArrayList, walk the ArrayList from the front until an insert point is reached. If the same token is found in the ArrayList, don’t insert a new element, just increment the second part. For example: if the current token is ‘help’ and ‘help *5’ is already in the ArrayList, ‘help *5’ should be updated to ‘help *6’

    7. After all tokens are read from the original file, a new file should be written with each ArrayList entry on a single line. The new file should be named: <Original file name>_sorted.txt

    If the original file contained:

    This is my
    file, yes my file

    My file.



    The output file should contain:

    file *3
    is *1
    my *3
    this *1
    yes *1



    here is my code so far :

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.InputStreamReader;
    import java.io.PrintStream;
    import java.util.ArrayList;
    import java.util.Collections;

    public class Extra {

    public Extra() {

    }


    public static void main(String[] args) {

    try
    {
    ArrayList myArraylist = new ArrayList();

    System.out.println("Please Enter file");

    InputStreamReader istream = new InputStreamReader(System.in) ;

    BufferedReader bufRead = new BufferedReader(istream) ;
    String fileName = bufRead.readLine();

    BufferedReader file = new BufferedReader(new FileReader(fileName));
    while(true)
    {
    String s = file.readLine();

    if (s == null)
    {
    break;
    }

    String updated = s.replace(".,?!:;/\\|^*+=_(){}[]<>","");
    String updated2 = updated.replaceAll("[\\.\\,\\?\\!\\:\\;\\/\\|\\\\\\^\\*\\+\\=\\_\\(\\)\\{\\}\\[\\]\\<\\>\"]+"," ");


    //note: missing Single quotes (only if the LAST character of a token)

    System.out.println( updated2.toLowerCase());
    myArraylist.add(updated2.toLowerCase());

    String word = updated2.toLowerCase();


    }

    //public static void

    Collections.sort(myArraylist);

    String outPutFileName = fileName + "sorted.txt";

    // i believe the counter should/would go here************


    PrintStream ps = new PrintStream( outPutFileName );

    ps.print(myArraylist.toString());

    ps.flush();

    ps.close();


    }

    catch (Exception e){
    System.out.println(e.toString());
    }
    }


    /*

    public int countWords(int[] word ){


    int count = 0;

    for(int i=0;i<word.length;i++){
    word = words[i];
    for (int j=0; i<words.length;j++){
    if(words[j+1].equals(word)){
    count++;
    }
    }
    }

    }

    */


    }

    I appreciate the help.
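The insertion rule in step 6b (walk the list from the front, increment on a match, otherwise insert at the first larger entry) could be sketched roughly as follows. This assumes entries are stored as plain strings of the form "word *count" and that tokens are already scrubbed and lower-cased; OrderedInsertSketch and addToken are invented names, not part of the assignment.

```java
import java.util.ArrayList;
import java.util.List;

final class OrderedInsertSketch {
    // Walks the list from the front. If the token is found, its count is
    // incremented; otherwise "token *1" is inserted at the sorted position.
    static void addToken(List<String> entries, String token) {
        for (int i = 0; i < entries.size(); i++) {
            String entry = entries.get(i);
            int sep = entry.lastIndexOf(" *");
            String word = entry.substring(0, sep); // only the first part decides the order
            int cmp = token.compareTo(word);
            if (cmp == 0) {
                int count = Integer.parseInt(entry.substring(sep + 2));
                entries.set(i, word + " *" + (count + 1));
                return;
            }
            if (cmp < 0) {
                entries.add(i, token + " *1"); // insert point reached
                return;
            }
        }
        entries.add(token + " *1"); // larger than everything seen so far
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<String>();
        for (String t : "this is my file yes my file my file".split(" ")) {
            addToken(entries, t);
        }
        System.out.println(entries); // [file *3, is *1, my *3, this *1, yes *1]
    }
}
```

Because the list is kept sorted on every insert, no separate Collections.sort call is needed in this variant.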

  10. #10
    kalata is offline Member
    Join Date
    Aug 2011
    Location
    Bulgaria
    Posts
    29
    Rep Power
    0

    Default Re: Removing Certain Punctuation Char

    Hi,

    @Jim - I did some more test of what you said and I do see the difference, thanks for the lesson :)

    @Rawdogz - I have several observations:
    1. String updated = s.replace(".,?!:;/\\|^*+=_(){}[]<>",""); - you don't need this line; it will replace only this substring ".,?!:;/\\|^*+=_(){}[]<>", but it won't do anything if you have for example ".,?!:;/\\| some text ^*+=_(){}[]<>" and the regex in the next line covers that case also.
    The next line - String updated2 = updated.replaceAll("[\\.\\,\\?\\!\\:\\;\\/\\|\\\\\\^\\*\\+\\=\\_\\(\\)\\{\\}\\[\\]\\<\\>\"]+"," "); - should become as follows:
    String updated2 = s.replaceAll("[\\.\\,\\?\\!\\:\\;\\/\\|\\\\\\^\\*\\+\\=\\_\\(\\)\\{\\}\\[\\]\\<\\>\"]+"," ");

    2. ArrayList myArraylist = new ArrayList(); should be like this: ArrayList<String> myArraylist = new ArrayList<String>();
    Basically you say this: "My ArrayList doesn't contain objects of some random type, it contains strings". (have a look at Generics for more detailed info on the topic :) )

    3. String word = updated2.toLowerCase(); - this line is redundant; the variable is never actually used :)

    4. I would change this:
    Java Code:
    while(true) 
    {
    String s = file.readLine();
    
    if (s == null)
    {
    break;
    }
    // ...
    }
    into that:

    Java Code:
    String s = null;
    while((s = file.readLine()) != null) {
    // ...
    }
    Don't get me wrong - the way you did it works fine, but as far as I'm concerned infinite loops are not good practice (of course there are exceptions to that). :)

    5. Once each line is split into words and the list is sorted, myArraylist will look like this:
    file, file, file, is, my, my, my, this, yes

    You're right that the counting should go there. Can you see what you have to do to get the wanted result? :)

    6. The commented-out method returns a single integer, which would be the count for one word. So you want to pass the word as a parameter, and also the list, because you will count how many times the given word is present in it (List needs import java.util.List;).
    The method should look like this:
    Java Code:
    public static int countWords(String word, List<String> wordsList) {
    int result = 0;
    // your counting goes here
    return result;
    }
    BR,
    Kalin

    P.S. I may not have explained my points very well, so if you have any questions, I'll gladly answer :)
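One way the countWords outline from point 6 might be filled in, as a sketch only: a single loop over the list, comparing each element with the word. The wrapper class name CountWordsSketch and the sample list are assumptions for the demo.

```java
import java.util.Arrays;
import java.util.List;

final class CountWordsSketch {
    // Counts how many elements of wordsList are equal to word.
    static int countWords(String word, List<String> wordsList) {
        int result = 0;
        for (String candidate : wordsList) {
            if (candidate.equals(word)) {
                result++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sorted = Arrays.asList(
                "file", "file", "file", "is", "my", "my", "my", "this", "yes");
        System.out.println(countWords("my", sorted));   // 3
        System.out.println(countWords("yes", sorted));  // 1
        System.out.println(countWords("nope", sorted)); // 0
    }
}
```

The standard library's Collections.frequency(wordsList, word) does the same job, but writing the loop by hand is probably what the exercise is after.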

  11. #11
    Rawdogz is offline Member
    Join Date
    Mar 2013
    Posts
    11
    Rep Power
    0

    Default Re: Removing Certain Punctuation Char

    Thank you very much for your help and the clean up suggestions.
    I think your explanation is good; it's more my understanding of it all :)

    but i think i broke my program now :(


    Java Code:
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.InputStreamReader;
    import java.io.PrintStream;
    import java.util.ArrayList;
    import java.util.Collections;
    
    public class cleanup {
    
     public cleanup() {
      
     }
    
    
     public static void main(String[] args) {
      
      try
      {
       ArrayList myArraylist = new ArrayList();
       
       System.out.println("Please Enter file");
       
       InputStreamReader istream = new InputStreamReader(System.in) ;
     
       BufferedReader bufRead = new BufferedReader(istream) ;
       String fileName = bufRead.readLine();
     
               BufferedReader file = new BufferedReader(new FileReader(fileName));
                         
               String s = null;
               while((s = file.readLine()) != null) {
                            
                 String updated2 = s.replaceAll("[\\.\\,\\?\\!\\:\\;\\/\\|\\\\\\^\\*\\+\\=\\_\\(\\)\\{\\}\\[\\]\\<\\>\"]+"," ");  
                  
    
                 //note: missing Single quotes (only if the LAST character of a token)
                 
                 System.out.println( updated2.toLowerCase());
                 myArraylist.add(updated2.toLowerCase());
                           
             }
                
               Collections.sort(myArraylist);
               
               String outPutFileName =  fileName + "sorted.txt";
               
               
               
      public static int countWords(String word, List<String> wordsList) {
        int result = 0;
        for(int i=0;i<word.length;i++){
                   word = words[i];
                   for (int j=0; i<words.length;j++){
                     if(words[j+1].equals(word)){
                       count++;
                     }
        return result;
                }
           }
      }
               
     
    
    //public static void          
               
               PrintStream ps = new PrintStream( outPutFileName );
    
               ps.print(myArraylist.toString());
    
               ps.flush();
    
               ps.close();
               
            
      }
      
       catch (Exception e){
                   System.out.println(e.toString());
             }
     }
    
    
     
     
    }

  12. #12
    kalata is offline Member
    Join Date
    Aug 2011
    Location
    Bulgaria
    Posts
    29
    Rep Power
    0

    Default Re: Removing Certain Punctuation Char

    Hi, change ArrayList myArraylist = new ArrayList(); to ArrayList<String> myArraylist = new ArrayList<String>();.
    Next, you define a method inside the body of another method, which is illegal (and that breaks your application).
    Your class should look like this:
    Java Code:
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.InputStreamReader;
    import java.io.PrintStream;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
     
    public class cleanup {
     
        public cleanup() {
       
        }
    
        public static void main(String[] args) {
            //...
        }
    
        public static int countWords(String word, List<String> wordsList) {
            //...
        }
    }
    Also have a careful look at how you do the counting in the method countWords. (Hints: you are using objects that are not declared anywhere, and you need only one loop.)

    BR,
    Kalin

  13. #13
    Rawdogz is offline Member
    Join Date
    Mar 2013
    Posts
    11
    Rep Power
    0

    Default Re: Removing Certain Punctuation Char

    I did some reading, and people are saying a StringTokenizer is better to use for the counter. Here is my attempt at it, based on my understanding of how it works. The program is still not working, though.

    [code]import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.InputStreamReader;
    import java.io.PrintStream;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.*;


    public class cleanup {

    public cleanup() {
    }



    public static void main(String[] args) {

    try
    {
    ArrayList myArraylist = new ArrayList();

    System.out.println("Please Enter file");

    InputStreamReader istream = new InputStreamReader(System.in) ;

    BufferedReader bufRead = new BufferedReader(istream) ;
    String fileName = bufRead.readLine();

    BufferedReader file = new BufferedReader(new FileReader(fileName));

    String s = null;
    while((s = file.readLine()) != null) {

    String updated2 = s.replaceAll("[\\.\\,\\?\\!\\:\\;\\/\\|\\\\\\^\\*\\+\\=\\_\\(\\)\\{\\}\\[\\]\\<\\>\"]+"," ");


    //note to self: missing Single quotes (only if the LAST character of a token)

    System.out.println( updated2.toLowerCase());
    myArraylist.add(updated2.toLowerCase());

    StringTokenizer st = new StringTokenizer(updated2.toLowerCase());
    while (st.hasMoreTokens()) {
    String nextToken = st.nextToken();

    String myKeyValue = (String)myMap.get(nextToken);
    if(myKeyValue == null)
    {
    myMap.put(nextToken, "1");
    }
    else
    {
    int mycount = Integer.parseInt(myKeyValue) + 1;
    myMap.put(nextToken, String.valueOf(mycount));
    }
    System.out.println(nextToken);
    }

    Collections.sort(myArraylist);

    String outPutFileName = fileName + "sorted.txt";



    PrintStream ps = new PrintStream( outPutFileName );

    ps.print(myArraylist.toString());

    ps.flush();

    ps.close();


    }

    catch (Exception e){
    System.out.println(e.toString());
    }
    }





    }



    }
    [\code]

  14. #14
    kalata is offline Member
    Join Date
    Aug 2011
    Location
    Bulgaria
    Posts
    29
    Rep Power
    0

    Default Re: Removing Certain Punctuation Char

    Hi, what IDE are you using? It seems to me that you are using Notepad++ or another simple text editor :) IDEs like Eclipse will point out the compilation errors for you :)

    StringTokenizer can be faster than regex-based splitting, which is why some consider it better, although its Javadoc describes it as a legacy class and recommends String.split for new code.
    Now to the compilation errors...
    Yet again you are using an undeclared object (myMap). Maybe you intended it to be a HashMap<String, String>, but a TreeMap<String, Integer> is a better choice in your case.
    Now add the declaration and initialization of the map before the inner while-loop:

    Java Code:
    TreeMap<String, Integer> myMap = new TreeMap<String, Integer>();
    StringTokenizer st = new StringTokenizer(updated2.toLowerCase());
    Integer is the wrapper class for the primitive type int, and thanks to autoboxing you can use it like an int. Using Integer gets rid of the parsing, which is slow, and of converting the int back to a string.
    So this:
    Java Code:
    if(myKeyValue == null)
    {
    myMap.put(nextToken, "1");
    }
    else
    {
    int mycount = Integer.parseInt(myKeyValue) + 1;
    myMap.put(nextToken, String.valueOf(mycount));
    }
    becomes that:
    Java Code:
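    // myKeyValue must now be declared as an Integer, not a String:
    Integer myKeyValue = myMap.get(nextToken);
    if(myKeyValue == null)
    {
    myMap.put(nextToken, 1);
    }
    else
    {
    myMap.put(nextToken, myMap.get(nextToken) + 1);
    }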
    The difference between HashMap and TreeMap is that TreeMap keeps its keys sorted. When the no-argument constructor is used, the TreeMap orders the keys by their natural ordering. So you won't need Collections.sort(myArraylist); anymore, and you won't need that ArrayList either.
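A compact sketch of the TreeMap approach described above, assuming the lines have already been scrubbed of punctuation. TreeMapCountSketch and count are illustrative names; the get/put pattern mirrors the snippet in this post.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class TreeMapCountSketch {
    // Counts lower-cased, whitespace-separated tokens. The TreeMap keeps
    // its keys in natural (alphabetical) order, so no sort call is needed.
    static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            StringTokenizer st = new StringTokenizer(line.toLowerCase());
            while (st.hasMoreTokens()) {
                String token = st.nextToken();
                Integer seen = counts.get(token); // null if the token is new
                counts.put(token, seen == null ? 1 : seen + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("This is my", "file  yes my file", "", "My file "));
        // {file=3, is=1, my=3, this=1, yes=1}
    }
}
```

Note that the map is created once, outside the per-line loop, so the counts accumulate over the whole input.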

    And you need to move one of the curly braces from below the catch block to above it:
    Before:
    Java Code:
    }
    
    catch (Exception e){
    System.out.println(e.toString());
    }
    }
    Now:
    Java Code:
    }
    }
    catch (Exception e){
    System.out.println(e.toString());
    
    }
    BR,
    Kalin
    Last edited by kalata; 06-05-2013 at 02:34 PM. Reason: words eater..

  15. #15
    Rawdogz is offline Member
    Join Date
    Mar 2013
    Posts
    11
    Rep Power
    0

    Default Re: Removing Certain Punctuation Char

    First, let me say thank you for sticking with this noob :)

    And you lost me with this TreeMap stuff.

    I made the changes so it looks like this now, and I am using DrJava.


    Java Code:
    import java.io.FileReader;
    import java.io.InputStreamReader;
    import java.io.PrintStream;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.*;
    
    
    public class cleanup1 {
    
    public cleanup1() {
    }
    
    public static void main(String[] args) {
    
    try
    {
    ArrayList myArraylist = new ArrayList();
    
    System.out.println("Please Enter file");
    
    InputStreamReader istream = new InputStreamReader(System.in) ;
    
    BufferedReader bufRead = new BufferedReader(istream) ;
    String fileName = bufRead.readLine();
    
    BufferedReader file = new BufferedReader(new FileReader(fileName));
    
    String s = null;
    while((s = file.readLine()) != null) {
    
    String updated2 = s.replaceAll("[\\.\\,\\?\\!\\:\\;\\/\\|\\\\\\^\\*\\+\\=\\_\\(\\)\\{\\}\\[\\]\\<\\>\"]+"," ");
    
    
    //note to self: missing Single quotes (only if the LAST character of a token)
    
    System.out.println( updated2.toLowerCase());
    myArraylist.add(updated2.toLowerCase());
    
    myMap= 0;
    
    TreeMap<String, Integer> myMap = new TreeMap<String, Integer>();
    
    StringTokenizer st = new StringTokenizer(updated2.toLowerCase());
    
    while (st.hasMoreTokens()) {
    String nextToken = st.nextToken();
    
    String myKeyValue = (String)myMap.get(nextToken);
    if(myKeyValue == null)
    {
    myMap.put(nextToken, 1);
    }
    else
    {
    myMap.put(nextToken, myMap.get(nextToken) + 1);
    }
    System.out.println(nextToken);
    }
    
    Collections.sort(myArraylist);
    
    String outPutFileName = fileName + "sorted.txt";
    
    
    
    PrintStream ps = new PrintStream( outPutFileName );
    
    ps.print(myArraylist.toString());
    
    ps.flush();
    
    ps.close();
    
    
    }
    }
    catch (Exception e){
    System.out.println(e.toString());
    
    }
    
    
    
    
    
    }
    
    
    
    }

  16. #16
    kalata is offline Member
    Join Date
    Aug 2011
    Location
    Bulgaria
    Posts
    29
    Rep Power
    0

    Default Re: Removing Certain Punctuation Char

    Hi,
    I recommend the following things:

    1) Whenever you have a problem with some class from the Java API, have a look at its documentation (the documentation is great).
    In this case, for TreeMap, have a look at TreeMap (Java Platform SE 6). The first half of the overview tells you all you need to know for the time being. :)

    2) Check carefully whether all the variables/objects you are using are declared and initialized. Once more you are trying to use an undeclared object:
    Java Code:
     
    myMap= 0;

    3) Try printing the content of myMap after you have finished populating it, just to see what you have and to think about what you have to do next.
    Java Code:
    for(String key : myMap.keySet()) {
        System.out.println(key + " " + myMap.get(key));
    }
    4) Last but not least - no problem :) Just read the replies carefully and think them over. For example, you still haven't got rid of myArraylist (and all references to it). I also found out that I misled you about where the declaration and initialization of myMap should be: it should be before the outer while-loop instead of the inner one. So move this statement
    Java Code:
    TreeMap<String, Integer> myMap = new TreeMap<String, Integer>();
    before the outer while-loop so your code looks like this
    Java Code:
    TreeMap<String, Integer> myMap = new TreeMap<String, Integer>();
    String s = null;
    BR,
    Kalin
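Putting the pieces of the thread together, the whole task might look roughly like the sketch below. Several things here are assumptions rather than anything confirmed in the thread: the class and method names, taking the file name from args[0] instead of prompting, reading the output name rule as "input name plus _sorted.txt", stripping any run of trailing single quotes from a token, and using the Java 7 java.nio.file API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SortedWordFile {
    // The characters to remove; a run of them is replaced by one space.
    private static final String PUNCTUATION = "[.,?!:;/\\\\|\"^*+=_(){}\\[\\]<>]+";

    // Scrubs one line and splits it into lower-cased tokens.
    static List<String> tokens(String line) {
        List<String> result = new ArrayList<String>();
        for (String raw : line.replaceAll(PUNCTUATION, " ").toLowerCase().split("\\s+")) {
            String token = raw.replaceAll("'+$", ""); // single quotes only at the end of a token
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Counts tokens over all lines; the TreeMap keeps the words sorted.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            for (String token : tokens(line)) {
                Integer seen = counts.get(token);
                counts.put(token, seen == null ? 1 : seen + 1);
            }
        }
        return counts;
    }

    // Formats each entry the way the assignment shows it: "word *count".
    static List<String> format(Map<String, Integer> counts) {
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(e.getKey() + " *" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        // Assumed naming: full input name followed by "_sorted.txt".
        Files.write(Paths.get(args[0] + "_sorted.txt"), format(count(lines)), StandardCharsets.UTF_8);
    }
}
```

For the sample input from the first post, format(count(...)) yields the five lines file *3, is *1, my *3, this *1, yes *1, one per line in the output file.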
