Regular expressions - extracting paragraphs
Hi there. First post here. I was hoping someone could help me with a tiny problem.
I'm trying to extract a paragraph from a string of text. After hours of learning about regex and trying different combinations, I finally got close to what I wanted. Just a small problem.
For the sake of simplicity, I'm defining a paragraph as one or more lines of text followed by a blank line or EOF. The problem with my expression is that the match includes the trailing \r\n, which would be OK if I were concerned with the letters, but I'm really only concerned with the indices of the first and last letter/punctuation of the paragraph.
I know I could subtract 2 from the ending index but that seems rather hackish.
This is what I have so far - if I'm going about this all wrong then please let me know.
((\r\n)?+.+)+\r\n\r\n
match = "foobar\\r\\n\\r\\n";
I only want "foobar", not "foobar\r\n\r\n", but I get the latter. How can I resolve this? I tried backreferencing it, but I apparently don't know what I'm doing, because every time I tried, the program hung (infinite loop?).
P.S. I know there are different line terminators on different platforms, but let's keep it simple and only use \r\n for now ;)
Re: Regular expressions - extracting paragraphs
Quote:
... from a string of text.
You've left out one vital piece of information. Where does the "string of text" come from? Reading a file? getText() of a JTextComponent? Database field?
Quote:
match = "foobar\\r\\n\\r\\n";
That String contains neither carriage returns nor linefeeds.
A clearer description will get you more targeted advice, but in the meantime you might want to read up on non-capturing groups. And to get better help sooner, post an SSCCE (Short, Self Contained, Compilable and Executable) example that demonstrates the problem: a class with a main(...) method that members here can copy and run to see where it's going wrong.
db
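To illustrate the point about the literal, here is a minimal Java sketch (the class and method names are made up for illustration) that compares the doubled-backslash literal from the question with one that holds real carriage returns and linefeeds.

```java
class EscapeDemo {
    // Counts the carriage-return characters actually present in a string.
    static int countCarriageReturns(String s) {
        int n = 0;
        for (char c : s.toCharArray()) {
            if (c == '\r') {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Each "\\" in source code is ONE backslash character at run time,
        // so this string holds the visible text \r\n\r\n, not line breaks.
        String escaped = "foobar\\r\\n\\r\\n";
        // Here "\r" and "\n" are single control characters.
        String real = "foobar\r\n\r\n";

        System.out.println(escaped.length());              // 14
        System.out.println(real.length());                 // 10
        System.out.println(countCarriageReturns(escaped)); // 0
        System.out.println(countCarriageReturns(real));    // 2
    }
}
```

The length difference (14 versus 10) comes from the four backslash-plus-letter pairs taking two characters each in the first string.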
Re: Regular expressions - extracting paragraphs
Do you mean
match = "foobar\r\n\r\n";
?
In your example, why not simply remove the carriage returns and newlines? match = match.replaceAll("\\s", ""); (\s is equivalent to [ \t\n\x0B\f\r]; you could also match only [\n\r].)
If you really do mean match = "foobar\\r\\n\\r\\n"; then use match = match.replaceAll(Pattern.quote("\\r\\n"), "");
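The three replacement variants mentioned in this post can be compared side by side. This is only a sketch; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class StripDemo {
    // Removes every whitespace character, including spaces between words.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s", "");
    }

    // Removes only carriage returns and linefeeds.
    static String stripLineBreaks(String s) {
        return s.replaceAll("[\\r\\n]", "");
    }

    // Removes the four visible characters \r\n (backslash, r, backslash, n).
    static String stripEscapedBreaks(String s) {
        return s.replaceAll(Pattern.quote("\\r\\n"), "");
    }

    public static void main(String[] args) {
        System.out.println(stripWhitespace("foo bar\r\n\r\n"));         // foobar
        System.out.println(stripLineBreaks("foo bar\r\n\r\n"));         // foo bar
        System.out.println(stripEscapedBreaks("foobar\\r\\n\\r\\n"));   // foobar
    }
}
```

Note that \s also removes the spaces inside the paragraph, and that any replacement shortens the string, so indices taken from the result no longer line up with the original text.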
Re: Regular expressions - extracting paragraphs
Quote:
Originally Posted by DarrylBurke
You've left out one vital piece of information. Where does the "string of text" come from? Reading a file? getText() of a JTextComponent? Database field?
That String contains neither carriage returns nor linefeeds.
A clearer description will get you more targeted advice, but in the meantime you might want to read up on non-capturing groups. And to get better help sooner, post an SSCCE (Short, Self Contained, Compilable and Executable) example that demonstrates the problem: a class with a main(...) method that members here can copy and run to see where it's going wrong.
db
Thank you!! (non-capturing groups)
Solved: ((\r\n)?+.+)+(?=\r\n\r\n) //although I have a feeling there's a more elegant way
But you bring up another question: when you asked where the string came from, what difference does it make? I have been testing this by reading it in from a file stream one character at a time and appending each character to a string. But I eventually plan to use my regexes with RTF documents, and I'm a bit worried about whether everything will work the way I planned once I get to that stage.
Btw: I used to program C++ a long time ago but haven't programmed anything in about 6 years. I've only been learning Java for about 4 days. I'm sort of learning on the fly as I work on my project. It's not just regex that I'm unfamiliar with.
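For anyone who wants to check the lookahead expression from this post, here is a small sketch that reports the match indices for the first paragraph only. The class and method names are invented for the example, and the input is assumed to use \r\n line terminators as in the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphIndexDemo {
    // The expression from the post, written as a Java string literal.
    // The lookahead (?=...) checks for the blank line without consuming it.
    static final Pattern PARAGRAPH =
            Pattern.compile("((\\r\\n)?+.+)+(?=\\r\\n\\r\\n)");

    // Returns {start, endExclusive} of the first match, or null if there is none.
    static int[] firstParagraph(String text) {
        Matcher m = PARAGRAPH.matcher(text);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        String text = "foobar\r\n\r\nnext paragraph";
        int[] span = firstParagraph(text);
        System.out.println(span[0] + ".." + span[1]);          // 0..6
        System.out.println(text.substring(span[0], span[1]));  // foobar
    }
}
```

Matcher.end() is exclusive, so the index of the last character of the paragraph is end() - 1 (5 for "foobar" here).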
Re: Regular expressions - extracting paragraphs
Quote:
Originally Posted by eRaaaa
In your example, why not simply remove the carriage returns and newlines? match = match.replaceAll("\\s", ""); (\s is equivalent to [ \t\n\x0B\f\r]; you could also match only [\n\r].)
Because I wanted to simplify it. I tried a million different things. I'm not familiar with regex but I knew for sure \r\n\r\n would match.
Also, I didn't want to replace anything because I want the indices, not the characters. I was afraid replacing would mess up the indices.
Re: Regular expressions - extracting paragraphs
Quote:
When you asked where the string came from, what difference does it make? I have been testing this by reading it in from a file stream one character at a time and appending each character to a string.
It makes a lot of difference. The backslash is an escape character in String literals, but not in Strings read in from a file.
A line in a file Code:
FirstLine\r\nSecond Line
is equivalent to the String literal Code:
"FirstLine\\r\\nSecond Line"
and contains neither a carriage return nor a linefeed. The String literal Code:
"FirstLine\r\nSecond Line"
would be one possibility* when read from a file with the content Code:
FirstLine
Second Line
*(depending on how the file was created/saved)
db
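The file-versus-literal distinction can be tried out with a temporary file. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes the file was saved with \r\n line endings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileVsLiteralDemo {
    // Writes the text to a temporary file and reads it back unchanged.
    static String roundTrip(String fileContent) {
        try {
            Path tmp = Files.createTempFile("paragraph-demo", ".txt");
            try {
                Files.write(tmp, fileContent.getBytes(StandardCharsets.US_ASCII));
                return new String(Files.readAllBytes(tmp), StandardCharsets.US_ASCII);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        char backslash = (char) 0x5C, cr = (char) 0x0D, lf = (char) 0x0A;

        // A one-line file whose text shows a visible backslash-r-backslash-n.
        String visible = roundTrip("FirstLine" + backslash + "r" + backslash + "n" + "Second Line");
        // A two-line file saved with CR LF between the lines.
        String twoLines = roundTrip("FirstLine" + cr + lf + "Second Line");

        System.out.println(visible.equals("FirstLine\\r\\nSecond Line"));  // true
        System.out.println(twoLines.equals("FirstLine\r\nSecond Line"));   // true
        System.out.println(visible.length() + " " + twoLines.length());    // 24 22
    }
}
```

Nothing is unescaped when a file is read: backslashes in the file stay backslashes, and only the compiler turns \r and \n inside a literal into control characters.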