Greetings! Unfortunately, I don't believe your regular expression would work for tags that span multiple lines. I think what you need is to compile the pattern manually with Pattern.DOTALL, like so:
import java.util.regex.Pattern;

public void testStripTagRegex() {
    // DOTALL lets '.' match line terminators, so a tag may span several lines
    String regex = "\\<.*?>";
    Pattern p = Pattern.compile(regex, Pattern.DOTALL);
    assertTrue(p.matcher("<em>").replaceAll("").length() == 0);
    // This is the case that fails without DOTALL: the tag contains a newline
    assertTrue(p.matcher("<html \nattributes=\"blah blah blah blah\">").replaceAll("").length() == 0);
    assertTrue(p.matcher("</strong>").replaceAll("").length() == 0);
}
Using your regex, the second test fails. Adding DOTALL fixed it.
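For anyone who wants to see the difference outside a test harness, here is a minimal sketch that runs the same pattern with and without the flag. The class and method names (DotAllDemo, strip) are made up for illustration and are not from the original post.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: compares the tag-stripping regex with and without DOTALL.
final class DotAllDemo {
    static final String TAG = "<.*?>";

    // Compiles TAG with the given flags and removes every match from the input.
    static String strip(String input, int flags) {
        return Pattern.compile(TAG, flags).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String multiLine = "<html \nattributes=\"blah\">";
        // Default flags: '.' stops at the newline, so the tag is never matched.
        System.out.println(strip(multiLine, 0).isEmpty());              // prints false
        // DOTALL: '.' also matches the newline, so the whole tag is removed.
        System.out.println(strip(multiLine, Pattern.DOTALL).isEmpty()); // prints true
    }
}
```

Passing the flags in as a parameter is only there to make the comparison easy; in real code you would compile the pattern once and reuse it.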
HTH!
-Mike (R)
P.S. Sorry that the code is munged, Google wouldn't allow pre tags :)
ah ha! this is awesome. thanks!
Except this completely fails for things like: "5 < 6"
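A small sketch of that failure mode, in case it helps. As far as I can tell, a lone "5 < 6" with no later greater-than sign is left alone; the trouble starts when a real tag follows, because the lazy match runs from the stray less-than sign to the next ">". The class name LessThanPitfall is made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows a stray '<' in ordinary text being treated as a tag start.
final class LessThanPitfall {
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The first match is "< 6 then <b>", not just "<b>".
        System.out.println(stripTags("if 5 < 6 then <b>yes</b>")); // prints "if 5 yes"
    }
}
```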
it seems there isn't any magic way to do this :(
As an alternate solution which doesn't have the issue of killing everything after a less-than sign, one could use something like this:
String regex = "/?(div|p).*?";
Pattern p = Pattern.compile(regex, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
NOTE: put a "\\<" before the regex string and a greater-than sign at the end of the regex string above. Ironically, Google filters it out because it thinks it's an HTML tag.
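Written out in full, the tag-specific version might look like the sketch below. One assumption here is mine and not in the snippet above: I added a \b word boundary after the tag name so that "p" does not also match tags like pre or param. The class name TagSpecificStrip is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: strips only the listed tags (div and p), ignoring case.
final class TagSpecificStrip {
    // "\\b" is an addition for this sketch; without it "<pre>" would match via the "p" alternative.
    private static final Pattern LISTED_TAGS = Pattern.compile(
            "</?(div|p)\\b.*?>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static String strip(String input) {
        return LISTED_TAGS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The stray '<' in "5 < 6" is not followed by a listed tag name, so it survives.
        System.out.println(strip("<DIV class=\"x\">5 < 6</div><p>hi</p>")); // prints "5 < 6hi"
    }
}
```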
Of course, this means you have to specify every tag you want to strip out, which may not be feasible. Besides, blacklists are not a good way of doing security, since it's usually pretty easy to get around the filter and still have the browser interpret the markup as desired.
So it really depends on what you're trying to accomplish, where your data is coming from, and where it's going. If the concern is security, e.g. data going from one user to another user's browser, neither of these is probably what you want. In that case you'd want to encode all the HTML special characters as entities (i.e. < and >) and then replace approved ones with the HTML again.
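A rough sketch of that encode-then-restore idea follows. The approved list (b, i, em, strong) and the class name EscapeThenAllow are made up for illustration, and only attribute-free tags are restored. Treat it as a starting point under those assumptions, not a vetted sanitizer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: escape everything first, then restore a small approved set of tags.
final class EscapeThenAllow {
    // Assumed approved list for this sketch; matches the escaped form of <b>, </b>, <em>, etc.
    private static final Pattern APPROVED = Pattern.compile(
            "&lt;(/?)(b|i|em|strong)&gt;", Pattern.CASE_INSENSITIVE);

    // Escapes '&' first so that entities already in the input cannot be mistaken for our own.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Turns the escaped form of approved, attribute-free tags back into real markup.
    static String sanitize(String s) {
        return APPROVED.matcher(escape(s)).replaceAll("<$1$2>");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("5 < 6 <em>ok</em> <script>x</script>"));
        // prints: 5 &lt; 6 <em>ok</em> &lt;script&gt;x&lt;/script&gt;
    }
}
```

Because everything is escaped before anything is restored, markup that is not on the list reaches the browser as plain text rather than being silently dropped.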
Here is a way to cleanly remove HTML tags from text without using regular expressions, using javax.swing.text.html.HTMLEditorKit:
http://stackoverflow.com/questions/240546/removing-html-from-a-java-string
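For reference, a minimal sketch of the parser-based approach, using HTMLEditorKit.ParserCallback together with ParserDelegator from javax.swing.text.html.parser. The class name Html2Text and the helper extract are placeholders of mine, and this is only my reading of the technique. Text from adjacent elements is appended as-is, with no separators added.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class: collects only the text nodes reported by Swing's HTML parser.
final class Html2Text extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    // Called by the parser for each run of character data between tags.
    @Override
    public void handleText(char[] data, int pos) {
        text.append(data);
    }

    // Parses the given HTML string and returns the concatenated text content.
    static String extract(String html) {
        Html2Text callback = new Html2Text();
        try (Reader in = new StringReader(html)) {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("<p>Hello <b>world</b></p>"));
    }
}
```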