Friday, September 05, 2008

Regular Expression in Java to Strip HTML Tags

A regular expression in Java to strip HTML tags, just for my own reference:

sMyString = sMyString.replaceAll("\\<.*?>", "");
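
For reference, here is a minimal runnable sketch of that one-liner in use. The class name StripTags, the method name stripTags, and the sample input are made up for illustration; only the regex comes from the line above.

```java
class StripTags {
    // Removes every run of text that starts with '<' and ends at the next '>'.
    // The lazy quantifier (.*?) stops at the first '>' instead of the last one.
    static String stripTags(String s) {
        return s.replaceAll("\\<.*?>", "");
    }

    public static void main(String[] args) {
        String html = "<p>Hello, <b>world</b>!</p>";
        System.out.println(stripTags(html)); // prints "Hello, world!"
    }
}
```

Note that this treats anything between a '<' and the next '>' on the same line as a tag, whether or not it really is one.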



6 comments:

  1. Greetings! Unfortunately, I don't believe your regular expression would work for tags that span multiple lines. I think what you need is to compile the pattern manually with Pattern.DOTALL, like so:

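    // Needs: import java.util.regex.Pattern; plus JUnit's assertTrue
    public void testStripTagRegex() {
        String regex = "\\<.*?>";
        // DOTALL lets '.' match line terminators, so tags that span lines are matched
        Pattern p = Pattern.compile(regex, Pattern.DOTALL);

        // A simple opening tag
        assertTrue(p.matcher("<em>").replaceAll("").length() == 0);

        // A tag broken across two lines
        assertTrue(p.matcher("<html \nattributes=\"blah blah blah blah\">").replaceAll("").length() == 0);

        // A closing tag
        assertTrue(p.matcher("</strong>").replaceAll("").length() == 0);
    }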

    Using your regex, the second test fails. Adding DOTALL fixed it.

    HTH!
    -Mike (R)

    P.S. Sorry that the code is munged, Google wouldn't allow pre tags :)

  2. ah ha! this is awesome. thanks!

  3. Except this completely fails for text containing things like "5 < 6": everything from the less-than sign up to the next ">" gets stripped.

  4. it seems there isn't any magic way to do this :(

  5. As an alternate solution which doesn't have the issue of killing everything after a less-than sign, one could use something like this:

    String regex = "\\</?(div|p).*?>";
    Pattern p = Pattern.compile(regex, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    NOTE: the regex string starts with "\\<" and ends with a greater-than sign. Ironically, Google tends to filter those out of a comment because it thinks they're an HTML tag.

    Of course, this means you have to specify every tag you want to strip out, which may not be feasible. Besides, blacklists are not a good way of doing security, since it's usually pretty easy to get around the filter and still have the browser interpret the markup as desired.

    So it really depends on what you're trying to accomplish, where your data is coming from, and where it's going. If the concern is security, with data going from one user to another user's browser, neither of these regexes is probably what you want. In that case you'd want to encode the HTML special characters (i.e. < and >) as entities and then replace the approved ones with the HTML again.

  6. Anonymous, 10:09 AM

    Here is a way to cleanly remove HTML tags from text without using regular expressions, using javax.swing.text.html.HTMLEditorKit:
    http://stackoverflow.com/questions/240546/removing-html-from-a-java-string
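
The HTMLEditorKit idea from comment 6 might look roughly like the sketch below. It assumes the Swing HTML parser is available (the java.desktop module on recent JDKs). The class name Html2Text and the method name toText are illustrative only, and this is not necessarily the exact code behind the link.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Collects only the text nodes the Swing HTML parser reports, ignoring all tags.
class Html2Text extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        // Called once per run of character data between tags.
        text.append(data);
    }

    // Hypothetical convenience wrapper: parse an HTML string, return its text.
    static String toText(String html) {
        Html2Text callback = new Html2Text();
        try {
            // Third argument: ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for an in-memory StringReader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }
}
```

A call such as Html2Text.toText("<p>Hello, <b>world</b>!</p>") should return the text without the markup. Exactly how whitespace between text runs is preserved depends on the parser, so it is worth checking against your own input.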

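
The "encode everything, then restore approved tags" approach suggested in comment 5 could be sketched as follows. The class name AllowListHtml, the method name sanitize, and the particular set of approved tags (b, i, em, strong, with no attributes) are assumptions made for this example. A real application would probably want a vetted sanitizer library rather than this.

```java
import java.util.regex.Pattern;

class AllowListHtml {
    // Escaped form of an approved tag: optional slash, tag name, no attributes.
    // The tag list here is a hypothetical choice for illustration.
    private static final Pattern APPROVED =
        Pattern.compile("&lt;(/?)(b|i|em|strong)&gt;", Pattern.CASE_INSENSITIVE);

    static String sanitize(String input) {
        // Step 1: encode the special characters. '&' goes first so the
        // entities produced by the later replacements are not re-escaped.
        String escaped = input
            .replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;");

        // Step 2: turn the escaped form of approved tags back into real tags.
        return APPROVED.matcher(escaped).replaceAll("<$1$2>");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("5 < 6 <b>ok</b> <script>x</script>"));
        // prints: 5 &lt; 6 <b>ok</b> &lt;script&gt;x&lt;/script&gt;
    }
}
```

Because only bare tags are restored, anything carrying attributes (for example an onclick handler) stays encoded, and a stray "5 < 6" survives as visible text instead of being deleted.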