I need a PCRE regex that selects all HTML img tags without a src attribute. Long story. With help I got to (?-s)<img(?!.*?src).*?\/> which worked fine until a line contained a second img tag WITH a src attribute. The regex then matched from the first <img to the last /> :(

How can I select the bad part <img border="0" /> from:

<p align="center"><img border="0" /> <a href="http://www.megaevent2014.com/enllac/"><img alt src="http://www.megaevent2014.com/banner/gran/" /></a></p>

In one regular expression.

And the img tags can be invalid for a lot of reasons. Weeding out "border" does not help. I need to select the tags without src, not caring about anything else.

Please advise. Best regards, Peter

  • See this answer on stackoverflow!! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Lord Peter Jun 24 '14 at 12:06
  • @LordPeter True, it cannot be performed in a failproof manner. But in certain limited contexts it can work. – LatinSuD Jun 24 '14 at 12:47

1 Answer

The following regex pattern works for me, and should be well-formed for PCRE Regex:

<img(\s*(?!src)([\w\-])+=([\"\'])[^\"\']+\3)*\s*\/?>
  • To break it down, you start with the literal <img, and then the \s* matches any white space character [\r\n\t\f ] zero or unlimited times.
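  • The (?!src) is a negative lookahead: it asserts that the attribute name starting at this position does not begin with src, so the repeated group can never consume a src attribute (or one such as srcset).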
  • The second capture group, ([\w\-])+, matches the attribute name: one or more characters from [a-zA-Z0-9_] or a literal hyphen (\-), greedily (as many as possible). Because the + sits outside the parentheses, the group itself captures only the last character of the name.
  • The = is a literal search for an equal sign.
  • The third capture group, ([\"\']), matches the opening single or double quote; [^\"\']+ then matches one or more characters that are neither kind of quote, and the \3 backreference matches the same quote character that group three captured, closing the value. The * after the enclosing group lets this whole attribute pattern repeat zero or more times.
  • Finally the \s* matches any white space character [\r\n\t\f ] zero or unlimited times, the \/? matches an optional forward slash (zero or one time), and the > is the closing bracket of the entire affair.
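
As a quick sanity check, here is a minimal sketch that runs the pattern above against the sample line from the question. It assumes Python's re module, which handles the lookahead and backreference used here the same way; the names IMG_WITHOUT_SRC, sample and matches are made up for illustration.

```python
import re

# The pattern from this answer, copied verbatim into a raw string.
IMG_WITHOUT_SRC = re.compile(
    r"""<img(\s*(?!src)([\w\-])+=([\"\'])[^\"\']+\3)*\s*\/?>"""
)

# The sample line from the question: one img without src, one with src.
sample = (
    '<p align="center"><img border="0" /> '
    '<a href="http://www.megaevent2014.com/enllac/">'
    '<img alt src="http://www.megaevent2014.com/banner/gran/" /></a></p>'
)

# Use finditer + group(0) to get whole matches; findall would return
# the capture groups instead of the full tag.
matches = [m.group(0) for m in IMG_WITHOUT_SRC.finditer(sample)]
print(matches)  # ['<img border="0" />']
```

Note that the pattern only accepts attributes written as name="value" or name='value' with a non-empty value, so an img tag without src that has a bare or unquoted attribute will not be matched either.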

Regex is fun. :-)

dashard