Hi, @carolin-marschke,
I suppose that you're speaking about the tag below, which generally starts an XML file :
<?xml version="1.0" encoding="UTF-8" ?>
Your troubles come from the Question mark character ( ? ), which is considered, in the regex universe, as a special character, also called a meta-character. So, by default, these characters are NOT simple literal characters !
There are two different sets of meta-characters :
Those, which are recognized anywhere in the pattern, EXCEPT within square brackets :
\ General escape character, with several uses
^ Start of a line ( or a file )
$ End of a line ( or a file )
. Match any character, except new-line ones ( by default )
[ Start a class definition or a class range
| Start of an alternative branch
( Start of a sub-pattern group
) End of a sub-pattern group
{ Start a Min / Max range of a quantifier
} End of a Min / Max range of a quantifier
* 0 or more times, the preceding character or group
+ 1 or more times, the preceding character or group
  Possessive behaviour, right after the quantifiers *, + and ?
? 0 or 1 time, the preceding character or group
  Meaning extender, right after an opening parenthesis, for groups or conditions, (?....)
  Minimizer, right after the quantifiers *, + and ?
Those, which are recognized within square brackets ( character class ), EXCLUSIVELY :
\ General escape character
^ Negate the class, if first character of the class
- Character range indicator
[: Start of a POSIX character class, if followed by regular POSIX syntax
:] End of a POSIX character class
So, Carolin, if you need to search for any of the above characters, as a literal, you must escape it with the backslash character \
Therefore, your regex must be rewritten as : (?i)<\?xml[^>]*>, with a \, right before the ? character !
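To see the effect of the missing backslash outside Notepad++, here is a minimal sketch using Python's `re` module. This is an assumption on my side : Python's engine is not the Boost engine of Notepad++, but it handles these particular patterns the same way. The variable names are only illustrative.

```python
import re

line = '<?xml version="1.0" encoding="UTF-8" ?>'

# Unescaped: "<?" means "an optional <", so the match only starts at "xml"
unescaped = re.search(r'(?i)<?xml[^>]*>', line)

# Escaped: "\?" is a literal question mark, so the whole tag is matched
escaped = re.search(r'(?i)<\?xml[^>]*>', line)

# Inside square brackets, ? is not a meta-character, so [?] is an alternative
in_class = re.search(r'(?i)<[?]xml[^>]*>', line)

print(unescaped.group())
print(escaped.group())
print(in_class.group())
```

The first search still finds something, but its match leaves out the leading <? of the tag, which is easy to overlook.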
However, the regex (?-is)<\?xml.+?> gives better results ! Indeed, due to the -i modifier, the search is case-sensitive and, due to the -s modifier, any dot matches a single standard character only, never an EOL character. So, after matching the literal string <?xml, with that exact case ( <\?xml ), it looks for the shortest non-null range of characters ( .+? ), of the current line, till a closing symbol >
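The lazy quantifier and the case sensitivity can be checked with a short Python sketch. Two assumptions here : Python's `re` module does not accept the leading (?-is) syntax, so it is simply dropped, as case-sensitive matching and a dot that does not match a line feed are already Python's defaults. The sample text with the <root> element is made up for the illustration.

```python
import re

# Case-sensitive, dot does not match "\n": Python defaults, standing in for (?-is)
pattern = re.compile(r'<\?xml.+?>')

# Hypothetical sample: the declaration followed by other tags on the same line
text = '<?xml version="1.0" encoding="UTF-8" ?><root><item/></root>'

# The lazy .+? stops at the FIRST > it can reach
lazy_match = pattern.search(text).group()

# A greedy .+ runs up to the LAST > of the line
greedy_match = re.search(r'<\?xml.+>', text).group()

# Upper-case <?XML is not matched, because the search is case-sensitive
upper_match = pattern.search('<?XML version="1.0" ?>')

print(lazy_match)
print(greedy_match)
print(upper_match)
```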
Assuming that the single XML line <?xml version="1.0" encoding="UTF-8" ?> were split into four lines, as below :
<?xml ver
sion="1.0" enc
oding="UT
F-8" ?>
My regex doesn't match this incorrect text, unlike your regex, because the class [^>] also matches EOL characters ! To get the same behaviour, you should change your regex to (?i)<\?xml[^>\r\n]*>. This time, the syntax [^>\r\n] matches any character different from the > character AND different, also, from any EOL character :-))
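The comparison between the three regexes can be reproduced with the Python sketch below. Again, this is an assumption-laden illustration, not the Notepad++ engine : Windows line endings ( \r\n ) are assumed for the split text, the (?-is) prefix is dropped because it corresponds to Python's defaults, and Python's dot only excludes \n, which gives the same result on this sample.

```python
import re

# The declaration split over four lines, with assumed Windows EOLs
split_text = '<?xml ver\r\nsion="1.0" enc\r\noding="UT\r\nF-8" ?>'
single_line = '<?xml version="1.0" encoding="UTF-8" ?>'

across_lines = re.compile(r'(?i)<\?xml[^>]*>')        # [^>] also matches EOL chars
one_line_only = re.compile(r'(?i)<\?xml[^>\r\n]*>')   # EOL chars are excluded
dot_version = re.compile(r'<\?xml.+?>')               # dot cannot cross "\n"

# Only the [^>]* version matches the split declaration
split_results = [
    across_lines.search(split_text) is not None,
    one_line_only.search(split_text) is not None,
    dot_version.search(split_text) is not None,
]

# All three match the normal, single-line declaration
single_results = [
    p.search(single_line) is not None
    for p in (across_lines, one_line_only, dot_version)
]

print(split_results)
print(single_results)
```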
Cheers,
guy038