Regex help please to remove Special characters inside xml tags

PaulSc

Hi… I work in a “locked-down” environment with XML files and only really have the vanilla version of Notepad++ to help, i.e. no plugins :-(

I’m after some help please with a regex to edit/correct XML files where a specific tag needs to be updated to remove “special characters”, i.e. ?'s, spaces, *'s, ^'s and perhaps most importantly /'s… whilst maintaining the XML tags and what’s left of the data after the special chars have gone…

For example
<Ref>1234/567890</Ref> or
<Ref>123 TEST 23/*</Ref>

We need to remove the offending characters but keep the rest of the text/tags to leave
<Ref>1234567890</Ref> or
<Ref>123TEST23</Ref>

Our XSD validation only allows 0-9 and A-Z between the <REF></REF> tags…so anything else is “special”

We’ve tried various one-liners without much success and are after a bit of advice/pointers please, especially as we’re now looking at 6-million-row files daily…

Any help gratefully received!..

Cheers
PaulSc.

Terry R

@PaulSc
First off, I’m wondering if Notepad++ can deal with the VERY large files you mention (6 million rows). Regardless of that, I think the following regex will meet your needs. As it stands it needs to be run multiple times, as each run only pulls 1 ‘unwanted’ character from each <Ref></Ref> tag. So you keep running it until the result is “0 occurrences were replaced”. There may well be a better method, and if anyone can identify it, @guy038 can.
So using the Replace function we have
Find What:(?i)(<Ref>.*?)[^a-z0-9](?=.*</Ref>)
Replace With:\1

Use Replace All with wrap around ticked and the search mode set to ‘Regular expression’. Run it until you get the above response (0 occurrences).
The (?i) makes the match case-insensitive, so [^a-z0-9] leaves both upper and lower case letters alone. Each match has to begin at the start of a tag, (<Ref>.*?), and the lookahead (?=.*</Ref>) requires a closing tag later on the same line; together these prevent removing unwanted characters that are not inside the tag.

I have not catered for a tag split across lines, or for several <Ref> tags on one line; if your data has either, you will need to provide a fuller description of it.
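
The "run it until nothing is replaced" loop can also be sketched outside Notepad++. The Python snippet below is only an illustration, not part of the advice above: the function name is hypothetical, and the pattern is a tightened variant of the one above (an assumption) that uses [^<>] in place of the dot and never treats <, > or line breaks as unwanted, so that tags and line endings are left alone.

```python
import re

# Tightened variant of the pattern above (an assumption, not the original):
# [^<>] replaces ".", and <, >, CR and LF are never treated as unwanted.
ONE_CHAR = re.compile(r"(?i)(<Ref>[^<>]*?)[^a-z0-9<>\r\n](?=[^<>]*</Ref>)")

def strip_until_stable(text):
    """Repeat the one-character-per-tag replacement until nothing is replaced."""
    passes = 0
    while True:
        text, replaced = ONE_CHAR.subn(r"\1", text)
        if replaced == 0:          # the "0 occurrences were replaced" case
            return text, passes
        passes += 1

sample = "<Ref>1234/567890</Ref>\n<Ref>123 TEST 23/*</Ref>"
cleaned, passes = strip_until_stable(sample)
print(cleaned)
print(passes, "passes made at least one replacement")
```

The number of passes equals the largest number of unwanted characters found in any single tag, which is why this approach can need many runs on messy data.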

Have a go and let us know either way.

Terry

Robin Cruise

Find What:(?i)(<Ref>.*?)[^a-z0-9](?=.*</Ref>)
Replace With:\1

It seems that your regex does not work for both lines (if I press Replace All, the / and * are still there). Maybe you should consider at least 2 lines, not just the first or the last one.

Terry R

@Robin-Cruise
I think you missed the point. I mentioned it will only “pull 1 unwanted character from each tag each time it runs”. Later on I said:

Run it until you get the above response (0 occurrences).

So it definitely pulls every unwanted character from each of the tags, it just takes them 1 at a time, so it needs running multiple times until there are none left to pull.

Terry

guy038

Hello @paulsc, @terry-r, @robin-cruise and All,

I found a nice regex which can do the job in one go !

I was inspired by the generic regex, discussed here :

https://notepad-plus-plus.org/community/topic/16533/how-to-remove-empty-spaces-from-a-particular-tag-regular-expression/6

So, I began with this regex :

(?s)(\G|<Ref>)((?!<|>).)*?\K[^\r\n\w<>]+

After some tests I realized that, instead of the negative look-ahead (?!<|>). syntax, we could simply use the negated character class [^<>].

Indeed, the interest of the negative look-ahead is obvious when you need to avoid certain strings. For instance, the regex (?i)123((?!A simple test|abc|OK).)*789 ensures that the range of characters between the numbers 123 and 789 never contains any of the strings A simple test, abc or OK, whatever their case, in order to get a match !

But, when you simply need to avoid some single characters, the negated character class syntax is easier to understand, to my mind ! For instance, the regex 123[^#@+]*789 ensures that the range of characters between the numbers 123 and 789 never contains any of the characters #, @ or +, in order to get a match !
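
Both example regexes can be tried in Python's re module, which handles these two constructs the same way. This is only a quick illustrative check; the test strings and the names NO_STRINGS and NO_CHARS are made up for the example.

```python
import re

# Look-ahead version: no position between 123 and 789 may start a forbidden string.
NO_STRINGS = re.compile(r"(?i)123((?!A simple test|abc|OK).)*789")
# Negated character class: no single character between 123 and 789 may be #, @ or +.
NO_CHARS = re.compile(r"123[^#@+]*789")

print(bool(NO_STRINGS.search("123 hello 789")))      # True
print(bool(NO_STRINGS.search("123 it is ok 789")))   # False: "ok" is forbidden, whatever the case
print(bool(NO_CHARS.search("123abc789")))            # True
print(bool(NO_CHARS.search("123a#b789")))            # False: "#" is forbidden
```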

Just notice that, inside a character class, most symbols do not need to be escaped in order to be considered as literals ! The only characters which need to be preceded by the \ escape symbol are :

The - dash, which defines a range of characters allowed or forbidden
The two square brackets [ and ], which are the boundaries of the character class
The ^ caret symbol, which defines a negated character class
The \ escaping char, itself

So, for instance :

The regex [\^&\\~\]=] would look for any char ^, &, \, ~, ] or =, whereas
The regex [^(\[)\-%] would look for any char, different from (, [, ), - and %

Of course, at some specific locations inside the character class ( for instance, a - placed last or a ^ placed anywhere but first ), the \ character is not mandatory, but be aware that escaping is the safe method, in all cases !
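
As a quick illustrative check, the two character classes above behave the same way in Python's re module (an assumption of this sketch is that Python and the Notepad++ regex engine agree on these simple cases; the sample strings are made up).

```python
import re

# [\^&\\~\]=] matches any of the literal chars ^ & \ ~ ] =
listed = re.findall(r"[\^&\\~\]=]", r"a^b&c\d~e]f=g")
print(listed)

# [^(\[)\-%] matches any char other than ( [ ) - %
others = re.findall(r"[^(\[)\-%]", "(x[y)z-w%")
print(others)
```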

But, let’s get back to our problem ! So, we can modify the above regex as :

(\G|<Ref>)[^<>]*?\K[^\r\n\w<>]+

Notes :

I suppressed the (?s) modifier as the regex does not contain any dot char, anymore !
The negated character class, [^\r\n\w<>], at the end of the regex, lists all the characters that should… not be deleted, i.e. :
- The line-break characters, which have to be kept, of course !
- Any word character \w, which is equivalent to the regex [A-Za-z0-9_], plus all the accented characters
- The angle brackets < and >, surrounding the starting and ending tags
Note that, if you would like to keep, between the two tags <Ref> and </Ref>, let’s say, the colon punctuation sign, the dollar currency sign and the space character, simply change the regex as below :

(\G|<Ref>)[^<>]*?\K[^\r\n\w<>:$ ]+

So, this regex, after matching the starting tag <Ref>, or from the location of the end of the previous match, \G, first tries to match the shortest range of characters, even empty, all different from the angle brackets, [^<>]*?
Then, due to the \K syntax, the regex engine forgets all that was matched so far and just considers the regex part [^\r\n\w<>]+, which represents the longest range of consecutive symbols which have to be deleted

In summary, assuming the sample text below, containing <Ref>..........</Ref> regions, with two regions on the first line, the next one split over several lines, and a single region at the end, after some line-breaks :

<Ref>1234/567*89012345678+90123!!4567890</Ref>                           <Ref>1234/567-89012345678_90123?4567890</Ref>

<Ref>&1234@@567$$$89
01234||
|5678@9012.34
567890%</Ref>




<Ref>*1234\567$$$8901234 5678@90123!!4567890#</Ref>

After performing the regex S/R :

SEARCH (\G|<Ref>)[^<>]*?\K[^\r\n\w<>]+

REPLACE Leave EMPTY

We get, with only ONE click on the Replace All button, the expected results :

<Ref>123456789012345678901234567890</Ref>                           <Ref>123456789012345678_901234567890</Ref>

<Ref>123456789
01234
5678901234
567890</Ref>




<Ref>123456789012345678901234567890</Ref>

Magic isn’t it :-)) The power of these tiny bits of code is really impressive and will always surprise me !
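
For anyone wanting to cross-check the result outside Notepad++, here is a minimal Python sketch. Python's re module has neither \G nor \K, so this swaps in a different technique: a replacement callback that cleans the body of each complete <Ref>...</Ref> region. The names clean_refs, REF_REGION and UNWANTED are hypothetical, and unlike the regex above, this sketch ignores a <Ref> that has no closing tag.

```python
import re

REF_REGION = re.compile(r"(<Ref>)([^<>]*)(</Ref>)")
# Same idea as [^\r\n\w<>]+ ; < and > cannot occur inside the captured body.
UNWANTED = re.compile(r"[^\r\n\w]+")

def clean_refs(text):
    """Delete unwanted characters inside every <Ref>...</Ref> region, in one pass."""
    def fix(match):
        opening, body, closing = match.groups()
        return opening + UNWANTED.sub("", body) + closing
    return REF_REGION.sub(fix, text)

sample = (
    "<Ref>1234/567*89012345678+90123!!4567890</Ref>   +++   "
    "<Ref>1234/567-89012345678_90123?4567890</Ref>\n"
    "<Ref>&1234@@567$$$89\n01234||\n|5678@9012.34\n567890%</Ref>\n"
)
print(clean_refs(sample))
```

The +++ between the two regions of the first line is left untouched, as only text inside a region is passed to the callback.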

Best Regards,

guy038

Terry R

@guy038
I REALLY like your thinking. There’s only 1 concern I have and luckily you actually represented it in your example. On the first line in the 2nd tag you have the “_” (underscore) character.
Now
@PaulSc said:

Our XSD validation only allows 0-9 and A-Z between the <REF></REF> tags…so anything else is “special”

so I’m wondering if the use of the “\w” is perhaps a bit too encompassing. I’ve been doing some testing using your idea (forgive me) and to my mind the power of your regex really comes from using the “\G”, which allows it to “stay put” after a find, rather than mine which continually had to reset after each find starting with looking for the next “<Ref>” sequence.

Looking forward to whether the “\w” can be constrained in an elegant way so it more exactly fits the OP’s request.

Terry

guy038

Hello @paulsc, @terry-r, @robin-cruise and All,

Yes, Terry, everything follows from the zero-length assertion \G, indeed !

Roughly, the regex (\G|<Ref>)[^<>]*?\K[^\r\n\w<>]+, right after the string <Ref> or after the end of the previous match, looks for the shortest range of chars, all different from < and >, up to a run of unwanted chars, which will be deleted during the replacement phase.

So, let’s imagine the sample text :

Bla bla blah

<Ref>1234567/8901234567890123====4567890</Ref>     +++      ###    <Ref>1234567890123456789012345@@67890</Ref>

The start location is before the upper-case letter B. Strictly speaking, \G is also true at the very beginning of the search, so the spaces in Bla bla blah can be matched from there; if that matters, write (?!\A)\G instead of \G. That aside, no match beginning at this position ( \G ) can reach the first unwanted char /, as the range would have to cross the starting tag <Ref>, which contains < and >, of course !
So, the regex engine is forced to look for the second alternative, i.e. the string <Ref>. Below, I marked, with some bullet chars •, the range of characters between <Ref> and the unwanted char /

Bla bla blah

<Ref>1234567/8901234567890123====4567890</Ref>     +++      ###    <Ref>1234567890123456789012345@@67890</Ref>
     •••••••

Now, the range of characters before the next unwanted chars ==== starts from the current location ( the end of the previous match, right after the / ) and runs till the ==== string, as indicated below :

Bla bla blah

<Ref>1234567/8901234567890123====4567890</Ref>     +++      ###    <Ref>1234567890123456789012345@@67890</Ref>
             ••••••••••••••••

You agree that current location of the regex engine is, now, right after the ==== string. Well, what are the next unwanted char(s) ? Obviously, the +++ string. But, in that case, the regex engine would cross the </Ref> string, which contains the forbidden chars < and >, as we defined for any range.
Then, necessarily, the start of the next match will be located some chars after current location, so that the \G assertion is not true, anymore ! Thus, the regex engine starts looking, again, for a starting tag <Ref>, followed by some chars till the unwanted chars @@, giving :

Bla bla blah

<Ref>1234567/8901234567890123====4567890</Ref>     +++      ###    <Ref>1234567890123456789012345@@67890</Ref>
                                                                        •••••••••••••••••••••••••

Now, current location is right after the @@. Is there any other unwanted character to reach, preceded by a range without the < and > chars ? None. So, the S/R process ends !
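
The walk-through above can be mimicked in Python, as a rough sketch only. Python's re module has no \G, so the loop below emulates it: START_TAG.search plays the role of the <Ref> alternative, and NEXT_RUN.match(text, pos), which is anchored at pos, plays the role of \G. The names are hypothetical, and the sketch assumes \G is not tried at the very start of the text.

```python
import re

START_TAG = re.compile(r"<Ref>")
# Shortest range without < or >, then the run of unwanted chars (captured).
NEXT_RUN = re.compile(r"[^<>]*?([^\r\n\w<>]+)")

def unwanted_runs(text):
    """List the runs of chars that the S/R would delete, in order."""
    runs = []
    pos = 0
    while True:
        tag = START_TAG.search(text, pos)      # the <Ref> alternative
        if tag is None:
            return runs
        pos = tag.end()
        while True:
            run = NEXT_RUN.match(text, pos)    # anchored at pos, like \G
            if run is None:                    # range would cross < or >
                break
            runs.append(run.group(1))
            pos = run.end()

sample = (
    "Bla bla blah\n\n"
    "<Ref>1234567/8901234567890123====4567890</Ref>     +++      ###    "
    "<Ref>1234567890123456789012345@@67890</Ref>"
)
print(unwanted_runs(sample))   # ['/', '====', '@@']
```

The +++ and ### strings never show up in the list, because no anchored attempt can reach them without crossing a < or > char.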

Regarding the low line char _, Terry, you’re right. The correct regex for the OP’s needs would rather be :

(?i)(\G|<Ref>)[^<>]*?\K[^\r\n<>a-z0-9]+

But doing so, all accented chars ( as, for instance, é or à ) are, also, deleted :-( If you need to keep them, prefer the regex

(\G|<Ref>)[^<>]*?\K([^\r\n<>\w]|_)+
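
The difference between the two final classes can be seen in isolation with a small Python sketch. Only the last part of each regex is tested here, on a made-up tag body; the names are hypothetical, and it assumes that Python's Unicode-aware \w treats accented letters as word characters, as described above for Notepad++.

```python
import re

ASCII_ONLY = re.compile(r"(?i)[^\r\n<>a-z0-9]+")     # keeps A-Z, a-z and 0-9 only
KEEP_ACCENTS = re.compile(r"(?:[^\r\n<>\w]|_)+")     # keeps all word chars except _

body = "AB_\u00e9/12 \u00e0"            # i.e. AB_é/12 à

strict = ASCII_ONLY.sub("", body)       # underscore, accents, / and space all go
lenient = KEEP_ACCENTS.sub("", body)    # underscore, / and space go, accents stay
print(strict)
print(lenient)
```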

Cheers,

guy038

Terry R

@guy038
GREAT explanation of the use of the “\G”. I knew the “descriptor” for it but was a bit unsure how it might be used in a regex.

I’d happily suggest that explanation (in a more generic form) would make a great FAQ, if only we had one for the individual metacharacters used in regex. Maybe we DO need a “FAQ for Metacharacters”, as you (and others) have provided many examples in the past. I try to refer to them when unsure, but they can be hard to find amongst ALL the posts.

You mention again the use of accented characters. Obviously in the English-speaking world we have little to do with them (in normal life). I would also suggest that the data the OP is using would likely not have them either, so your providing both alternatives gives him maximum options to satisfy his needs.

Cheers
Terry

Scott Sumner

@Terry-R

While waiting for a FAQ on it, you’ll find more fun \G discussion here !