Regex: Delete all signs/operator like (comma) from tags < >

Vasile Caraus

hello. I have to delete all commas from tags:

<!DOCTYPE, html>
<html, xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="ro">
<title>I, love, Myself, Now | Nick, Francisco</title>
<meta, property="fb:admins" content="1446157242"/>
<link, rel="sitemap" type="application/rss+xml" href="rss.xml" /> 
<meta, name="googlebot" content="index,follow"/>
<link, rel="shortcut, icon" href="goiu.ico"/>
<meta, http-equiv="X-UA-Compatible" content="IE=edge"/>

I made a regex, but it does not work well.

FIND: (?-s)(\G(?!^)|<,)((?!/>).)*?\K\s\s+

REPLACE BY: (leave empty)

PeterJones

You have been around long enough to know how to ask a more complete question than that.

What do you think that regex is doing? If you hadn’t been around so long, I’d think you were throwing a fake regex at us to try to convince us that you had put effort into the solution when really you hadn’t – but you’ve been here long enough that I hope you wouldn’t be that rude to us. So, why do you think it would have had any chance of matching your data and doing what you want? I know you said it didn’t work, but you had to have a reason for thinking it would.

I mean, for example, the <, alternation of your regex is obviously never going to match, because you never have < immediately followed by , – so why have it in your regex at all?

And the \K resets the match, so the only thing in your matching expression that would possibly be deleted is 2 or more whitespace characters (assuming the condition before the \K is met). Why do you think that would delete a comma?

Your “spec” is also quite ambiguous. You gave data you have, but didn’t show what you wanted the data to become, and your phrasing of “delete all commas from tags” is not very specific. You didn’t say whether you wanted to keep or get rid of the commas in <title>I, love, Myself, Now | Nick, Francisco</title> – some would say that’s inside the title tag because it’s the content of the title tag, others would say only the stuff inside the angle brackets is inside the title tag. You didn’t say whether the commas inside content="index,follow" should be deleted – some would say that the value of the attribute is inside the meta tag because it’s between the angle brackets, but others would say that because it’s the value of the attribute it should be left alone.

If all you were wanting was getting rid of the obviously invalid-html commas immediately after the tag name, I would use a regex like (?-s)<!?\w+\K, and replace with empty. This disables dot-matches-newline, requires < or <! followed by one or more word characters, and will throw away the comma if the comma comes immediately after that (because the comma is the only thing after the \K), thus yielding:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="ro">
<title>I, love, Myself, Now | Nick, Francisco</title>
<meta property="fb:admins" content="1446157242"/>
<link rel="sitemap" type="application/rss+xml" href="rss.xml" /> 
<meta name="googlebot" content="index,follow"/>
<link rel="shortcut, icon" href="goiu.ico"/>
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
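For readers who want to try the same idea outside Notepad++, here is a minimal Python sketch. It is a translation, not the regex from the post: Python's standard `re` module has no `\K`, so the tag opener is captured and written back instead. The sample lines are copied from the question; the variable names are only illustrative.

```python
import re

# Three sample lines taken from the question.
html = (
    '<!DOCTYPE, html>\n'
    '<meta, name="googlebot" content="index,follow"/>\n'
    '<link, rel="shortcut, icon" href="goiu.ico"/>'
)

# "<" or "<!" plus one or more word characters is captured and restored;
# only the comma that immediately follows the tag name is dropped.
fixed = re.sub(r'(<!?\w+),', r'\1', html)
print(fixed)
```

The commas inside the attribute values (`index,follow` and `shortcut, icon`) are left alone, because they are not directly preceded by a tag name.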

But if some of the other commas in your examples should also be deleted, or you had a case like <meta name="googlebot", content="xxx"> that you wanted to convert to <meta name="googlebot" content="xxx"> (ie, delete the comma between the attributes), but didn’t bother showing us this completely different condition, then the regex will be very different.

So, to sum up, please explain two things:

Why did you think the regex you gave us would have a chance of working?
And what is the real definition (“spec”) of what you want to do? Please describe under what circumstances a comma in your original data should be deleted, and under what circumstances it should be kept. Show data both before and after the transformation that you want. Make sure that the data has enough of your edge cases so that we can tell what happens to commas in various situations.

guy038

Hi, @vasile-caraus, @peterjones and All,

Do you mean that you want to delete any comma character (,), except when it is located between two double quotes ?

If so, use this simple regex S/R :

SEARCH (?-s)(".+?")|,

REPLACE ?1$0

Any shortest "........" zone is simply rewritten, whatever its contents. Else, if a comma is matched, it is deleted because group 1 does not exist ;-))
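The same "rewrite the quoted zone, delete the bare comma" logic can be sketched in Python. The conditional replacement `?1$0` has no direct equivalent in Python's `re`, so a replacement function stands in for it; the sample line and the function name `keep_quoted` are made up for illustration.

```python
import re

# Hypothetical sample line: commas inside quotes, between attributes,
# and in plain text after the tag.
line = '<meta, name="googlebot", content="index,follow"/> a, b'

def keep_quoted(m):
    # Group 1 is set only when a quoted zone matched: return it unchanged.
    # Otherwise the match is a bare comma, which is replaced by nothing.
    return m.group(1) if m.group(1) is not None else ''

result = re.sub(r'(".+?")|,', keep_quoted, line)
print(result)
```

Note that the comma in the trailing text `a, b` is removed as well: this approach protects quoted zones but does not restrict itself to the inside of `<...>`.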

Cheers,

guy038

Vasile Caraus

@guy038 said in Regex: Delete all signs/operator like (comma) from tags < >:

?1$0

hello @guy038 . Yes, ok, but it must select only the commas inside tags <>. Your regex is fine, but it selects all the commas in my documents. I need strictly the ones inside tags <>

Alan Kilborn

I think Peter’s reply to @Vasile-Caraus 's first posting also applies in response to his second:

You have been around long enough to know how to ask a more complete question than that.

But I’m sure Peter’s reply was TL;DR for @Vasile-Caraus

PeterJones

@Vasile-Caraus said in Regex: Delete all signs/operator like (comma) from tags < >:

must select only … I need strictly

If you want the help, put in the effort. I outlined what information we would need from you to be able to help you.

Since we obviously cannot read your mind, then please provide the information. Then again, if what you really want is for us to guess wrong so that you feel justified to complain that we haven’t got the right regex, then you are getting exactly what you want – and being very rude to the volunteers trying to help you in the process.

I’ve already given you the freebie. As far as I can tell, based on my interpretation of your desires, my solution will give you exactly what you want. If it doesn’t, you have to read my entire post, and answer all the questions asked there; otherwise, we cannot help you any better than we already have.

Vasile Caraus

So, somebody else gives me the right answer.

Find what: (?:<\w+|\G)(?:(?!>).)*?\K,
Replace with: LEAVE EMPTY

PeterJones

@Vasile-Caraus said in Regex: Delete all signs/operator like (comma) from tags < >:

somebody else gives me the right answer.

It would’ve been nice if you had linked that “somebody else”'s reply, since it obviously wasn’t in this thread.

Taking that regular expression as “golden”, I can now answer the questions that you refused to answer: you wanted any comma inside the angle brackets <...> to be removed, whether it was inside quotes or not. Since you refused to clarify that, despite multiple requests, we couldn’t read your mind (and we were getting tired of being asked to read your mind and do your work for you without any effort on your part).

Once again, I will say: if you want help on this (or any) forum, you need to show some effort, and answer the questions that are asked of you to clarify your problem.

Alan Kilborn

@PeterJones

I think it’s probably best that in the future @Vasile-Caraus NOT post here, but rather go wherever-else he’s getting the answers he needs.

guy038

Hi, @vasile-caraus, @peterjones, @alan-kilborn, and All,

@vasile-caraus, let me show you how you could have written your initial post, in order to get a quick response from the N++ Community ;-))

Hello, guys,

Let’s assume that I have an initial text, as below :

    <div>
        <font face="arial, verdana, tahoma, trebuchet" size="25">
            <font face="Comic Sans MS" size="2,5">
                <font size="2,5">
                    <span class="1184,6401,5220">Text, with commas, for 1, or several, regex tests</span>
                </font>
            </font>
        </font>
    </div>

I’m trying to get this output text ( Certainly not valid HTML code and just for explanations ! )

    <div>
        <font face="arial verdana tahoma trebuchet" size="25">
            <font face="Comic Sans MS" size="25">
                <font size="25">
                    <span class="118464015220">Text, with commas, for 1, or several, regex tests</span>
                </font>
            </font>
        </font>
    </div>

As you can see, I would like to delete any comma symbol, which is inside any tag zone <........>.

However, commas lying outside <......> zones should not be deleted, as in the range Text, with commas, for 1, or several, regex tests in the 5th line of the output text !

I already tried the regex S/R, below, without complete success, because, sometimes, it just deletes the last comma of the current tag and not all :-((

FIND (<[^>\r\n]+),

REPLACE BY \1
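A short Python sketch of why that attempt removes only the last comma of a tag: the greedy `[^>\r\n]+` runs to the end of the tag and backtracks to the final comma, so each pass deletes one comma per tag. Repeating the replacement until nothing changes is one possible workaround; the sample tag and variable names are illustrative only.

```python
import re

pattern = re.compile(r'(<[^>\r\n]+),')
tag = '<font face="arial, verdana, tahoma" size="2,5">'

# One pass: greedy matching backtracks to the LAST comma in the tag.
once = pattern.sub(r'\1', tag)

# Repeat the pass until subn reports zero substitutions.
current, count = tag, 1
while count:
    current, count = pattern.subn(r'\1', current)

print(once)
print(current)
```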

Could you please, tell me what is the correct regex to process or, at least, point me in the right direction ? Many thanks for your future help !

Best regards,

Vasile Caraus

P.S. :

The most important thing, Vasile, is to show, at the same time, a general view of your initial text AND of the output text that you expect !

Now, your search regex (<\w+|\G)((?!>).)*?\K, does act as expected and leaves you with the output text, above !

I even found an alternate, shorter syntax, which does not need the negative look-ahead (?!>) and uses, instead, a negated character class [^>\r\n]

SEARCH (<|\G)[^>\r\n]+?\K,

REPLACE Leave EMPTY
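Python's standard `re` module supports neither `\G` nor `\K`, so the search above cannot be pasted in as is. A rough equivalent, under the assumption that every tag is closed by `>` on the same line, is to match each whole `<...>` zone and strip the commas inside it with a replacement function. The sample text is shortened from the example above and the function name `strip_commas` is hypothetical.

```python
import re

text = (
    '<font face="arial, verdana, tahoma" size="2,5">'
    'Text, with commas, for 1, or several, regex tests</font>'
)

def strip_commas(m):
    # m.group(0) is one complete <...> zone; remove every comma in it.
    return m.group(0).replace(',', '')

# [^>\r\n]+ keeps each match inside a single tag on a single line.
result = re.sub(r'<[^>\r\n]+>', strip_commas, text)
print(result)
```

Commas in the text between the tags are untouched, since they never fall inside a matched `<...>` zone.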

Cheers,

guy038

Vasile Caraus

@guy038 said in Regex: Delete all signs/operator like (comma) from tags < >:

(<|\G)[^>\r\n]+?\K,

thanks a lot, @guy038