Regex: Delete all signs/operator like (comma) from tags < >
hello. I have to delete all commas from tags:
<!DOCTYPE, html> <html, xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="ro"> <title>I, love, Myself, Now | Nick, Francisco</title> <meta, property="fb:admins" content="1446157242"/> <link, rel="sitemap" type="application/rss+xml" href="rss.xml" /> <meta, name="googlebot" content="index,follow"/> <link, rel="shortcut, icon" href="goiu.ico"/> <meta, http-equiv="X-UA-Compatible" content="IE=edge"/>
I made a regex, but it does not work well.
PeterJones last edited by PeterJones
You have been around long enough to know how to ask a more complete question than that.
What do you think that regex is doing? If you hadn’t been around so long, I’d think you were throwing a fake regex at us to try to convince us that you had put effort into the solution when really you hadn’t – but you’ve been here long enough that I hope you wouldn’t be that rude to us. So, why do you think it would have had any chance of matching your data and doing what you want? I know you said it didn’t work, but you had to have a reason for thinking it would.
I mean, for example, the <, alternation of your regex is obviously never going to match, because you never have < immediately followed by a comma; so why have it in your regex at all?
\K resets the match, so the only thing in your matching expression that would possibly be deleted is 2 or more whitespace characters (assuming the condition before the \K is met). Why do you think that would delete a comma?
Your “spec” is also quite ambiguous. You gave data you have, but didn’t show what you wanted the data to become, and your phrasing of “delete all commas from tags” is not very specific. You didn’t say whether you wanted to keep or get rid of the commas in <title>I, love, Myself, Now | Nick, Francisco</title>; some would say that’s inside the title tag because it’s the content of the title tag, others would say only the stuff inside the angle brackets is inside the title tag. You didn’t say whether the commas inside content="index,follow" should be deleted; some would say that the value of the attribute is inside the meta tag because it’s between the angle brackets, but others would say that because it’s the value of the attribute it should be left alone.
If all you wanted was to get rid of the obviously invalid-HTML commas immediately after the tag name, I would use a regex like (?-s)<!?\w+\K, and replace with empty. This disables dot-matches-newline, requires < optionally followed by !, then one or more word characters, and will throw away the comma if it comes immediately after that (because the comma is the only thing after the \K), thus yielding:
<!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="ro"> <title>I, love, Myself, Now | Nick, Francisco</title> <meta property="fb:admins" content="1446157242"/> <link rel="sitemap" type="application/rss+xml" href="rss.xml" /> <meta name="googlebot" content="index,follow"/> <link rel="shortcut, icon" href="goiu.ico"/> <meta http-equiv="X-UA-Compatible" content="IE=edge"/>
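For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch. Python's built-in re module does not support \K, so the sketch assumes a capture group plus a backreference as a stand-in for it; the helper name strip_comma_after_tag_name is made up for illustration and is not part of the thread.

```python
import re

# Python's re has no \K, so capture "<", the optional "!", and the tag name,
# then put that capture back; only the comma right after it is dropped.
TAG_NAME_COMMA = re.compile(r'(<!?\w+),')

def strip_comma_after_tag_name(text: str) -> str:
    """Remove a comma that directly follows a tag name (hypothetical helper)."""
    return TAG_NAME_COMMA.sub(r'\1', text)

sample = (
    '<!DOCTYPE, html> <title>I, love, Myself</title> '
    '<meta, name="googlebot" content="index,follow"/> '
    '<link, rel="shortcut, icon" href="goiu.ico"/>'
)
print(strip_comma_after_tag_name(sample))
```

Commas in the title text and inside attribute values are left alone, because none of them directly follows "<" plus a tag name.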
But if some of the other commas in your examples should also be deleted, or you had a case like <meta name="googlebot", content="xxx"> that you wanted to convert to <meta name="googlebot" content="xxx"> (i.e., delete the comma between the attributes), but didn’t bother showing us this completely different condition, then the regex will be very different.
So, to sum up, please explain two things:
- Why did you think the regex you gave us would have a chance of working?
- And what is the real definition (“spec”) of what you want to do? Please describe under what circumstances a comma in your original data should be deleted, and under what circumstances it should be kept. Show data both before and after the transformation that you want. Make sure that the data has enough of your edge cases so that we can tell what happens to commas in various situations.
guy038 last edited by
Do you mean that you want to delete any comma character (,), except when it is located between 2 double quotes ?
If so, use this simple regex S/R :
A "........" zone is simply rewritten, whatever its contents. Else, if a comma is matched, it is deleted because group 1 does not exist ;-))
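The regex S/R itself did not survive in this copy of the thread, so the following Python sketch is only a reconstruction of the idea as described: match either a double-quoted zone (captured as group 1 and written back unchanged) or a bare comma (group 1 absent, so nothing is written back). The pattern and the function name delete_commas_outside_quotes are assumptions, not the original expression.

```python
import re

# Alternative 1: a whole "..." zone, captured as group 1 and kept as-is.
# Alternative 2: a lone comma; group 1 is None, so the replacement is empty.
QUOTED_OR_COMMA = re.compile(r'("[^"]*")|,')

def delete_commas_outside_quotes(text: str) -> str:
    """Delete every comma that is not between two double quotes (sketch)."""
    return QUOTED_OR_COMMA.sub(lambda m: m.group(1) or '', text)

print(delete_commas_outside_quotes('<meta, name="googlebot" content="index,follow"/>'))
print(delete_commas_outside_quotes('<title>I, love, Myself</title>'))
```

Note that the second call also strips the commas from the title text, since they are outside any quotes; this approach does not look at the angle brackets at all.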
Vasile Caraus last edited by Vasile Caraus
Hello @guy038. Yes, OK, but it must select only the commas inside tags < >. Your regex is fine, but it selects all the commas in my documents. I need strictly the ones inside tags < >.
PeterJones last edited by
must select only … I need strictly
If you want the help, put in the effort. I outlined what information we would need from you to be able to help you.
Since we obviously cannot read your mind, then please provide the information. Then again, if what you really want is for us to guess wrong so that you feel justified to complain that we haven’t got the right regex, then you are getting exactly what you want – and being very rude to the volunteers trying to help you in the process.
I’ve already given you the freebie. As far as I can tell, based on my interpretation of your desires, my solution will give you exactly what you want. If it doesn’t, you have to read my entire post, and answer all the questions asked there; otherwise, we cannot help you any better than we already have.
So, somebody else gives me the right answer.
PeterJones last edited by
somebody else gives me the right answer.
It would’ve been nice if you had linked that “somebody else”'s reply, since it obviously wasn’t in this thread.
Taking that regular expression as “golden”, I can now answer the questions that you refused to answer: you wanted any comma inside the angle brackets <...> to be removed, whether it was inside the quotes or not. Since you refused to clarify that, despite multiple requests, we couldn’t read your mind (and we were getting tired of being asked to read your mind and do your work for you without any effort on your part).
Once again, I will say: if you want help on this (or any) forum, you need to show some effort, and answer the questions that are asked of you to clarify your problem.
guy038 last edited by guy038
@vasile-caraus, let me show you how you could have written your initial post, in order to get a quick response from the N++ Community ;-))
Let’s assume that I have an initial text, as below :
<div> <font face="arial, verdana, tahoma, trebuchet" size="25"> <font face="Comic Sans MS" size="2,5"> <font size="2,5"> <span class="1184,6401,5220">Text2, with commas, for 1, or several, regex tests</span> </font> </font> </font> </div>
I’m trying to get this output text ( Certainly not valid HTML code and just for explanation ! )
<div> <font face="arial verdana tahoma trebuchet" size="25"> <font face="Comic Sans MS" size="25"> <font size="25"> <span class="118464015220">Text, with commas, for 1, or several, regex tests</span> </font> </font> </font> </div>
As you can see, I would like to delete any comma symbol which is inside any tag zone.
However, commas lying outside <......> zones should not be removed, as in the range Text, with commas, for 1, or several, regex tests in the 5th line of the output text !
I already tried the regex S/R, below, without complete success, because, sometimes, it just deletes the last comma of the current tag and not all :-((
Could you please, tell me what is the correct regex to process or, at least, point me in the right direction ? Many thanks for your future help !
The most important thing, Vasile, is to show, at the same time, a general view of your initial text AND of the output text that you expect !
Now, your search regex (<\w+|\G)((?!>).)*?\K, does act as expected and leaves you with the output text, above !
I even found an alternate and shorter syntax, which does not need the negative look-ahead (?!>) and uses, instead, a negated character class.
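For comparison, the same goal (delete every comma between < and >, leave the rest alone) can be sketched in Python. Python's built-in re module supports neither \G nor \K, so this sketch swaps in a different technique: it matches each whole <...> zone with a negated character class and strips the commas from it in a callback. The name delete_commas_in_tags is hypothetical, and the sketch assumes that no attribute value contains a literal < or >.

```python
import re

# One tag zone: "<", then any run of characters that are neither "<" nor ">",
# then ">". Assumes attribute values never contain angle brackets.
TAG_ZONE = re.compile(r'<[^<>]*>')

def delete_commas_in_tags(text: str) -> str:
    """Delete commas inside <...> zones only; text between tags is untouched."""
    return TAG_ZONE.sub(lambda m: m.group(0).replace(',', ''), text)

sample = '<font face="arial, verdana" size="2,5">Text, with commas</font>'
print(delete_commas_in_tags(sample))
```

Because the callback works on one complete tag at a time, all commas in a tag are removed in a single pass, which avoids the "only the last comma is deleted" problem mentioned above.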
thanks a lot, @guy038