Help with regular expression
-
Could you please help me with the following search-and-replace problem I am having?
I need to mass replace a bunch of text in thousands of files to fix some file paths.
Here is the data I currently have (“before” data):
"id" : "file.name.latest:/Custom/path/file.exe" "uid" : "file.name.latest:/Custom/path/file.exe" "sid" : "file.name.latest:/Custom/path/file.exe"
Here is how I would like that data to look (“after” data):
"id" : "Custom/path/file.exe" "uid" : "Custom/path/file.exe" "sid" : "Custom/path/file.exe"
To accomplish this, I have tried using the following Find/Replace expressions and settings
Find What =
(?<=")[^"]+\.latest:(?=/)
Replace With = (empty)
Search Mode = REGULAR EXPRESSION
Dot Matches Newline = Yes
This expression was given to me by a user here, and I have no understanding of how regular expressions work, even after hours of trying to read the documentation. The simple problem is that this expression does not get rid of the / after latest: and I am unable to fix it, because adding / breaks the expression.
To recap the before data is:
"id" : "file.name.latest:/Custom/path/file.exe" "uid" : "file.name.latest:/Custom/path/file.exe" "sid" : "file.name.latest:/Custom/path/file.exe"
The result with the above expression is:
"id" : "/Custom/path/file.exe" "uid" : "/Custom/path/file.exe" "sid" : "/Custom/path/file.exe"
And I need it to be:
"id" : "Custom/path/file.exe" "uid" : "Custom/path/file.exe" "sid" : "Custom/path/file.exe"
It seems simple but apparently I am too dumb with this stuff. Can anyone help me with this please?
—
moderator added code markdown around text; please don’t forget to use the </> button to mark example text as “code” so that characters don’t get changed by the forum
-
@Ben-1 said in Help with regular expression:
Can anyone help me with this please?
Yes, we can help, but first: why was the account associated with your previous (same) question deleted?
And secondly, in that previous post there was a request for you to enter examples inside a code box (link to template for search/replace questions). If you do that, you will get much better responses. As was stated then, it’s very hard to help if you don’t tell us exactly what you need, with before and after examples. It would seem that this request, whilst very similar to the previous one, has a slight difference.
Terry
-
I am unable to post in that thread because it keeps saying “I need 1 reputation to post links” when I am not posting links. I tried deleting account and it’s still happening. I googled and found others having same issues with this forum. This is a test post after reading other users with same issue…
EDIT: Okay, if I reply to myself, it works. If I reply to you or anyone else in threads, I cannot post. Also… I used the template as requested in that thread? I copied/pasted that template here. I am completely unaware of what more you want me to do. I’m not trying to be difficult, but I am a novice user when it comes to Notepad++ and have never used this forum interface in my life. I feel like my examples are very clear: I gave two of them, showed the data I needed edited, what the current expression does to it, and what I need it to do.
-
@Ben-1
I’ve given you a rating of 1 (vote). See if that helps you. Yes we’ve had an unfortunate situation where posting for new posters (such as yourself) has hit a stumbling block.
Terry
PS: Well, actually, you didn’t use the template correctly. If you had, the examples would have appeared inside a code block. You will be able to see this in other posts, such as this one.
The easiest way of entering examples in a code block is to insert the examples in the post, then select the examples as one selection and hit the </> icon just above the posting window. Then submit the post.
-
@Ben-1 said in Help with regular expression:
I am unable to post in that thread because it keeps saying “I need 1 reputation to post links” when I am not posting links. I tried deleting account and it’s still happening. I googled and found others having same issues with this forum. Testing something.
When you hit Quote to quote the text in your reply, it inserts
@someone said in [Discussion Name Here](/post/#####):
You could have just deleted that portion, and it would have let you post. But since @Terry-R has upvoted you, you will be able to reply now, even without deleting that portion.
I am honestly completely clueless what more I could do to more clearly describe my problem to you…
When you type or paste in your example text, highlight it and hit the </> button on that toolbar. That will format your text in the boxes, like your post now shows. (I get tired of all these questions devolving into convincing people to actually read the FAQs that I spend hours developing to avoid such problems, and just ended up using my moderator power to do it for you; but if you had read the FAQs, which I linked you to before, you could have saved everyone, including yourself, a lot of extra back-and-forth, post editing, and wasted time and effort.)
The Template, which you claim to have used, had sections that had a line with ``` at the beginning and end. Assuming you really copy/pasted from that template, you then deleted those lines, which made it completely pointless for you to have used the template.
This expression was given to me by a user here, and I have no understanding of how regular expressions work, even after hours of trying to read the documentation.
(?<=")[^"]+\.latest:(?=/)
- (?<=") = the character " must come before the match (so it won’t be replaced) (see the manual on “lookbehind”)
- [^"] = matches any character except for a quote mark (“negative character class”)
- + = match one or more characters that match the previous rule (hence “one or more characters that are not a quote character”)
- \. = match a literal period (“escape” a special character, because by default, . has special meaning to a regex)
- latest: = literal text to match
- (?=/) = the match must be followed by a / character (“lookahead”)
Once you know that, you can tell that the culprit is that you wanted the / to be part of the match to be replaced, instead of coming immediately after the match. Thus, convert (?=/) to / in the regex:
(?<=")[^"]+\.latest:/
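To see the difference between the two patterns outside Notepad++, here is a minimal sketch using Python's `re` module. Python's engine is not the one Notepad++ uses, so treat this only as an illustration of lookahead versus a consumed character; the sample line is taken from the question, and the variable names are made up for the example.

```python
import re

before = '"id" : "file.name.latest:/Custom/path/file.exe"'

# Original pattern: the lookahead (?=/) only checks that a slash follows,
# so the slash is not part of the match and survives the replacement.
with_lookahead = re.compile(r'(?<=")[^"]+\.latest:(?=/)')

# Corrected pattern: the slash is matched (consumed), so it is removed too.
with_slash = re.compile(r'(?<=")[^"]+\.latest:/')

kept_slash = with_lookahead.sub('', before)
dropped_slash = with_slash.sub('', before)

print(kept_slash)     # "id" : "/Custom/path/file.exe"
print(dropped_slash)  # "id" : "Custom/path/file.exe"
```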
-
Thanks for fixing the post and the explanation. I read it, then read it again, then read it again, and sadly it still just goes right over my head. I am pretty sure I am dyslexic when it comes to this stuff and had to drop out of basic coding in college because I couldn’t grasp that either.
The good news is… the new expression you gave me works. Thanks :)
-
@Ben-1 said in Help with regular expression:
I am pretty sure I am dyslexic when it comes to this stuff
Please don’t. You are doing both yourself and those with true dyslexia a gross disservice: dyslexia is real, not just an excuse for not learning.
If you truly believe you have dyslexia, seek out an expert, because there are ways that they can help you.
If you are just saying you “cannot learn it” because it’s hard, that’s not the same as dyslexia. And it’s just a lie that the culture has foisted upon you. Stop believing that lie, and stop perpetuating that lie as an excuse for not putting in the effort needed.
As much as it will pain you to hear this, “hours” might not be enough for you to make it past the first step. Why did teachers assign you hundreds or thousands of simple arithmetic questions for homework in early gradeschool? Was it because they wanted to torture you? No, it’s because some things in life take effort and repetition.
The search-and-replace question you asked was a lot more complex than 1+1=2, so you shouldn’t expect yourself to become fluent in it instantly. Instead, start with the simple things, try them out on multiple problems until you actually understand them, and only then move on to the next thing. The best place to start in regex is with the simple rules: most text in a regex will match literal text; the special character . will match any single character; the special sequence .+ will match one or more characters; and the special sequence .* will match zero or more characters. With that, you can get the rudimentary effect of “I want to search for something, even when there are a few characters I don’t know”. That tells you that .*latest:/ will match zero or more characters, followed by the literal string latest:/ … Then, after you’ve figured out how useful those dot-sequences can be, you can start applying the more complicated stuff.
-
Thanks.
(?<=")[^"]+\.latest:/
This expression is working great with what I need it for, however I have a few very large text files where it is producing the error “The complexity of matching the regular expression exceeded predefined bounds. This exception is thrown to prevent “eternal” matches that take an indefinite period time to locate.”
This may just be that notepad++ is thinking the file is taking too long to process because it’s such a large text file. Newline is checked, so I don’t think it should be throwing this error.
Is there anything I can try? Thanks again.
-
@Ben-1,
[^"] means “any character that is not a quote”. Newlines are characters that are not quotes. If you really want “any character that is not a quote or newline”, use [^"\r\n]. Also, change the + (which means “one or more of the previous”) to +? (which means “one or more of the previous, but don’t be greedy about it”) so that it’s trying to match less at the same time.
-
@PeterJones said in Help with regular expression:
@Ben-1,
[^"] means “any character that is not a quote”. Newlines are characters that are not quotes. If you really want “any character that is not a quote or newline”, use [^"\r\n]. Also, change the + (which means “one or more of the previous”) to +? (which means “one or more of the previous, but don’t be greedy about it”) so that it’s trying to match less at the same time.
Same results with this, sadly:
(?<=")[^"\r\n]+?\.latest:/
The file in question has 5.5 million lines of text in it (296MB), maybe it’s just too much.
-
@Ben-1 said in Help with regular expression:
The file in question has 5.5 million lines of text in it (296MB), maybe it’s just too much.
I have just come up with an alternative regular expression (regex) which successfully processed 7.5 million lines (all based on your examples).
It is
Find What: "[^:]+:/
Replace With: "
Before running it, I would suggest counting the lines with a :/ in them. This is just to confirm that the regex won’t incorrectly edit some of your lines. To count, use the Find function, type :/ into the Find What field, and click on Count. The number found has to equal the number of lines, as seen at the bottom of the Notepad++ window. This gives us a high level of certainty that all lines contain those characters.
Possibly use my regex on a copy of your (large) file(s) in case you are uncertain.
Terry
PS the processing did take about 10 minutes, but no error such as you got!
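The count-then-replace check described above can be sketched in Python as well. This is only an illustration under assumptions: it uses Python's `re` rather than Notepad++'s engine, it assumes each entry sits on its own line (which the line-count check implies), and the names `sample`, `safe_to_run`, and `result` are invented for the example.

```python
import re

# Three entries from the question, assumed to be one per line.
sample = (
    '"id" : "file.name.latest:/Custom/path/file.exe"\n'
    '"uid" : "file.name.latest:/Custom/path/file.exe"\n'
    '"sid" : "file.name.latest:/Custom/path/file.exe"\n'
)

# Sanity check: the number of ":/" occurrences should equal the line count.
lines = sample.splitlines()
marker_count = sample.count(':/')
safe_to_run = marker_count == len(lines)

# The alternative expression: a quote, then non-colon characters, then ":/".
pattern = re.compile(r'"[^:]+:/')
result, replacements = pattern.subn('"', sample)

print(safe_to_run, replacements)
print(result)
```

If the two counts differ, some lines do not fit the simple format, and the replacement deserves a closer look before it is run on real data.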
-
@Ben-1 said in Help with regular expression:
(?<=")[^"\r\n]+?\.latest:/
The file in question has 5.5 million lines of text in it (296MB), maybe it’s just too much.
I thought there might be too many quotes… try changing to
(?<=: ")[^"\r\n]+?\.latest:/
… that should only match things that come after the three characters : " (colon, space, quote), which should help narrow down the range of matches.
I just pasted
"id" : "file.name.latest:/Custom/path/file.exe" "uid" : "file.name.latest:/Custom/path/file.exe" "sid" : "file.name.latest:/Custom/path/file.exe"
4 million times. Well, really, select-all, copy, paste, so it doubles every time, so it’s really 3×2ⁿ = 3×2²² = 3×4.194M = 12.582M lines, so there are 625MB. It does not claim it’s invalid, and while it takes a long time, it seemed to be working.
I first tried with the original, but accidentally had a space before the (?<=") and it gave me the error you showed when I tried to FIND; when I deleted that space and did just a FIND, it had no problem finding the next few occurrences. So it’s not giving the error. Though when I did the Replace All, it’s taken at least 5 minutes without any apparent progress… Notepad++ does have performance issues with files that are more than a hundred megabytes.
Unfortunately, after another 5 minutes while I wrote up my reply, it still hasn’t finished for me, so I’m cancelling it. Oh, no, just as I wrote that, it finished. The search with
(?<=")[^"\r\n]+?\.latest:/
worked for me with 625MB.
But maybe your machine has different memory limitations than mine does.
Try seeing if the more-restrictive
(?<=: ")[^"\r\n]+?\.latest:/
works for you. If you have a way to break up the 300MB file into smaller chunks, I might recommend trying that, too.
And Terry posted this while I was writing mine up:
Find What:
"[^:]+:/
I’d recommend a modified
"[^:"\r\n]+?"
Otherwise it might still hit the problem of trying to wrap lines and being too greedy, if you have some data in the middle that doesn’t fit your simple format.
-
This is strange. None of those work for these larger files. Sadly, these files contain confidential information, otherwise I’d just upload one. I tried restarting the computer just to make sure it wasn’t some odd memory issue, but this system has 64GB of RAM on a 64-bit OS, so it shouldn’t be a memory limitation.
There are multiple "s in every single line, and it seems to break after millions of these entries:
“BFCE9424314C03B3E”,
“B9603434384AC353E”,
“B00204343730E233E”,
{
“t” : “13.68”,
“v” : “-0.7822026”,
“c” : “3”,
“i” : “-0.7824571”,
“o” : “-0.7819693”
},
I am trying to wrap my head around a better way to detect what I need to replace/delete without simply looking for ". The text before what I need to replace always contains the following:
" : "
So I need to delete any text after that (not including it), up to and including this text:
.latest:/
That might reduce the amount of things it needs to look for?
-
@Ben-1
With your latest “examples”, it seems that your file isn’t just lines of what you originally wanted edited; rather, it contains a mixture of other types of lines. Is that correct?
If so, then another method might be to:
- number all the lines
- mark all the lines which contain the text to be edited
- cut these lines and paste into a blank tab
- edit these lines in the new tab
- paste the edited lines back
- re-sort the lines so they are back in line order
- remove the line numbers that were added in step #1
Terry
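The idea behind those steps, isolating the lines that contain the text and editing only those, can also be sketched as a line-by-line pass in Python. This is a rough illustration, not a Notepad++ feature: `fix_lines` and `MARKER` are hypothetical names, the pattern is the more-restrictive expression suggested earlier in the thread, and the sample lines are modelled on the posted data with plain ASCII quotes assumed.

```python
import re
from typing import Iterable, Iterator

MARKER = '.latest:/'
# Only text that follows `: "` on the same line is eligible for removal.
PATTERN = re.compile(r'(?<=: ")[^"\r\n]+?\.latest:/')

def fix_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line, rewriting only the ones that contain the marker."""
    for line in lines:
        if MARKER in line:
            yield PATTERN.sub('', line)
        else:
            # Lines without the marker never reach the regex engine.
            yield line

mixed = [
    '"B00204343730E233E",',
    '"t" : "13.68",',
    '"id" : "file.name.latest:/Custom/path/file.exe"',
]
fixed = list(fix_lines(mixed))
print(fixed)
```

Because each line is handled on its own, the regex never has to scan across millions of unrelated lines in one go.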
-
@Terry-R
It would take thousands of hours with the amount of data and files there are. I think I am just going to tell this client I cannot figure out how to achieve what he wants and not bill him. I’ve already put hours of my personal time into trying to figure it out, and I thought it’d just be a pretty simple Notepad++ mass edit.
For some reason the problem has gotten even worse: now the original expression is producing the error on files as small as 80MB, files that successfully processed yesterday with this exact expression.
(?<=")[^"]+\.latest:/
I do appreciate the help given though.
-
@Ben-1 I can’t promise this will work, but I would try:
Find what :
^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/
Replace with: $1
-
@Coises said in Help with regular expression:
@Ben-1 I can’t promise this will work, but I would try:
Find what :
^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/
Replace with:$1
Thanks, it produces the result:
Can’t find the text "^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/"
-
FWIW no regex-based solution to this problem really addresses the core issue here, which is that you are trying to use regular expressions to parse a non-regular language, namely JSON. Regex are awesome, but they are not the right tool for every job.
I didn’t post earlier here because the examples you gave earlier looked like they might not be syntactically valid JSON, and therefore would probably be impossible for most JSON libraries to work with.
For example,
"a": "foo" "b": "bar"
is not valid JSON, but
{ "a": "foo", "b": "bar" }
is valid JSON even though it looks similar.
And guess what?
{ “a”: ”foo”, ”b”: ”bar” }
is not valid JSON, because the curly quote character ” is not the same as the ASCII double quote character ".
Hopefully this illustrates the importance of using the code boxes when posting in the forum!
While we’re at it, this also illustrates another important idea, which is that you should assume that every single character is relevant when working with computer data. For example, you may not see the point of the commas after the colon-separated strings, but most JSON parsers are very opinionated and they will refuse to parse your input if they’re missing.
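The three examples above can be checked with Python's standard `json` module. This is a small sketch; `is_valid_json` is a helper name made up for the illustration, and the curly quotes are written as `\u201c` / `\u201d` escapes so they cannot be altered by accident.

```python
import json

candidates = {
    'no braces or commas': '"a": "foo" "b": "bar"',
    'braces and commas': '{ "a": "foo", "b": "bar" }',
    # Same text as the curly-quote example, spelled with explicit escapes.
    'curly quotes': '{ \u201ca\u201d: \u201dfoo\u201d, \u201db\u201d: \u201dbar\u201d }',
}

def is_valid_json(text: str) -> bool:
    """Return True if the whole string parses as a single JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

validity = {name: is_valid_json(text) for name, text in candidates.items()}
print(validity)
```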
Sadly, the JSON plugins for NPP are largely useless here because you are trying to work with a large number of files.
I would strongly recommend giving up on regular expressions for this task, and instead learn Python and try using the json module for this task.
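As a rough idea of what that could look like, here is a minimal sketch. It assumes the files really are valid JSON, which this thread has not established, and the document shape, the `strip_prefix` helper, and the `PREFIX` pattern are all hypothetical.

```python
import json
import re

# Everything up to and including ".latest:/" at the start of a string value.
PREFIX = re.compile(r'^.*?\.latest:/')

def strip_prefix(value):
    """Recursively remove the prefix from every string value in parsed JSON."""
    if isinstance(value, str):
        return PREFIX.sub('', value)
    if isinstance(value, list):
        return [strip_prefix(item) for item in value]
    if isinstance(value, dict):
        return {key: strip_prefix(item) for key, item in value.items()}
    return value  # numbers, booleans, and null pass through unchanged

# Hypothetical document, loosely modelled on the data shown in this thread.
document = json.loads(
    '{"id": "file.name.latest:/Custom/path/file.exe", "values": [{"t": "13.68"}]}'
)
cleaned = strip_prefix(document)
print(json.dumps(cleaned, indent=2))
```

Working on parsed values means quotes, commas, and nesting are handled by the JSON parser rather than by the regex, and only the strings themselves are touched.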
If you choose to use Python, this forum is not the place to ask for help. Go to a general programming forum instead.
-
@Ben-1 said in Help with regular expression:
@Coises said in Help with regular expression:
@Ben-1 I can’t promise this will work, but I would try:
Find what :
^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/
Replace with:$1
Thanks, it produces the result:
Can’t find the text "^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/"
Hmmm… is your real data a little more complex than your examples? Perhaps you have something like:
"id" : "http://file.name.latest:/Custom/path/file.exe" "uid" : "ftp://file.name.latest:/Custom/path/file.exe" "sid" : "https://file.name.latest:/Custom/path/file.exe"
which includes a colon inside the second set of quotes, before the colon in the .latest:/ text? If so, then indeed, my expression would not match; fixing that would require knowing more about the exact patterns in your real data.
-
@Ben-1 said in Help with regular expression:
It would take thousands of hours with the amount of data and files there are. I think I am just going to tell this client I cannot figure out how to achieve what he wants and not bill him. I’ve already put hours of my personal time trying to figure it out and I thought it’d just be a pretty simple notepad++ mass edit.
Wait, what?! If we had given you a working regex, you would have billed a client and gotten paid for the work we did? Yeesh.