Help with regular expression

Terry R

@Ben-1

I’ve given you an upvote, so your reputation is now 1. See if that helps you. Yes, we’ve had an unfortunate situation where posting for new posters (such as yourself) has hit a stumbling block.

Terry

PS, well, actually you didn’t use the template correctly. If you had, the examples would have appeared inside a code block. You will be able to see this in other posts, such as this one.

Easiest way of entering examples in a code block is to insert the examples in the post, then select the examples as one selection and hit the </> icon just above the posting window. Then submit the post.

PeterJones

@Ben-1 said in Help with regular expression:

I am unable to post in that thread because it keeps saying “I need 1 reputation to post links” when I am not posting links. I tried deleting account and it’s still happening. I googled and found others having same issues with this forum. Testing something.

When you hit Quote to quote the text in your reply, it says @someone said in [Discussion Name Here](/post/#####): . You could have just deleted that portion, and it will let you post. But since @Terry-R has upvoted you, you will be able to reply now, even without deleting that portion.

I am honestly completely clueless what more I could do to more clearly describe my problem to you…

When you type/paste in your example text, highlight it, and hit the </> button on that toolbar.

That will format your text in the boxes, like your post now shows (I get tired of all these questions devolving into convincing people to actually read the FAQs that I spend hours developing to avoid such problems, and just ended up using my moderator power to do it for you; but if you had tried to read the FAQs (which I linked you to before), you could have saved everyone – including yourself – a lot of extra back-and-forth and post editing and wasted time and effort).

The Template, which you claim to have used, had sections that had a line with ``` at the beginning and end. Assuming you really copy/pasted from that template, then you deleted those lines, which made it completely pointless for you to have used the template.

This expression was given to me by a user here and I have no understanding of how regular expressions work even after hours of trying to read the documentation.

(?<=")[^"]+\.latest:(?=/)

(?<=") = the character " must come before the match (so it won’t be replaced) (see the manual on “lookbehind”)
[^"] = matches any character except for a quote mark (“negative character class”)
+ = match one or more characters that match the previous rule (hence “one or more characters that are not a quote character”)
\. = match a literal period (“escape” a special character, because by default, . has special meaning to a regex)
latest: = literal text to match
(?=/) = the match must be followed by a / character (“lookahead”)

Once you know that, then you can tell that the culprit is that you wanted the / to be part of the match to be replaced, instead of to come immediately after the match. Thus, convert (?=/) to / in the regex: (?<=")[^"]+\.latest:/
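For readers who want to try the difference outside Notepad++, here is a minimal sketch in Python. It assumes an empty replacement and borrows a sample line of the kind posted elsewhere in this thread; Python's `re` module is not the engine Notepad++ uses, so treat it as an illustration of the lookahead, not as a reproduction of Notepad++ behaviour.

```python
import re

# Sample line in the format used elsewhere in this thread.
line = '"id" : "file.name.latest:/Custom/path/file.exe"'

# Original regex: the slash is only a lookahead, so it is NOT part of the
# match and survives the replacement.
with_lookahead = re.sub(r'(?<=")[^"]+\.latest:(?=/)', '', line)

# Modified regex: the slash is part of the match, so it is removed too.
with_slash = re.sub(r'(?<=")[^"]+\.latest:/', '', line)

print(with_lookahead)  # "id" : "/Custom/path/file.exe"
print(with_slash)      # "id" : "Custom/path/file.exe"
```

In both cases the opening quote is kept, because the lookbehind only checks that it is there without consuming it.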


A Former User

@PeterJones said in Help with regular expression:

(?<=")[^"]+\.latest:(?=/)

(?<=") = the character " must come before the match (so it won’t be replaced) (see the manual on “lookbehind”)

[^"] = matches any character except for a quote mark (“negative character class”)

+ = match one or more characters that match the previous rule (hence “one or more characters that are not a quote character”)

\. = match a literal period (“escape” a special character, because by default, . has special meaning to a regex)

latest: = literal text to match

(?=/) = the match must be followed by a / character (“lookahead”)

Once you know that, then you can tell that the culprit is that you wanted the / to be part of the match to be replaced, instead of to come immediately after the match. Thus, convert (?=/) to / in the regex: (?<=")[^"]+\.latest:/

----

Thanks for fixing the post and the explanation. I read it, then read it again, then read it again, and sadly it still just goes right over my head. I am pretty sure I am dyslexic when it comes to this stuff and had to drop out of basic coding in college because I couldn’t grasp that either.

The good news is… the new expression you gave me works. Thanks :)

PeterJones

@Ben-1 said in Help with regular expression:

I am pretty sure I am dyslexic when it comes to this stuff

Please don’t. You are doing both yourself and those with true dyslexia a gross disservice: dyslexia is real, not just an excuse for not learning.

If you truly believe you have dyslexia, seek out an expert, because there are ways that they can help you.

If you are just saying you “cannot learn it” because it’s hard, that’s not the same as dyslexia. And it’s just a lie that the culture has foisted upon you. Stop believing that lie, and stop perpetuating that lie as an excuse for not putting in the effort needed.

As much as it will pain you to hear this, “hours” might not be enough for you to make it past the first step. Why did teachers assign you hundreds or thousands of simple arithmetic questions for homework in early gradeschool? Was it because they wanted to torture you? No, it’s because some things in life take effort and repetition.

The search-and-replace question you asked was a lot more complex than 1+1=2, so you shouldn’t expect yourself to become fluent in it instantly. However, if you start with the simple things, try them out on multiple problems until you actually understand them, and only then move on to the next thing, you can learn it. The best place to start in regex is with the simple rules: most text in a regex will match literal text; the special character . will match any single character; the special sequence .+ will match one or more characters; and the special sequence .* will match zero or more characters. With that, you can get the rudimentary effect of “I want to search for something, even when there are a few characters I don’t know”. That tells you that .*latest:/ will match zero or more characters, followed by the literal string latest:/ … Then, after you’ve figured out how useful those dot-sequences can be, you can start applying the more complicated stuff.
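A small experiment along those lines, sketched in Python rather than in the Notepad++ Find dialog (the sample text is made up for illustration, and by default `.` in Python does not match a newline):

```python
import re

text = 'file.name.latest:/Custom/path/file.exe'

# Plain text in a regex matches itself.
literal = re.search(r'latest:/', text).group()

# "." stands in for exactly one character (here the "a" in "latest").
one_char = re.search(r'l.test', text).group()

# ".*" is zero or more characters, followed by the literal "latest:/".
zero_or_more = re.search(r'.*latest:/', text).group()

# ".+" needs at least one character, so it fails when nothing comes first,
# while ".*" is happy with nothing at all.
plus_needs_one = re.search(r'.+latest:/', 'latest:/x')
star_allows_none = re.search(r'.*latest:/', 'latest:/x').group()

print(literal, one_char, zero_or_more, plus_needs_one, star_allows_none)
```

Changing one piece at a time and rerunning is a quick way to see what each rule contributes.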

A Former User

Thanks.

(?<=")[^"]+\.latest:/

This expression is working great with what I need it for, however I have a few very large text files where it is producing the error “The complexity of matching the regular expression exceeded predefined bounds. This exception is thrown to prevent “eternal” matches that take an indefinite period time to locate.”

This may just be that notepad++ is thinking the file is taking too long to process because it’s such a large text file. Newline is checked, so I don’t think it should be throwing this error.

Is there anything I can try? Thanks again.

PeterJones

@Ben-1 ,

[^"] means “any character that is not a quote”. Newlines are characters that are not quotes. If you really want “any character that is not a quote or newline”, use [^"\r\n]. Also, change the + (which means “one or more of the previous”) to +? (which means “one or more of the previous, but don’t be greedy about it”) so that it’s trying to match less at the same time.
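To see both points in isolation, here is a rough Python sketch (the short sample strings are invented; Python's `re` engine differs from the one in Notepad++, so this only illustrates what the patterns can match, not how fast they run):

```python
import re

text = 'abc\ndef"ghi'

# [^"] happily runs across the line break, because a newline is not a quote.
crosses_lines = re.match(r'[^"]+', text).group()

# Excluding \r and \n keeps the match on one line.
stays_on_line = re.match(r'[^"\r\n]+', text).group()

# The lazy form takes as little as it can get away with.
lazy_minimum = re.match(r'[^"\r\n]+?', text).group()

# In the full expression, greedy and lazy end up with the same match;
# they differ in how the engine gets there.
line = '"id" : "file.name.latest:/Custom/path/file.exe"'
greedy_full = re.search(r'(?<=")[^"\r\n]+\.latest:/', line).group()
lazy_full = re.search(r'(?<=")[^"\r\n]+?\.latest:/', line).group()

print(repr(crosses_lines), repr(stays_on_line), repr(lazy_minimum))
print(greedy_full == lazy_full)
```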

A Former User

@PeterJones said in Help with regular expression:

@Ben-1 ,

[^"] means “any character that is not a quote”. Newlines are characters that are not quotes. If you really want “any character that is not a quote or newline”, use [^"\r\n]. Also, change the + (which means “one or more of the previous”) to +? (which means “one or more of the previous, but don’t be greedy about it”) so that it’s trying to match less at the same time.

Same results with this sadly:

(?<=")[^"\r\n]+?\.latest:/

The file in question has 5.5 million lines of text in it (296MB), maybe it’s just too much.

Terry R

@Ben-1 said in Help with regular expression:

The file in question has 5.5 million lines of text in it (296MB), maybe it’s just too much.

I have just come up with an alternative regular expression (regex) which successfully processed 7.5 million lines (all based on your examples).
It is
Find What:"[^:]+:/
Replace With:"

Before running it I would suggest counting the lines with a :/ in them. This is just to confirm that the regex won’t edit incorrectly some of your lines. To count, use the Find function and type :/ into the Find What field, click on Count. The number found has to equal the number of lines as seen at the bottom of the Notepad++ window. This gives us a high level of certainty that all lines contain those characters.
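The same sanity check can be sketched in Python for anyone who prefers a script to the Count button. The function name `marker_report` and the sample lines are hypothetical; it reports the number of lines, the number of `:/` occurrences, and which line numbers have none.

```python
def marker_report(lines, marker=':/'):
    """Return (total lines, marker occurrences, line numbers lacking the marker)."""
    total = 0
    occurrences = 0
    missing = []
    for number, line in enumerate(lines, start=1):
        total += 1
        hits = line.count(marker)
        occurrences += hits
        if hits == 0:
            missing.append(number)
    return total, occurrences, missing


sample = [
    '"id" : "file.name.latest:/Custom/path/file.exe"',
    '{',
    '"uid" : "file.name.latest:/Custom/path/file.exe"',
]
total, occurrences, missing = marker_report(sample)
print(total, occurrences, missing)  # 3 2 [2]
```

If the two counts differ, or the list of missing lines is not empty, the file contains lines that do not fit the simple format and the replacement deserves a closer look first.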

Possibly use my regex on a copy of your (large) file(s) in case you are uncertain.

Terry

PS the processing did take about 10 minutes, but no error such as you got!

PeterJones

@Ben-1 said in Help with regular expression:

(?<=")[^"\r\n]+?\.latest:/
The file in question has 5.5 million lines of text in it (296MB), maybe it’s just too much.

I thought there might be too many quotes… try changing to

(?<=: ")[^"\r\n]+?\.latest:/

… that should only match things that come after : " , which should help narrow down the range of matches.

I just pasted

"id" : "file.name.latest:/Custom/path/file.exe"
"uid" : "file.name.latest:/Custom/path/file.exe"
"sid" : "file.name.latest:/Custom/path/file.exe"

4 million times – well, really, select-all, copy, paste, so it doubles every time, so it’s really 3×2ⁿ = 3×2²² = 3×4.194M = 12.582M lines, so there are 625MB. It does not claim it’s invalid, and while it takes a long time, it seemed to be working.
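That arithmetic can be checked with a few lines of Python. This sketch assumes 22 doublings of the three sample lines and Windows (CRLF) line endings; the line-ending choice is an assumption, not something stated in the post.

```python
sample_lines = [
    '"id" : "file.name.latest:/Custom/path/file.exe"',
    '"uid" : "file.name.latest:/Custom/path/file.exe"',
    '"sid" : "file.name.latest:/Custom/path/file.exe"',
]

doublings = 22                 # each select-all/copy/paste doubles the text
copies = 2 ** doublings        # number of copies of the 3-line block
total_lines = len(sample_lines) * copies

# +2 bytes per line for CRLF (assumed); the sample text is plain ASCII.
bytes_per_block = sum(len(line) + 2 for line in sample_lines)
total_bytes = bytes_per_block * copies

print(total_lines)             # 12582912
print(total_bytes / 1_000_000) # size in decimal megabytes
```

Under those assumptions the size comes out close to the 625MB figure quoted above.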

I first tried with the original, but accidentally had a space before the (?<="), and it gave me the error you showed when I tried to FIND; when I deleted that space, and did just a FIND, it had no problem finding the next few occurrences. So it’s not giving the error. Though when I did the Replace All, it’s taken at least 5 minutes without any apparent progress… Notepad++ does have performance issues with files that are more than a hundred megabytes.

Unfortunately, after another 5 minutes while I wrote up my reply, it still hasn’t finished for me, so I’m cancelling it – oh, no, just as I wrote that, it finished. The search with (?<=")[^"\r\n]+?\.latest:/ worked for me with 625MB.

But maybe your machine has different memory limitations than mine does.

Try seeing if the more-restrictive (?<=: ")[^"\r\n]+?\.latest:/ works for you. If you have a way to break up the 300MB file into smaller chunks, I might recommend trying that, too.

And Terry posted this while I was writing mine up:

Find What: "[^:]+:/

I’d recommend a modified "[^:"\r\n]+?" , otherwise it might still hit the problem of trying to wrap lines and being too greedy, if you have some data in the middle that doesn’t fit your simple format

A Former User

This is strange. None of those work for these larger files. Sadly, these files contain confidential information, otherwise I’d just upload them. Tried restarting the computer just to make sure it wasn’t some odd memory issue, but this system has 64GB of RAM on a 64-bit OS so it shouldn’t be a memory limitation.

There are multiple "s in every single line, and it seems to break after millions of these entries:
“BFCE9424314C03B3E”,
“B9603434384AC353E”,
“B00204343730E233E”,
{
“t” : “13.68”,
“v” : “-0.7822026”,
“c” : “3”,
“i” : “-0.7824571”,
“o” : “-0.7819693”
},

I am trying to wrap my head around a better way to detect what I need to replace/delete without simply looking for ". The text before what I need to replace always contains the following:

" : "

So any text after (not including it) I need to delete, up to and including this text.

.latest:/

That might reduce the amount of things it needs to look for?
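The rule as described (keep everything through `" : "`, delete from there up to and including `.latest:/`) can also be sketched without any regex at all. The following Python is only an illustration under assumptions: the names `PREFIX`, `MARKER` and `strip_latest_prefix` are made up, the real data is assumed to use straight ASCII quotes, and the extra check for a quote between the two markers is a guess at what "stay inside one value" should mean.

```python
PREFIX = '" : "'        # text that always precedes the part to delete
MARKER = '.latest:/'    # last piece of the part to delete


def strip_latest_prefix(line):
    """Delete the text between PREFIX and the end of MARKER, if both are present."""
    start = line.find(PREFIX)
    if start == -1:
        return line
    value_start = start + len(PREFIX)
    end = line.find(MARKER, value_start)
    if end == -1:
        return line
    # Assumption: never reach across a quote into a different value.
    if '"' in line[value_start:end]:
        return line
    return line[:value_start] + line[end + len(MARKER):]


print(strip_latest_prefix('"id" : "file.name.latest:/Custom/path/file.exe"'))
print(strip_latest_prefix('"t" : "13.68",'))
```

Because it handles one line at a time, a function like this could be applied while reading a large file line by line, so the whole file never has to be searched as a single block of text.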

Terry R

@Ben-1
With your latest “examples” it seems that your file isn’t just lines of what you originally wanted edited, rather it contains a mixture of other types of lines, is that correct?

If so, then another method might be to:

1. number all the lines
2. mark all the lines which contain the text to be edited
3. cut these lines and paste into a blank tab
4. edit these lines in the new tab
5. paste the edited lines back
6. re-sort the lines so they are back in line order
7. remove the line numbers that were added in step #1

Terry

A Former User

@Terry-R
It would take thousands of hours with the amount of data and files there are. I think I am just going to tell this client I cannot figure out how to achieve what he wants and not bill him. I’ve already put hours of my personal time trying to figure it out and I thought it’d just be a pretty simple notepad++ mass edit.

For some reason the problem has gotten even worse, now the original expression is producing the error on files as little as 80MB, files that successfully processed yesterday with this exact expression.

(?<=")[^"]+\.latest:/

I do appreciate the help given though.

Coises

@Ben-1 I can’t promise this will work, but I would try:

Find what : ^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/
Replace with: $1

A Former User

@Coises said in Help with regular expression:

@Ben-1 I can’t promise this will work, but I would try:

Find what : ^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/
Replace with: $1

Thanks, it produces the results
Can’t find the text “^([^:\r\n]++: “)[^:\r\n”]++(?<=.latest):/”

Mark Olson

FWIW no regex-based solution to this problem really addresses the core issue here, which is that you are trying to use regular expressions to parse a non-regular language, namely JSON. Regex are awesome, but they are not the right tool for every job.

I didn’t post earlier here because the examples you gave earlier looked like they might not be syntactically valid JSON, and therefore would probably be impossible for most JSON libraries to work with.

For example

"a": "foo"
"b": "bar"

is not valid JSON
but

{
"a": "foo",
"b": "bar"
}

is valid JSON even though it looks similar.

And guess what?

{
“a”: ”foo”,
”b”: ”bar”
}

is not valid JSON, because the curly quote character ” is not the same as the ASCII double quote character ".

Hopefully this illustrates the importance of using the code boxes when posting in the forum!

While we’re at it, this also illustrates another important idea, which is that you should assume that every single character is relevant when working with computer data. For example, you may not see the point of the commas after the colon-separated strings, but most JSON parsers are very opinionated and they will refuse to parse your input if they’re missing.
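These cases are easy to check with Python's standard `json` module. The helper name `is_valid_json` is made up for this sketch, and the curly quotes are written as `\u201c` / `\u201d` escapes so they cannot be confused with straight quotes.

```python
import json


def is_valid_json(text):
    """Return True if the standard json parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


bare_pairs = '"a": "foo"\n"b": "bar"'             # no surrounding braces
wrapped = '{\n"a": "foo",\n"b": "bar"\n}'         # braces and comma present
curly = '{\n\u201ca\u201d: \u201dfoo\u201d,\n\u201db\u201d: \u201dbar\u201d\n}'
no_comma = '{\n"a": "foo"\n"b": "bar"\n}'         # comma between pairs missing

for name, text in [('bare_pairs', bare_pairs), ('wrapped', wrapped),
                   ('curly', curly), ('no_comma', no_comma)]:
    print(name, is_valid_json(text))
```

Only the `wrapped` example is accepted; the other three are rejected for the reasons given above.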

Sadly, the JSON plugins for NPP are largely useless here because you are trying to work with a large number of files.

I would strongly recommend giving up on regular expressions for this task, and instead learn Python and try using the json module for this task.
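As a rough idea of what that could look like, here is a minimal sketch. It assumes the real files are valid JSON (which the posted examples do not confirm) and that every string containing `.latest:/` should lose everything up to and including that marker; the names `MARKER` and `strip_marker` are hypothetical.

```python
import json

MARKER = '.latest:/'


def strip_marker(value):
    """Walk parsed JSON and cut each string down to the text after MARKER."""
    if isinstance(value, dict):
        return {key: strip_marker(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_marker(item) for item in value]
    if isinstance(value, str) and MARKER in value:
        return value.split(MARKER, 1)[1]
    return value


doc = json.loads('{"id": "file.name.latest:/Custom/path/file.exe", "t": "13.68"}')
cleaned = strip_marker(doc)
print(json.dumps(cleaned))
```

Note that `json.load` reads a whole document into memory, so files of several hundred megabytes would need a correspondingly large amount of RAM with this approach.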

If you choose to use Python, this forum is not the place to ask for help. Go to a general programming forum instead.

Coises

@Ben-1 said in Help with regular expression:

@Coises said in Help with regular expression:

@Ben-1 I can’t promise this will work, but I would try:

Find what : ^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/
Replace with: $1

Thanks, it produces the results
Can’t find the text “^([^:\r\n]++: “)[^:\r\n”]++(?<=.latest):/”

Hmmm… is your real data a little more complex than your examples? Perhaps you have something like:

"id" : "http://file.name.latest:/Custom/path/file.exe"
"uid" : "ftp://file.name.latest:/Custom/path/file.exe"
"sid" : "https://file.name.latest:/Custom/path/file.exe"

which includes a colon inside the second set of quotes before the colon in .latest:/? If so, then indeed, my expression would not match; fixing that would require knowing more about the exact patterns in your real data.
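A quick way to see that effect is the Python sketch below. Two substitutions are made for the sake of a portable example and should be read as assumptions: the possessive `++` quantifiers are replaced with plain greedy `+` (possessive quantifiers need Python 3.11 or later), and `$1` becomes `\1`, which is how Python's `re.sub` spells a group reference. For these two sample lines the plain greedy form gives the same match/no-match result.

```python
import re

# Greedy stand-in for the expression above (see the assumptions in the text).
pattern = re.compile(r'^([^:\r\n]+: ")[^:\r\n"]+(?<=\.latest):/', re.MULTILINE)

plain = '"id" : "file.name.latest:/Custom/path/file.exe"'
with_scheme = '"id" : "http://file.name.latest:/Custom/path/file.exe"'

plain_result = pattern.sub(r'\1', plain)
scheme_result = pattern.sub(r'\1', with_scheme)

print(plain_result)   # "id" : "Custom/path/file.exe"
print(scheme_result)  # unchanged, because of the extra colon in http://
```

The second line is left untouched because `[^:\r\n"]+` stops at the colon in `http:`, and the text before that colon does not end in `.latest`.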

PeterJones

@Ben-1 said in Help with regular expression:

It would take thousands of hours with the amount of data and files there are. I think I am just going to tell this client I cannot figure out how to achieve what he wants and not bill him. I’ve already put hours of my personal time trying to figure it out and I thought it’d just be a pretty simple notepad++ mass edit.

Wait, what?! If we had given you a working regex, you would have billed a client and gotten paid for the work we did? Yeesh.

A Former User

I installed Sublime Text, put in the expression and it works flawlessly with all the files. Out of curiosity, I put in my original expression that I came up with before posting in the original thread and it works flawlessly in Sublime Text as well. So in the end, this is a notepad++ issue.

You are being ridiculous. I work at a computer repair company and a customer wanted help with something. I told him I’d look into it, couldn’t figure it out, so I sought help. Newsflash… that’s what anyone you are paying to do any task is going to do when they hit a roadblock. YOU decide to be a part of an online forum and help people for free. If you don’t enjoy doing that, don’t do it.

Thanks for your efforts and help but seriously the interactions I had on this forum were beyond strange and over controlling. I’m out of here.

PeterJones

@A-Former-User said in Help with regular expression:

the interactions I had on this forum were beyond strange and over controlling

For future readers, to sum up:

This individual came, asking us to do his job for him without telling us that’s what he was doing.
His question wasn’t asked well enough that he got any upvotes, but he got answers which matched our best guess for what he was trying to do. This is trying to be helpful, not controlling.
When he tried to reply, it triggered a problem with the forum software (which is provided to this Community for free, and we have no control over how it’s implemented), so he decided to delete his account because he couldn’t figure out how to delete a link from his reply.
When he asked his question as a new user, it was pointed out that it was a continuation of the original question (to better help people who were answering to have the full context for his question); and I went to the effort to explain to him how he could avoid that problem; and a user upvoted his explanation of why he re-created his account, in order to make it easier for him to post. This was me trying to be helpful, not controlling.
In both the original topic and again in this follow-on topic, when it was obvious that he was having difficulty expressing his question in a way that we could supply him with an answer that would solve his problem, I suggested he read the FAQs, and follow their instructions for how to ask his question better. This was not to “control” him, but rather so that he could better get his ideas across, so that we’d be able to better help him, and to avoid wasting his time (and ours) by continuing to post without formatting the data in a way that it comes across correctly. (And earlier, I had even reformatted a post for him, to help him better communicate his ideas.)
Once he was able to communicate his problem, we were able to give him an expression which worked, both on the example data he gave us, and reportedly on the initial data he was testing it on.
As he started to change the original parameters to the question, we still continued to provide alternate solutions.
When he hit a limitation for the file size, there were still multiple attempts to help him make it work.
But at this point, it was revealed that there were hidden constraints which he hadn’t mentioned to us, which made the data look quite different from the simple examples, which were exacerbating the difficulty with the file size… but attempts were still made to help.
When he gave up, he revealed that he was actually trying to earn money by having us solve this problem for him, and then tried to blame us (me) for being controlling when I highlighted the fact that he was trying to earn money on our free labor.

Please understand: if your job includes using Notepad++, and you have a Notepad++ question, you are still allowed to ask it. But when you make us go through so many iterations, and keep on changing the parameters of the question, then in the end reveal that all of this was solely so that you could hopefully earn some extra cash, it’s bound to raise some hackles. Be honest up front, respond to questions, and show a willingness to learn, and we will bend over backward to try to help. Violate that trust, and we will sour on helping you.

----

Please note: This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.

PeterJones

As a parting shot, the original poster wrote,

I installed Sublime Text, put in the expression and it works flawlessly with all the files. Out of curiosity, I put in my original expression that I came up with before posting in the original thread and it works flawlessly in Sublime Text as well. So in the end, this is a notepad++ issue.

Do any of the regulars have enough experience with Sublime to know how it’s implemented differently from Notepad++, in a way that might allow it to not complain about that regex in a huge file, whereas Notepad++ does complain?

Is Sublime just better at chunking large files so that the memory never blows up?

Or does it use a different regex engine that’s more performant on large files?

Or do we think that the reports of errors were due to user error, not an actual issue with Notepad++?

If we can come up with an easy-to-reproduce large data file, and a simplified regex that works fine on Sublime and works fine on a small dataset in Notepad++, but blows up with a large dataset in Notepad++, we might be able to put in an issue – pointing out that the competition works better in that instance might light a fire under the developers, to see if they can improve that performance, which might in turn improve large-file search/replace performance in general, which would be a boon to all large-file users of Notepad++.