Getting "Invalid Regular Expression" for an extremely simple expression
-
@Alan-Kilborn said in Getting "Invalid Regular Expression" for an extremely simple expression:
My understanding is that PythonScript integrates its own copy of Boost, so, one would think, with all other things being equal (ha!), that it would succeed when N++ succeeds. But clearly something is not equal.
There is a macro variable, BOOST_REGEX_MAX_STATE_COUNT, that influences one of the limits Boost::regex tests when evaluating whether to issue that message. Notepad++ leaves it at its default value, but it is possible that Python changes it.
-
@PeterJones said in Getting "Invalid Regular Expression" for an extremely simple expression:
or (?-m)^.*[^"]*test$ but that just results in “Invalid regular expression.”
The first resulted in Invalid Regular Expression; the second just finds no match, because you’ve told it that ^ should only match the beginning of the file and $ should only match the end of the file, and your file is more than one line long.
Well damn, you’re right. I was sure it gave me a syntax error for both of my examples. I must have gotten myself confused while I was testing.
Now that I see that, from my testing, (?-m) means ^$ should match the beginning and end of the same line (no intervening LF) and (?m) (the default for NP++) means ^$ has to match the beginning and end of any line in the file. So (?-m) absolutely affects the [^"]* portion of the RE.
-
Now that I see that, from my testing, (?-m) means ^$ should match the beginning and end of the same line (no intervening LF) and (?m) (the default for NP++) means ^$ has to match the beginning and end of any line in the file. So (?-m) absolutely affects the [^"]* portion of the RE.
I guess I used loose terminology when I described what (?m)/(?-m) affect. I should have said those options affect the beginning-of-line ^ anchor and the end-of-line $ anchor.
They don’t affect all ^ symbols, because in some locations, like the beginning of a character class, the ^ negates the character class and has nothing to do with the beginning-of-line anchor ^. To clarify, [^"] literally means “the class that contains every character that is not the ASCII double-quote”, and the ^ in that class is the class-negation operator; it is neither the beginning-of-line anchor nor the literal ASCII caret character.
With those definitions, I cannot see how (?m)/(?-m) affect [^"]*. But maybe I’m wrong. Can you share a text file and regex where they change the meaning of the [^"]*? (It would have to be something other than a regex that contains a ^ or $ anchor, because those two anchors are affected by the m-option.)
Further, your statement of what the anchors mean in the non-multiline context (“(?-m) means ^$ should match the beginning and end of the same line (no intervening LF)”) is not phrased in a way that matches my experience and understanding of the specs. But maybe I am not interpreting that phrase in the way you intended.
For this example, I will start with a 3-line file (ie, no empty line 4)
This file has multiple lines in it
If I run the regex (?m)^ and hit Find Next repeatedly, it will match at three locations, because ^ can match any beginning-of-line in that mode. If I run the regex (?-m)^, Find Next will only match the beginning of the first line, not the beginning of lines 2 or 3, because (?-m) restricts ^ to only be the beginning of the string rather than of any line (where, in Notepad++, the string is either the entire file). Similarly, (?m)$ will match the end of lines 1, 2, and 3; whereas (?-m)$ will only match the end of the last line of the file.
Your phrasing indicates to me that you think that the ^ and $ have to be on the same line in (?-m) mode, but my examples show that’s not right – but again, maybe I am misunderstanding your sentence.
Combining the two ideas: the example file has no quote marks, so [^"]* will match all the non-quote characters (here, every character in the file). Thus, (?m)^[^"]*$ will match from the beginning of the file to the end, as will (?-m)^[^"]*$ – the m-state is irrelevant. Then make it non-greedy: (?m)^[^"]*?$ will only match one line at a time, because the $ causes the non-greedy section before it to stop at the first end-of-line found; on the other hand, (?-m)^[^"]*?$ will still match the entire file, because the ^ anchor only matches at one location in the entire file (at the beginning) and the $ anchor only matches at the end. In this non-greedy case, the m-state changes the meaning of the ^ and $ anchors, not the meaning of the [^"]*?.
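The anchor behaviour described in this post can be tried outside Notepad++. What follows is only a minimal sketch using Python's re module rather than Boost, so it is an analogy and not a reproduction of the Notepad++ search: Python does not accept a leading (?-m) switch, so the re.MULTILINE flag stands in for (?m) and its absence for (?-m). The three-line sample text and the helper names anchor_positions and first_span are assumptions made for illustration.

```python
import re

# Hypothetical 3-line sample with no quote marks and no trailing newline.
TEXT = "This file has\nmultiple lines\nin it"


def anchor_positions(pattern, text, multiline):
    """Return the offsets where a zero-width anchor pattern matches."""
    flags = re.MULTILINE if multiline else 0
    return [m.start() for m in re.finditer(pattern, text, flags)]


def first_span(pattern, text, multiline):
    """Return the (start, end) span of the first match, or None."""
    flags = re.MULTILINE if multiline else 0
    m = re.search(pattern, text, flags)
    return m.span() if m else None


# ^ and $ depend on the multiline state.
print(anchor_positions(r"^", TEXT, True))    # every line start
print(anchor_positions(r"^", TEXT, False))   # start of the text only
print(anchor_positions(r"$", TEXT, True))    # every line end
print(anchor_positions(r"$", TEXT, False))   # end of the text only

# Greedy [^"]* spans the whole text in both states.
print(first_span(r'^[^"]*$', TEXT, True), first_span(r'^[^"]*$', TEXT, False))

# Non-greedy [^"]*? stops at the first place $ is allowed to match,
# so only the anchors change meaning, not the character class.
print(first_span(r'^[^"]*?$', TEXT, True), first_span(r'^[^"]*?$', TEXT, False))
```

In both states the character class accepts exactly the same characters, line breaks included; the only thing that moves is where ^ and $ are allowed to match.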
-
Hello, @scott-gartner, @alan-kilborn, @coises, @mkupper, @mark-Olson, @terry-r, @peterjones and All,
As mentioned by @alan-kilborn, I found some spare time to download and test my two files Test_1_OK.txt and Test_2_KO.txt with the GrepWin software.
So, here is, below, the road map for testing.
- In a new folder, put the two files Test_1_OK.txt and Test_2_KO.txt, already tested within Notepad++
- Download, in this folder, the latest portable x64 version from :
https://github.com/stefankueng/grepWin/releases/download/2.1.1/grepWin-x64-2.1.1_portable.zip
- Double-click on file grepWin-x64-2.1.1_portable.zip
- Extract the single file grepWin-x64-2.1.1_portable.exe, in this folder
- Double-click on file grepWin-x64-2.1.1_portable.exe
=> You should get this picture :
- Enter the name of the new folder in the Search in zone
- Select Regex search mode
- Enter ".*employeeId" in the Search for zone
- Check the Treat Files as UTF8 box option
- Enter *.txt ( or more exactly Test_?_??.txt ) in the Find names match zone
- Finally, click on the Search button
After 2 / 3 seconds, you should get this picture :
As you can see :
- It does find one match, regarding the Test_1_OK.txt file
- It finds a Regex stack error, regarding the Test_2_KO.txt file
It is quite obvious that the results are strictly identical to the ones obtained from within N++. Particularly, note that the error message, regarding the Test_2_KO.txt file, is also the same as the one shown in the N++ search dialog, which proves that the error message is a Boost message itself !!
Thus, it seems to me that this bug can rather be considered a Boost Engine bug !
Now, if, at the bottom, we click on the Content button, we get this picture :
Note that it does show that one match has been found in the Test_2_KO.txt file, too !
Finally, the last picture just confirms that I did my tests with the latest GrepWin 2.1.1 release :
Now, should we ask John Maddock about it ? There are probably a lot of other BORDER cases ! It’s a combination of a specific regular expression with specific data. As @coises said :
The message is the result of a heuristic, not a mathematically exact determination. It doesn’t mean the regular expression is technically invalid, it means that, when applied to the data in question, it appears to be very inefficient (possibly — not necessarily — non-terminating).
For these special cases, the best thing to do is, indeed, to refactor the regular expression, so that each part can be considered as unambiguous !!
Best Regards,
guy038
-
-
@guy038 said in Getting "Invalid Regular Expression" for an extremely simple expression:
There are probably a lot of other BORDER cases
If you’re up for some reading about theory, take a look here:
https://swtch.com/~rsc/regexp/
The super-short version of that is that regular expression matching can be very efficient (linear in the length of the text being matched) if you allow only the most basic, original syntax of regular expressions. Once you support things like capture groups, non-greedy repeats and (especially) back references, the time can be at least quadratic (and I think sometimes even worse) in the length of the text to be examined.
It would seem that it should be possible to try a regular expression with an efficient engine first; if it parses, the job is done; if it says the expression isn’t valid within the more limited syntax of the efficient engine, then give it to the potentially slow but more comprehensive engine.
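The two-engine idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not something Notepad++ or Boost does: the names two_tier_search, restricted_search, full_search and UnsupportedSyntax are hypothetical, the syntax check is a crude heuristic, and Python's re stands in for both engines because the standard library has no automaton-based matcher.

```python
import re


class UnsupportedSyntax(Exception):
    """Raised when the restricted engine declines a pattern."""


# Crude, hypothetical detector for features a backtracking-free engine
# typically cannot handle: numbered or named back references and lookaround.
# It may reject some harmless patterns, which only costs a fallback.
_NEEDS_BACKTRACKING = re.compile(r"\\[1-9]|\(\?P=|\(\?[=!]|\(\?<[=!]")


def restricted_search(pattern, text):
    """Placeholder for an efficient engine with a limited syntax."""
    if _NEEDS_BACKTRACKING.search(pattern):
        raise UnsupportedSyntax(pattern)
    # A real implementation would run an automaton here; re is a stand-in.
    return re.search(pattern, text)


def full_search(pattern, text):
    """The comprehensive, potentially slow backtracking engine."""
    return re.search(pattern, text)


def two_tier_search(pattern, text):
    """Try the restricted engine first, fall back to the full one."""
    try:
        return "restricted", restricted_search(pattern, text)
    except UnsupportedSyntax:
        return "full", full_search(pattern, text)


print(two_tier_search(r"l+", "hello")[0])      # handled by the restricted tier
print(two_tier_search(r"(\w)\1", "hello")[0])  # back reference forces fallback
```

The point of the design is that the caller never needs to know which tier answered; only the speed and the chance of a complexity error would differ.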
-
@Coises said in Getting "Invalid Regular Expression" for an extremely simple expression:
It would seem that it should be possible to try a regular expression with an efficient engine first; if it parses, the job is done; if it says the expression isn’t valid within the more limited syntax of the efficient engine, then give it to the potentially slow but more comprehensive engine.
Are you proposing that Notepad++ implement something like this?
-
@Alan-Kilborn said in Getting "Invalid Regular Expression" for an extremely simple expression:
Are you proposing that Notepad++ implement something like this?
I was more “speculating” than “proposing.”
I think I’d want to see proof of value of something like this in a plugin — perhaps the search in my own Columns++, or perhaps in @Thomas-Knoefel’s MultiReplace — before I would suggest changing the implementation of a fundamental feature of Notepad++ itself (though in principle it would be transparent to users, just faster and with fewer of these obscure “complexity” messages).
-
Hello, @coises and All,
@coises, I searched a bit on the Internet and, according to an article of https://stackoverflow.com , I came across a series of tests to compare different regular expression engines :
- The oldest was provided by John Maddock, in 2003 :
https://www.boost.org/doc/libs/1_41_0/libs/regex/doc/gcc-performance.html
- Another one, on GitHub, with the same tests, was last modified in 2015 :
https://zherczeg.github.io/sljit/regex_perf.html
- The most recent, from the Rust community, with the same tests as well, in 2018 :
https://rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/
You can get the main test text, from the Gutenberg project at :
http://www.gutenberg.org/files/3200/old/mtent12.zip
And here are the results of these tests :
https://i.sstatic.net/ORL3Z.png
From this picture, here are the different links to get information about all these regex libraries, in the order from left to right :
https://github.com/hanickadot/compile-time-regular-expressions
https://theboostcpplibraries.com/boost.regex
https://cplusplus.com/reference/regex/
https://github.com/PCRE2Project/pcre2
https://www.pcre.org/current/doc/html/pcre2matching.html
https://www.pcre.org/original/doc/html/pcrejit.html
https://github.com/kkos/oniguruma
https://github.com/laurikari/tre
https://github.com/intel/hyperscan
https://github.com/rust-lang/regex
https://docs.rs/regex/latest/regex/struct.Regex.html ( not totally sure ? )
For instance, I tried the last test regex (.*?,){13}z against the complete mtent12.txt test file, extracted from the mtent12.zip archive, which, of course, fails miserably :-((
Then, I tried this other regex formulation (?:[^,]*,){13}[\u\l], without success, too ! However, I noticed that beginning at line 500,000 and searching downward does find one match !
So, I changed my strategy and simply marked all matches of the regex ,[\u\l]. As, normally, any comma is always followed with a space char, I should not get many matches !
As planned, I got 11 matches : a comma followed with a lower-case letter ! ( ,a × 2, ,b, ,g, ,h, ,m, ,n, ,s × 2, ,t and ,w )
Note that the requested case ,z does not exist at all !
And when moving the caret, let’s say, 100 - 200 lines before each of these matches, it allowed me to easily get all these matches !
At this point, I tried to select all the zones around these 11 matches in a small new file, that I named Matches.txt. Then, using the Mark dialog with (?:[^,]*,){13}[\u\l], against this small file, it does return 10 matches ( not 11 as explained in the next post ! )
However, it is distressing to note that the equivalent regex (?:.*?,){13}[\u\l] still fails against this tiny Matches.txt file, of only 16,138 bytes :-((
Unfortunately, it’s quite certain that cases, like that one, may arise when using most of the available regex engines !
In the next post, you’ll find the Matches.txt contents, for any further testing. My default test, which works nicely, is to mark multi-lines text, matching the (?:[^,]*,){13}[\u\l] regex !
Best Regards,
guy038
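One detail may be worth checking before treating the [^,]* and .*? formulations as interchangeable. The sketch below is scaled down and uses Python's re module, not Boost: {2} replaces {13}, [A-Za-z] stands in for [\u\l] (which Python does not support), re.DOTALL mimics the ". matches newline" option, and the sample string is invented for illustration. On this sample the two forms select different spans, because .*? is allowed to cross a comma when the letter test fails, while [^,]* is not.

```python
import re

# Scaled-down, hypothetical sample: commas are normally followed by a space,
# and only the last comma is directly followed by a letter.
SAMPLE = "a, b, c,d"

# {2} instead of {13}, and [A-Za-z] standing in for Boost's [\u\l].
NEGATED_CLASS = re.compile(r"(?:[^,]*,){2}[A-Za-z]")
LAZY_DOT = re.compile(r"(?:.*?,){2}[A-Za-z]", re.DOTALL)


def first_match(regex, text):
    """Return ((start, end), matched_text) for the first match, or None."""
    m = regex.search(text)
    return (m.span(), m.group()) if m else None


print(first_match(NEGATED_CLASS, SAMPLE))  # starts after the first comma
print(first_match(LAZY_DOT, SAMPLE))       # starts at the very beginning
```

The same freedom that lets .*? start earlier also gives a backtracking engine many more combinations to try at each of the 13 repetitions, which is presumably part of why the lazy-dot form reaches the complexity limit where the negated class does not.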
-
Hi, All,
================================================================================ BEGINNING of file ..... ..... ..... ================================================================================ Line 76,477 === more insupportable the clatter became, the more enchanted they all appeared to be. When there was silence, Mrs Sellers lifted upon Washington a face that beamed with a childlike pride, and said: "It belonged to his grandmother." The look and the tone were a plain call for admiring surprise, and therefore Washington said (it was the only thing that offered itself at the moment:) "Indeed!" "Yes, it did, didn't it father!" exclaimed one of the twins. "She was my great-grandmother--and George's too; wasn't she, father! You never saw her, but Sis has seen her, when Sis was a baby-didn't you, Sis! Sis has seen her most a hundred times. She was awful deef--she's dead, now. Aint she, father!" All the children chimed in, now, with one general Babel of information about deceased--nobody offering to read the riot act or seeming to discountenance the insurrection or disapprove of it in any way--but the head twin drowned all the turmoil and held his own against the field: "It's our clock, now--and it's ,got wheels inside of it, and a thing that flatters every time she strikes--don't it, father! Great-grandmother died before hardly any of us was born--she was an Old-School Baptist and ================================================================================ Line 76,527 === ..... ..... ..... ================================================================================ Line 147,911 === Welcome and home were mine within this State, Whose vales I leave -- whose spires fade fast from me And cold must be mine eyes, and heart, and tete, When, dear Alabama! they turn cold on thee!" There were very few there who knew what "tete" meant, but the poem was very satisfactory, nevertheless. 
Next appeared a dark-complexioned, black-eyed, black-haired young lady, who paused an impressive moment, assumed a tragic expression, and began to read in a measured, solemn tone: "A VISION "Dark and tempestuous was night. Around the throne on high not a single star quivered; but the deep intonations of the heavy thunder constantly vibrated upon the ear; whilst the terrific lightning revelled in angry mood through the cloudy chambers of heaven, seeming to scorn the power exerted over its terror by the illustrious Franklin! Even the boisterous winds unanimously came forth from their mystic homes, and blustered about as if to enhance by their aid the wildness of the scene. "At such a time,so dark,so dreary, for human sympathy my very spirit sighed; but instead thereof, ================================================================================ Line 147,967 === ..... ..... ..... ================================================================================ Line 257,829 === Then I told her my father and mother was dead, and the law had bound me out to a mean old farmer in the country thirty mile back from the river, and he treated me so bad I couldn't stand it no longer; he went away to be gone a couple of days, and so I took my chance and stole some of his daughter's old clothes and cleared out, and I had been three nights coming the thirty miles. I traveled nights, and hid daytimes and slept, and the bag of bread and meat I carried from home lasted me all the way, and I had a-plenty. I said I believed my uncle Abner Moore would take care of me, and so that was why I struck out for this town of Goshen. "Goshen, child? This ain't Goshen. This is St. Petersburg. Goshen's ten mile further up the river. Who told you this was Goshen?" "Why, a man I met at daybreak this morning, just as I was going to turn into the woods for my regular sleep. He told me when the roads forked I must take the right hand, and five mile would fetch me to Goshen." "He was drunk, I reckon. 
He told you just ex- actly wrong." "Well,,he did act like he was drunk, but it ain't no matter now. I got to be moving along. I'll fetch Goshen before daylight." ================================================================================ Line 257,887 === ..... ..... ..... ================================================================================ Line 272,599 === all busted up and ruined, because they could have the heart to serve Jim such a trick as that, and make him a slave again all his life, and amongst strangers, too, for forty dirty dollars. Once I said to myself it would be a thousand times better for Jim to be a slave at home where his family was, as long as he'd GOT to be a slave, and so I'd better write a letter to Tom Sawyer and tell him to tell Miss Watson where he was. But I soon give up that notion for two things: she'd be mad and disgusted at his rascality and ungratefulness for leaving her, and so she'd sell him straight down the river again; and if she didn't, everybody naturally despises an ungrateful nigger, and they'd make Jim feel it all the time, and so he'd feel ornery and disgraced. And then think of ME! It would get all around that Huck Finn helped a nigger to get his freedom; and if I was ever to see anybody from that town again I'd be ready to get down and lick his boots for shame. That's just the way: a person does a low-down thing, and then he don't want to take no consequences of it. Thinks as long as he can hide, it ain't no disgrace. That was my fix exactly. The more I studied about this the more my conscience went to grinding me, and the more wicked and low-down and ornery I got to feel- ing. 
And at last, when it hit me all of a sudden that here was the plain hand of Providence slapping me in the face and letting me know my wickedness was being watched all the time from up there in heaven,whilst I was stealing a poor old woman's nigger that hadn't ever done me no harm, and now was showing me there's One that's always on the lookout, and ain't a- ================================================================================ Line 272,663 === ..... ..... ..... ================================================================================ Line 371,705 === person goads, and crowds, and in a manner forces another person to talk, it is neither very fair nor very good-mannered to call what he says clack." "Oh, snuffle--do! and break your heart, you poor thing. Somebody fetch this sick doll a sugar-rag. Look you, Sir Jean de Metz, do you feel absolutely certain about that thing?" "What thing?" "Why, that Jean and Pierre are going to take precedence of all the lay noblesse hereabouts except the Duke d'Alenon?" "I think there is not a doubt of it." The Standard-Bearer was deep in thoughts and dreams a few moments, then the silk-and-velvet expanse of his vast breast rose and fell with a sigh, and he said: "Dear, dear, what a lift it is! It just shows what luck can do. Well, I don't care. I shouldn't care to be a painted accident--I shouldn't value it. I am prouder to have climbed up to where I am just by sheer natural merit than I would be to ride the very sun in the zenith and have to reflect that I was nothing but a poor little accident, and got shot up there out of somebody else's catapult. To me, merit is everything--in fact, the only thing. All else is dross." Just then the bugles blew the assembly, and that cut our talk short. Chapter 25 At Last--Forward! THE DAYS began to waste away--and nothing decided,nothing done. The army was full of zeal, but it was also hungry. 
It got no pay, the treasury was getting empty, it was becoming impossible to feed it; under pressure of privation it began to fall apart and ================================================================================ Line 371,773 === ..... ..... ..... ================================================================================ Line 378,129 === looking on in tears, all the way, enemies laughing. We reached Gien at last--that place whence we had set out on our splendid march toward Rheims less than three months before, with flags flying, bands playing, the victory-flush of Patay glowing in our faces, and the massed multitudes shouting and praising and giving us godspeed. There was a dull rain falling now, the day was dark, the heavens mourned, the spectators were few, we had no welcome but the welcome of silence, and pity, and tears. Then the King disbanded that noble army of heroes; it furled its flags, it stored its arms: the disgrace of France was complete. La Tremouille wore the victor's crown; Joan of Arc, the unconquerable, was conquered. Chapter 41 The Maid Will March No More YES, IT was as I have said: Joan had Paris and France in her grip,and the Hundred Years' War under her heel, and the King made her open her fist and take away her foot. ================================================================================ Line 378,165 === ..... ..... ..... ================================================================================ Line 503,387 === been disguised and set at lowly occupations for dramatic effect, but I think McClintock is the first to send one of them to school. Thus, in this book, you pass from wonder to wonder, through gardens of hidden treasure, where giant streams bloom before you, and behind you, and all around, and you feel as happy, and groggy, and satisfied with your quart of mixed metaphor aboard as you would if it had been mixed in a sample-room and delivered from a jug. 
Now we come upon some more McClintockian surprise--a sweetheart who is sprung upon us without any preparation, along with a name for her which is even a little more of a surprise than she herself is. In 1842 he entered the class, and made rapid progress in the English and Latin departments. Indeed, he continued advancing with such rapidity that he was like to become the first in his class, and made such unexpected progress, and was so studious, that he had almost forgotten the pictured saint of his affections. The fresh wreaths of the pine and cypress had waited anxiously to drop once more the dews of Heaven upon the heads of those who had so often poured forth the tender emotions of their souls under its boughs. He was aware of the pleasure that he had seen there. So one evening ,as he was returning from his reading, he concluded he would pay a visit to this enchanting spot. Little did he think of witnessing a shadow of his former happiness, though no doubt he wished it might be so. ================================================================================ Line 503,435 === ..... ..... ..... ================================================================================ Line 503,091 === In 1842 he entered the class, and made rapid progress in the English and Latin departments. Indeed, he continued advancing with such rapidity that he was like to become the first in his class, and made such unexpected progress, and was so studious, that he had almost forgotten the pictured saint of his affections. The fresh wreaths of the pine and cypress had waited anxiously to drop once more the dews of Heavens upon the heads of those who had so often poured forth the tender emotions of their souls under its boughs. He was aware of the pleasure that he had seen there. So one evening, as he was returning from his reading, he concluded he would pay a visit to this enchanting spot. 
Little did he think of witnessing a shadow of his former happiness, though no doubt he wished it might be so. He continued sauntering by the roadside, meditating on the past. The nearer he approached the spot, the more anxious he became. At the moment a tall female figure flitted across his path, with a bunch of roses in her hand; her countenance showed uncommon vivacity, with a resolute spirit; her ivory teeth already appeared as she smiled beautifully, promenading--while her ringlets of hair dangled unconsciously around her snowy neck. Nothing was wanting to complete her beauty. The tinge of the rose was in full bloom upon her cheek; the charms of sensibility and tenderness were always her associates.. In Ambulinia's bosom dwelt a noble soul--one that never faded-- one that never was conquered. Her heart yielded to no feeling but the love of Elfonzo, on whom she gazed with intense delight, and to whom she felt herself more closely bound ,because he sought the hand of no other. Elfonzo was roused from his apparent reverie. His books no longer were his inseparable companions--his thoughts arrayed themselves to encourage him in the field of victory. ================================================================================ Line 505,145 === ..... ..... ..... ================================================================================ Line 649,533 === that slavery was a bald, grotesque, and unwarranted ursurpation. She had never heard it assailed in any pulpit, but had heard it defended and sanctified in a thousand. As far as her experience went, the wise, the good, and the holy were unanimous in the belief that slavery was right, righteous, sacred, the peculiar pet of the Deity, and a condition which the slave himself ought to be daily and nightly thankful for." Yet Jane Clemens must have had qualms at times--vague, unassembled doubts that troubled her spirit. 
After Jennie was gone a little black chore-boy was hired from his owner, who had bought him on the east shore of Maryland and brought him to that remote Western village, far from family and friends. He was a cheery spirit in spite of that, and gentle, but very noisy. All day he went about singing, whistling, and whooping until his noise became monotonous, maddening. One day Little Sam said: "Ma--[that was the Southern term]--,make Sandy stop singing all the time. It's awful." Tears suddenly came into his mother's eyes. ================================================================================ Line 649,573 === ..... ..... ..... ================================================================================ Line 663,603 === literati, local and visiting, used to gather there. Names that would be well known later were included in that little band. Joaquin Miller recalls from an old diary, kept by him then, having seen Adah Isaacs Menken, Prentice Mulford, Bret Harte, Charles Warren Stoddard, Fitzhugh Ludlow, Mark Twain, Orpheus C. Kerr, Artemus Ward, Gilbert Densmore, W. S. Kendall, and Mrs. Hitchcock assembled there at one time. The Era office would seem to have been a sort of Mount Olympus, or Parnassus, perhaps; for these were mainly poets, who had scarcely yet attained to the dignity of gods. Miller was hardly more than a youth then, and this grand assemblage impressed him, as did the imposing appointments of the place. The Era rooms were elegant--[he says]--,the most grandly carpeted and most gorgeously furnished that I have ever seen. Even now in my memory they seem to have been simply palatial. I have seen the world well since then--all of its splendors worth seeing--yet those ================================================================================ Line 663,633 === ..... ..... ..... ================================================================================ END of file
As you can see, the second match stops at the first ,s, with another string ,s, on the same line, not part of the second match
BR,
guy038
-
@guy038 said in Getting "Invalid Regular Expression" for an extremely simple expression:
At this point, I tried to select all the zones around these 11 matches in a small new file, that I named Matches.txt. Then, using the Mark dialog with (?:[^,]*,){13}[\u\l], against this small file, it does return 10 matches ( not 11 as explained in the next post ! )
However, it is distressing to note that the equivalent regex (?:.*?,){13}[\u\l] still fails against this tiny Matches.txt file, of only 16,138 bytes :-((
Unfortunately, it’s quite certain that cases, like that one, may arise when using most of the available regex engines !
There are two ways an implementation can look at a regex:
- A regex is a definition of matching character strings.
- A regex is a procedure for matching character strings.
From the first perspective, your two expressions are equivalent: they specify the same strings as matches. From the second perspective, they are not: they specify different procedures for finding strings that match.
No one has found a way to implement back references using method 1. Once your regular expression syntax includes the ability to use back references, you are stuck with the procedural interpretation.
There are other features of Perl-compatible regular expressions that present problems, but back references are the killer.
I’m speculating here, but I think once you include any back reference in an expression, it breaks the ability to process any part of the expression that occurs before the back reference as a definition rather than a procedure. (I’m not certain of that. I have no doubt someone does know the answer to that… but that someone isn’t me.)
So I think you’ll find all those more efficient regular expression engines implement a severely restricted syntax for regular expressions which omits features none of us would like to do without (particularly, back references).
What I’ve also speculated is that perhaps a regular expression engine could include two engines: one which processes using the “definition” approach for expressions to which it is applicable, and one which uses the “procedural” approach for the remaining expressions. I don’t know if any do that now.