Wildcard Search
-
Is it possible to search for a string ‘n’ characters or digits long?
Specifically, I’m trying to construct a macro to convert unformatted data into CSV: I want to be able to search for all occurrences of numbers ‘n’ digits long and replace the following space with a comma. The string of digits is to remain unchanged.
Any assistance/suggestions appreciated.
Best regards,
Peter
-
@Peter-McCormack said:
search for a string ‘n’ characters
It certainly is possible to do as you ask. What you would primarily be using in the regular expression are class operators and a quantifier.
For example, \d{5} means look for 5 digits together. We can expand on this and say \d{5,8} which asks for at least 5, and up to 8 digits together, tending to as few as necessary. A further expansion of this is to say \d{5,8}+ which says between 5 and 8 digits but more rather than less, so it’s greedy.
Now you say the data is unformatted, so unless it is of fixed width, you run the risk of grabbing data meant for another field into the one you are looking for. Hopefully your data is of fixed width, or the fields on either side are non-digits, say alpha characters, which will prevent these examples from pulling in more than they should.
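To make the original request concrete, here is a minimal Python sketch (Python is my choice for illustration, not something from the thread). It looks for a run of exactly n digits followed by a space and swaps that space for a comma. The function name and the sample line are hypothetical, and the (?<!\d) lookbehind is my own addition to guard against the "grabbing data from another field" problem mentioned above.

```python
import re

def comma_after_digit_runs(text: str, n: int) -> str:
    """Replace the space that follows a run of exactly n digits with a comma."""
    # (?<!\d) : the run must not be preceded by another digit
    # (\d{n}) : exactly n digits, captured so they are written back unchanged
    # " "     : the space that becomes a comma
    pattern = re.compile(rf"(?<!\d)(\d{{{n}}}) ")
    return pattern.sub(r"\1,", text)

# Hypothetical unformatted line: only the 5-digit runs are affected.
line = "12345 abc 123456 99999 x"
converted = comma_after_digit_runs(line, 5)
print(converted)
```

The 6-digit field is left alone because its first five digits are followed by a digit, not a space, and its last five digits are preceded by a digit.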
This information, by the way, comes from www.rexegg.com. Specifically
https://www.rexegg.com/regex-quantifiers.html
and
https://www.rexegg.com/regex-class-operations.html
If you need further help you might be best to provide some examples of the unformatted data. We can better help you with all the facts available.
Terry
-
Hello, @peter-mccormack, @terry-r and All,
Terry, you said :
We can expand on this and say \d{5,8} which asks for at least 5, and up to 8 digits together, tending to as few as necessary. A further expansion of this is to say \d{5,8}+ which says between 5 and 8 digits but more rather than less, so it’s greedy.
I’m really sorry, Terry, but your reasoning about lazy, greedy and possessive quantifiers is not quite accurate!
- First, if we consider, for instance, the general syntax A{2,9}, this defines 3 types of quantifiers:
  - The greedy quantifier A{2,9}, which tries to match as many letters A as possible, with a maximum of 9 letters A
  - The lazy quantifier A{2,9}?, which tries to match as few letters A as possible, so a minimum of 2 letters A
  - The possessive quantifier A{2,9}+, which tries to match as many letters A as possible, with a maximum of 9 letters A, but which NEVER allows the regex engine to backtrack in order to make the overall pattern match!
- Let’s suppose that our sample text, to test some regex syntaxes, is the simple string AAAAAAAAA ( 9 letters A ), in a new tab
  - The regex (?-i)A{2,9}?A matches the string AAA: Logical, because A{2,9}? matches AA, as a lazy quantifier. Then A matches the third A, of course!
  - The regex (?-i)A{2,9}A matches the whole string AAAAAAAAA. Again logical, but this needs a quick explanation:
    - First, the part A{2,9} matches the entire string AAAAAAAAA ( 9 letters ) but, now, there’s NO more text to satisfy the last part of the regex, A
    - So, the regex engine backtracks and the part A{2,9} matches the string AAAAAAAA ( 8 letters only ). This time, the remainder of the regex, A, can match the 9th letter A!
  - The regex (?-i)A{2,9}+A, with the possessive quantifier, matches nothing! Do you understand the logic of this result?
    - Like above, the part A{2,9}+ matches the entire string AAAAAAAAA. And again, there is NO more text which could be matched by A, the remainder of the regex!
    - But, unlike the case above, due to the possessive quantifier, the regex engine is NOT allowed, this time, to backtrack. So, as the regex doesn’t have other alternatives, the regex engine cannot match our subject string and the process stops!
  - The slightly modified regex (?-i)A{2,8}+A, although containing a possessive quantifier, does match the entire string AAAAAAAAA! I suppose you’ve already guessed why :-))
    - First, the part A{2,8}+, although possessive, matches the string AAAAAAAA ( its maximum: 8 letters ) and the last part A of the regex matches the 9th letter A
    - This time, NO need to backtrack: the overall pattern matches our subject string AAAAAAAAA
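The four results above can be checked with a short Python sketch. Two assumptions on my side: the (?-i) prefix is left out because Python's re is case-sensitive by default, and, since Python only accepts the A{2,9}+ syntax from version 3.11 on, the possessive quantifier is emulated with the lookahead-plus-backreference idiom (?=(A{2,9}))\1, which should behave the same way here because a lookahead is never re-entered on backtracking. The helper name found is hypothetical.

```python
import re

TEXT = "A" * 9  # the sample string AAAAAAAAA

def found(pattern: str, text: str = TEXT):
    """Return the first match of pattern in text, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Emulation of the possessive quantifiers A{2,9}+ and A{2,8}+ (assumption:
# used instead of the native syntax so the sketch also runs before Python 3.11).
POSSESSIVE_2_9 = r"(?=(A{2,9}))\1"
POSSESSIVE_2_8 = r"(?=(A{2,8}))\1"

results = {
    "A{2,9}?A": found(r"A{2,9}?A"),           # lazy: AA, then the third A
    "A{2,9}A":  found(r"A{2,9}A"),            # greedy: backtracks once
    "A{2,9}+A": found(POSSESSIVE_2_9 + "A"),  # possessive: no backtracking
    "A{2,8}+A": found(POSSESSIVE_2_8 + "A"),  # possessive, but one A is left
}

for regex, match in results.items():
    print(f"{regex:10} -> {match}")
```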
Keeping again our sample text AAAAAAAAA, let’s test some other regexes :
- The regex (?-i)A{2,9}?A{2,9}?, with two lazy quantifiers, matches the string AAAA ( 2 times the minimum of 2 letters )
- The regex (?-i)A{2,9}?A{2,9}, with a lazy quantifier, followed by a greedy one, matches the entire string AAAAAAAAA ( The first part A{2,9}? matches AA and the last part A{2,9} matches AAAAAAA )
- The regex (?-i)A{2,9}A{2,9}?, with a greedy quantifier, followed by a lazy one, also matches the whole subject string:
  - Indeed, the first part A{2,9} can, at first, match all the subject string ( 9 letters )
  - As there is NO more text for the last part A{2,9}?, the regex engine backtracks 1 position
  - So, the first part A{2,9} matches the 8-chars string AAAAAAAA but the last part A{2,9}? cannot match the 9th letter A, as a minimum of two letters A is required!
  - Again, the regex engine backtracks 1 position. So the first part A{2,9} matches the 7-chars string AAAAAAA
  - This time, the last part A{2,9}? can match the string AA: Done!
- The regex (?-i)A{2,9}A{2,9}, with two greedy quantifiers, matches the entire string AAAAAAAAA. Logical as, like above, after two backtracking steps, the first part A{2,9} matches the 7-chars string AAAAAAA and the last part A{2,9} matches the 8th and 9th letters A, so AA ( UPDATED 07-05-2019 )
- The regex (?-i)A{2,9}+A{2,9}, with a possessive quantifier, followed by a greedy one, matches nothing. Why?
  - The first part A{2,9}+ matches the entire string AAAAAAAAA, at the beginning. But, as NO more text can be matched by the last part A{2,9} and as backtracking is not allowed because the quantifier is possessive, the process stops without any match!
- Finally, the regex (?-i)A{2,9}+A{2,9}?, with a possessive quantifier, followed by a lazy one, produces the same result: no match
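Here is a small Python sketch that reports how the 9 letters are shared between the two parts of each regex above. The capturing groups and the helper name split_lengths are my own additions for illustration. As before, (?-i) is dropped (Python's re is case-sensitive by default) and the possessive quantifier is emulated with (?=(A{2,9}))\1, on the assumption that the native A{2,9}+ syntax (Python 3.11 and later) may not be available.

```python
import re

TEXT = "A" * 9  # the sample string AAAAAAAAA

def split_lengths(pattern: str, text: str = TEXT):
    """Return (letters matched by part 1, letters matched by part 2), or None."""
    m = re.search(pattern, text)
    return (len(m.group(1)), len(m.group(2))) if m else None

# Group 1 is always the first part, group 2 the second part.
POSSESSIVE = r"(?=(A{2,9}))\1"  # stands in for A{2,9}+ (assumption, see above)

splits = {
    "lazy + lazy":         split_lengths(r"(A{2,9}?)(A{2,9}?)"),
    "lazy + greedy":       split_lengths(r"(A{2,9}?)(A{2,9})"),
    "greedy + lazy":       split_lengths(r"(A{2,9})(A{2,9}?)"),
    "greedy + greedy":     split_lengths(r"(A{2,9})(A{2,9})"),
    "possessive + greedy": split_lengths(POSSESSIVE + r"(A{2,9})"),
    "possessive + lazy":   split_lengths(POSSESSIVE + r"(A{2,9}?)"),
}

for name, lengths in splits.items():
    print(f"{name:20} -> {lengths}")
```

Both greedy-first cases end up with the same 7 + 2 split, which is what the two backtracking steps described above predict.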
You could say : So, what is the benefit of using possessive quantifiers ?
- First, to speed up regular expressions. In particular, they help some alternatives of your regex to fail faster!
- Secondly, they prevent the regex engine from trying all possible permutations. This can be useful for performance reasons!
- Thirdly, in case of nested quantifiers, for instance, they may save your day by sparing the regex engine a catastrophic backtracking event :-((
One example :
Let’s imagine the regex (?-i)A*Z, against this sample text AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA, in a new tab
The regex engine, due to the greedy quantifier, will backtrack 50 times to realize that, at any position, an uppercase letter Z cannot be found :-((
Let’s consider, now, the regex (?-i)A*+Z. This time, the possessive part A*+ grabs all the letters A and there is NO more text to match the Z part of the regex. And, as backtracking is not allowed, the regex fails faster, so the regex engine can search for other matches, further on, more quickly!
Of course, you don’t see any difference between the two cases but, for long texts and/or complicated regexes, this may be significant!!
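To put a number on the difference, here is a toy Python model of what happens when A*Z or A*+Z is tried at the very first position of the sample text. It is only a sketch of the backtracking idea, not how any real engine is implemented (real engines add their own optimisations), and the function name count_z_tests is hypothetical.

```python
import re

def count_z_tests(text: str, possessive: bool) -> int:
    """Toy model of A*Z (or A*+Z) tried at offset 0: count the tests of 'Z'."""
    run = 0
    while run < len(text) and text[run] == "A":
        run += 1                      # A* first grabs the whole run of letters A
    tests = 0
    for kept in range(run, -1, -1):   # then it gives back one A at a time
        tests += 1
        if kept < len(text) and text[kept] == "Z":
            break                     # Z found: the overall pattern matches
        if possessive:
            break                     # A*+ never gives anything back
    return tests

sample = "A" * 50
greedy_tests = count_z_tests(sample, possessive=False)     # first try + 50 backtracks
possessive_tests = count_z_tests(sample, possessive=True)  # a single try
no_match = re.search(r"A*Z", sample) is None               # the real engine agrees

print(greedy_tests, possessive_tests, no_match)
```

In this model the greedy version tests for Z once after grabbing all 50 letters and once more after each of the 50 backtracking steps, while the possessive version tests only once before giving up at that position.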
Finally, here is, below, a summary table of all the quantifiers :
                •--------------------------------------•
                |             QUANTIFIERS              |
•---------------•----------•------------•--------------•
|  REPETITIONS  |  Greedy  |    Lazy    |  Possessive  |
•---------------•----------•------------•--------------•
|  From n to ∞  |   {n,}   |   {n,}?    |    {n,}+     |
•---------------•----------•------------•--------------•
|  From n to m  |  {n,m}   |   {n,m}?   |    {n,m}+    |
•---------------•----------•------------•--------------•
|  From 0 to 1  |    ?     |     ??     |      ?+      |
•---------------•----------•------------•--------------•
|  From 0 to ∞  |    *     |     *?     |      *+      |
•---------------•----------•------------•--------------•
|  From 1 to ∞  |    +     |     +?     |      ++      |
•---------------•----------•------------•--------------•
|  From n to n  |                 {n}                  |
•---------------•--------------------------------------•
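The Greedy and Lazy columns of the table can be compared side by side with a few lines of Python, run against the hypothetical sample AAAA. The Possessive column is left out of this sketch: when nothing follows the quantifier, a possessive form matches the same text as its greedy counterpart, and the native syntax needs Python 3.11 or later.

```python
import re

TEXT = "AAAA"  # hypothetical sample, just long enough to show the difference

def first(pattern: str) -> str:
    """Text matched by pattern at the very start of TEXT."""
    return re.match(pattern, TEXT).group(0)

# (greedy form, lazy form) for each row of the table, with n = 2 and m = 3
pairs = [
    ("A{2,}",  "A{2,}?"),
    ("A{2,3}", "A{2,3}?"),
    ("A?",     "A??"),
    ("A*",     "A*?"),
    ("A+",     "A+?"),
]

table = {greedy: (first(greedy), first(lazy)) for greedy, lazy in pairs}

for greedy, (g, l) in table.items():
    print(f"{greedy:7} greedy={g!r:7} lazy={l!r}")
```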
Note that the {n} quantifier has no Greedy, Lazy or Possessive flavors: it just means exactly n times the character or expression right before it!
Best Regards
guy038
-