Wildcard Search

Peter McCormack

Is it possible to search for a string ‘n’ characters or digits long?
Specifically, I’m trying to construct a macro to convert unformatted data into CSV: I want to search for all occurrences of numbers ‘n’ digits long and replace the following space with a comma. The string of digits is to remain unchanged.
Any assistance/suggestions appreciated.
Best regards,
Peter

Terry R

@Peter-McCormack said:

search for a string ‘n’ characters

It certainly is possible to do as you ask. What you would primarily be using in the regular expression are class operators and a quantifier.
For example \d{5} means look for 5 digits together. We can expand on this and say \d{5,8} which asks for at least 5, and up to 8 digits together, tending to as few as necessary. A further expansion of this is to say \d{5,8}+ which says between 5 and 8 digits but more rather than less, so it’s greedy.

Now you say the data is unformatted so unless it is of fixed width, you run into the possibility of grabbing data meant for another field into one you are looking for. Hopefully your data is of fixed width or fields residing either side are non digits, say alpha characters which will prevent these examples from pulling more than they should.
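
As a rough illustration of the point above, here is a minimal Python sketch. Python's re module is only a stand-in for the Notepad++ engine, and the function name and the sample text are made up for the example. It replaces the space after a run of n digits with a comma, and shows how a bare \d{5} can grab the tail of a longer number unless a lookbehind guards the left side.

```python
import re

def digits_space_to_comma(text, n, guarded=True):
    """Replace the space after each n-digit number with a comma.

    With guarded=True a lookbehind rejects runs that are preceded by
    another digit, so only numbers of exactly n digits are affected.
    (Hypothetical helper, written for this illustration only.)
    """
    guard = r"(?<!\d)" if guarded else ""
    pattern = guard + r"(\d{" + str(n) + r"}) "
    # \1 puts the captured digits back unchanged; only the space is replaced.
    return re.sub(pattern, r"\1,", text)

sample = "12345 abc 1234567 def 54321 x"
print(digits_space_to_comma(sample, 5))                 # guarded search
print(digits_space_to_comma(sample, 5, guarded=False))  # unguarded search
```

The guarded call leaves the 7-digit number alone, while the unguarded one also puts a comma after 1234567, because its last 5 digits are followed by a space. In Notepad++ the same idea should translate to searching for (?<!\d)(\d{5}) followed by a space and replacing with \1, in regular expression mode.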

This information, by the way, comes from www.rexegg.com. Specifically:
https://www.rexegg.com/regex-quantifiers.html
and
https://www.rexegg.com/regex-class-operations.html

If you need further help you might be best to provide some examples of the unformatted data. We can better help you with all the facts available.

Terry

guy038

Hello, @peter-mccormack, @terry-r and All,

Terry, you said :

We can expand on this and say \d{5,8} which asks for at least 5, and up to 8 digits together, tending to as few as necessary. A further expansion of this is to say \d{5,8}+ which says between 5 and 8 digits but more rather than less, so it’s greedy.

I’m really sorry, Terry, but your reasoning, about lazy, greedy and possessive quantifiers, is not exact !

First, if we consider, for instance, the general syntax A{2,9}, this defines 3 types of quantifiers :
- The greedy quantifier A{2,9} which tries to match as many letters A as possible, with a maximum of 9 letters A
- The lazy quantifier A{2,9}? which tries to match as few letters A as possible, so a minimum of 2 letters A
- The possessive quantifier A{2,9}+ which tries to match as many letters A as possible, with a maximum of 9 letters A, but which NEVER allows the regex engine to backtrack so that the overall pattern would match !

Let’s suppose that our sample text, to test some regex syntaxes, is the simple string AAAAAAAAA ( 9 letters A ), in a new tab

The regex (?-i)A{2,9}?A matches the string AAA : Logical, because A{2,9}? matches AA, as a lazy quantifier. Then A matches the third A, of course !
The regex (?-i)A{2,9}A matches the whole string AAAAAAAAA. Again logical, but this needs a quick explanation :
- First, the part A{2,9} matches the entire string AAAAAAAAA ( 9 letters ), but, now, there’s NO more text, to satisfy the last part of the regex A
- So, the regex engine backtracks and the part A{2,9} matches the string AAAAAAAA ( 8 letters only ). This time, the remainder of the regex, A, can match the 9th letter A !
The regex (?-i)A{2,9}+A, with the possessive quantifier, matches nothing ! Do you understand the logic of this result ?
- Like above, the part A{2,9}+ matches the entire string AAAAAAAAA. And again, there is NO more text which could be matched by A, the remainder of the regex !
- But, unlike the case above, due to the possessive quantifier, the regex engine is NOT allowed, this time, to backtrack. So, as the regex doesn’t have any other alternative, the regex engine cannot match our subject string and the process stops !
The slightly modified regex (?-i)A{2,8}+A, although containing a possessive quantifier, does match the entire string AAAAAAAAA ! I suppose you’ve already guessed why :-))
- First, the part A{2,8}+ although possessive, matches the string AAAAAAAA ( its maximum : 8 letters ) and the last part A of the regex matches the 9th letter A
- This time, NO need to backtrack : the overall pattern matches our subject string AAAAAAAAA

Keeping again our sample text AAAAAAAAA, let’s test some other regexes :

The regex (?-i)A{2,9}?A{2,9}?, with two lazy quantifiers, matches the string AAAA ( 2 times the minimum of 2 letters )
The regex (?-i)A{2,9}?A{2,9}, with a lazy quantifier, followed by a greedy one, matches the entire string AAAAAAAAA ( The first part A{2,9}? matches AA and the last part A{2,9} matches AAAAAAA )
The regex (?-i)A{2,9}A{2,9}?, with a greedy quantifier, followed by a lazy one, also matches the entire subject string, in this way :
- Indeed, the first part A{2,9} can match all the subject string ( 9 letters )
- As there is NO more text for the last part A{2,9}?, the regex engine backtracks 1 position
- So, the first part A{2,9} matches the 8-chars string AAAAAAAA but the last part A{2,9}? cannot match the 9th letter A, as a minimum of two letters A is required !
- Again, the regex engine backtracks 1 position. So the first part A{2,9} matches the 7-chars string AAAAAAA
- This time, the last part A{2,9}? can match the string AA : Done !
The regex (?-i)A{2,9}A{2,9}, with two greedy quantifiers, matches the entire string AAAAAAAAA. Logical, as, like above, after two backtracking steps, the first part A{2,9} matches the 7-chars string AAAAAAA and the last part A{2,9} matches the 8th and 9th letter A, so AA ( UPDATED 07-05-2019)
The regex (?-i)A{2,9}+A{2,9}, with a possessive quantifier, followed by a greedy one, matches nothing. Why ?
- The first part A{2,9}+ matches the entire string AAAAAAAAA, at the beginning. But, as NO more text can be matched by the last part A{2,9} and as backtracking is not allowed, because the quantifier is possessive, the process stops without any match !
Finally, the regex (?-i)A{2,9}+A{2,9}?, with a possessive quantifier, followed by a lazy one, behaves the same way and gives no match

You could say : So, what is the benefit of using possessive quantifiers ?

First, to speed up regular expressions. In particular, they help some alternatives, of your regex, to fail faster !
Secondly, they prevent the regex engine from trying all possible permutations. This can be useful for performance reasons !
Thirdly, in case of nested quantifiers, for instance, they may save your day by preventing the regex engine from the catastrophic backtracking event :-((

One example :

Let’s imagine the regex (?-i)A*Z, against this sample text AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA, in a new tab

The regex engine, due to the greedy quantifier, will backtrack 50 times, giving back one letter A at a time, before realizing that an uppercase letter Z cannot be found at any of these positions :-((

Let’s consider, now, the regex (?-i)A*+Z. This time, the possessive part A*+ grabs all the letters A and there is NO more text to match the Z part of the regex. And, as backtracking is not allowed, the regex fails faster and so, the regex engine can search, for other matches, further on, more quickly !

Of course, you don’t see any difference between the two cases but, for long texts and/or complicated regexes, this may be significant !!

Finally, here is, below, a summary table of all the quantifiers :

                     •--------------------------------------•
                     |             QUANTIFIERS              |
     •---------------•----------•------------•--------------•
     |  REPETITIONS  |  Greedy  |    Lazy    |  Possessive  |
     •---------------•----------•------------•--------------•
     |  From n to ∞  |  {n,}    |   {n,}?    |    {n,}+     |
     •---------------•----------•------------•--------------•
     |  From n to m  |  {n,m}   |   {n,m}?   |    {n,m}+    |
     •---------------•----------•------------•--------------•
     |  From 0 to 1  |    ?     |     ??     |      ?+      |
     •---------------•----------•------------•--------------•
     |  From 0 to ∞  |    *     |     *?     |      *+      |
     •---------------•----------•------------•--------------•
     |  From 1 to ∞  |    +     |     +?     |      ++      |
     •---------------•----------•------------•--------------•
     |  From n to n  |               {n}                    |
     •---------------•--------------------------------------•

Note that the {n} quantifier cannot be qualified with the flavors Greedy, Lazy or Possessive. It just means exactly n times, the character or expression, right before !

Best Regards

guy038