Regex: select/match the numbers that are repeated most often
-
Hello, I have 15 rows with 7 numbers each, from 1 to 50. How can I match the 4 numbers that are repeated most often across all 15 rows?
I suppose I must first select all numbers
\d+
then I have to divide all 2-digit numbers \b[1-9]{2}\b
by all 1-digit numbers \b[1-9]{1}\b
or I should select all numbers from 1-10,
then all numbers from 10-20
…and from 40-50
I don’t know exactly; there should be a mathematical formula. In Excel I can use filters for this, or sort from lowest to highest, etc.
But how can I do this with regex?
-
@Vasile-Caraus
It isn’t entirely clear what you mean by “match”. Do you simply want to know which 4 numbers appear most often, or something else altogether?
If you know how to do this in Excel, then if I were you, I would import the document into Excel and do whatever it is you want done.
Otherwise, you should probably use a scripting language (AWK, Perl, Python, etc.). This doesn’t sound like a task that regex is best suited to.
-
Hello, Vasile,
I tried to guess, first, what you wanted to achieve. After getting random numbers from the Net, I spent some hours, from time to time, imagining a method! And, luckily, I found a solution, with the help of the Random.org site. It lets you obtain the most frequent integers in a table of at most 10,000 integers, with values between 1 and 9999.
On the Random.org site, the values of the random numbers can be anywhere in the range ±1,000,000,000 but, because of the regexes needed later, I preferred to limit this range to between 1 and 9999.
As your table of numbers contains 15 rows of 7 columns, the total number of integers, with values between 1 and 50, is 105.
So, go, first, to the Random.org site, at the address below:
https://www.random.org/integers/
I typed the following answers:
- Generate 105 random integers (maximum 10,000).
- Each integer should have a value between 1 and 50 (both inclusive; limits ±1,000,000,000).
- Format in 7 column(s).
Note: The numbers are generated left to right!
And I clicked on the Get Numbers button.
I got a 15 x 7 table of 105 random integers, below, that I copied/pasted in a new tab, in N++
2 27 7 11 32 6 7
8 45 50 19 37 40 47
21 11 50 46 50 27 49
41 13 36 3 37 29 23
25 22 47 3 37 2 29
8 48 29 46 24 18 9
46 8 24 19 5 22 27
29 26 44 47 22 22 5
22 25 35 47 48 24 3
10 20 28 49 7 24 3
37 27 4 40 44 45 14
4 44 15 43 46 32 7
47 15 11 17 16 42 8
28 44 43 24 17 8 5
32 27 11 1 35 28 29
In that output list, the integers are separated by a single tab character. As I intended to sort these values, I first needed to put all the values in a one-column table.
Moreover, it was necessary to use a template with possible leading zeros, in order to sort these integers correctly later! So:
- A one-digit integer was changed into 000#
- A two-digit integer was changed into 00##
- A three-digit integer was changed into 0###
- A four-digit integer was left as ####
The regex S/R which achieves these two goals is:
SEARCH ^(\d(\d(\d(\d)?)?)?)(?:\t|\R)
REPLACE (?2:0)(?3:0)(?4:0)\1\r\n
After clicking ONCE on the Replace All button, I got the list of 105 integers, one per line, below:
0002 0027 0007 0011 0032 0006 0007 0008 0045 0050 0019 0037 0040 0047 0021 0011 0050 0046 0050 0027 0049 0041 0013 0036 0003 0037 0029 0023 0025 0022 0047 0003 0037 0002 0029 0008 0048 0029 0046 0024 0018 0009 0046 0008 0024 0019 0005 0022 0027 0029 0026 0044 0047 0022 0022 0005 0022 0025 0035 0047 0048 0024 0003 0010 0020 0028 0049 0007 0024 0003 0037 0027 0004 0040 0044 0045 0014 0004 0044 0015 0043 0046 0032 0007 0047 0015 0011 0017 0016 0042 0008 0028 0044 0043 0024 0017 0008 0005 0032 0027 0011 0001 0035 0028 0029
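For readers who prefer a script, the same normalisation (one integer per line, left-padded with zeros to four digits) can be sketched in plain Python. This is only an illustration running outside Notepad++, not part of the regex method; the function name pad_numbers and the sample string are made up for the example.

```python
def pad_numbers(text, width=4):
    """Return every integer in `text` on its own line, zero-padded to `width` digits."""
    # str.split() with no argument splits on any run of whitespace,
    # so tabs, spaces and line breaks are all treated as separators.
    return "\n".join(token.zfill(width) for token in text.split())


sample = "2\t27\t7\t11\n32\t6\t7"  # hypothetical tab-separated input
padded = pad_numbers(sample)
print(padded)
```

Because the split ignores which whitespace character separates the values, the same call should also cope with space-separated rows.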
Using the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending, I obtained the sorted text, below :
0001 0002 0002 0003 0003 0003 0003 0004 0004 0005 0005 0005 0006 0007 0007 0007 0007 0008 0008 0008 0008 0008 0009 0010 0011 0011 0011 0011 0013 0014 0015 0015 0016 0017 0017 0018 0019 0019 0020 0021 0022 0022 0022 0022 0022 0023 0024 0024 0024 0024 0024 0025 0025 0026 0027 0027 0027 0027 0027 0028 0028 0028 0029 0029 0029 0029 0029 0032 0032 0032 0035 0035 0036 0037 0037 0037 0037 0040 0040 0041 0042 0043 0043 0044 0044 0044 0044 0045 0045 0046 0046 0046 0046 0047 0047 0047 0047 0047 0048 0048 0049 0049 0050 0050 0050
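Why the leading zeros matter for a lexicographic sort can be seen in a few lines of Python. This is a small sketch with made-up sample values, not part of the original procedure.

```python
raw = ["2", "27", "7", "11", "32", "6"]  # hypothetical sample values

# A plain string sort compares character by character, so "11" sorts before "2".
text_sorted = sorted(raw)

# After zero-padding to a fixed width, string order and numeric order agree.
padded_sorted = sorted(value.zfill(4) for value in raw)

print(text_sorted)
print(padded_sorted)
```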
Then, I found a regex to put all identical numbers on a single line. For instance, the four numbers 0003, on four consecutive lines, are displayed, after replacement, on the single line 0003 0003 0003 0003. So:
SEARCH (\d{4})\R\1
REPLACE \1 \1 , with a space character between the two back-references \1
IMPORTANT: You must click TWICE on the Replace All button to complete this S/R.
REMARK:
- If each number occurs only ONCE or TWICE in the current random list, you may already get the message Replace All: 0 occurrences were replaced when clicking on the Replace All button a second time!
Thus, after TWO clicks on the Replace All button, that list was changed into this new one, below:
0001
0002 0002
0003 0003 0003 0003
0004 0004
0005 0005 0005
0006
0007 0007 0007 0007
0008 0008 0008 0008 0008
0009
0010
0011 0011 0011 0011
0013
0014
0015 0015
0016
0017 0017
0018
0019 0019
0020
0021
0022 0022 0022 0022 0022
0023
0024 0024 0024 0024 0024
0025 0025
0026
0027 0027 0027 0027 0027
0028 0028 0028
0029 0029 0029 0029 0029
0032 0032 0032
0035 0035
0036
0037 0037 0037 0037
0040 0040
0041
0042
0043 0043
0044 0044 0044 0044
0045 0045
0046 0046 0046 0046
0047 0047 0047 0047 0047
0048 0048
0049 0049
0050 0050 0050
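The grouping step (every run of identical lines joined onto one line) can also be sketched in Python with itertools.groupby. The helper name group_equal_lines is hypothetical, and the input is assumed to be already sorted, as it is at this point in the procedure.

```python
from itertools import groupby


def group_equal_lines(sorted_lines):
    """Join each run of identical consecutive lines with single spaces."""
    # groupby only merges neighbours, which is why the list must be sorted first.
    return [" ".join(run) for _, run in groupby(sorted_lines)]


lines = ["0001", "0002", "0002", "0003", "0003", "0003", "0003", "0004", "0004"]
grouped = group_equal_lines(lines)
print("\n".join(grouped))
```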
Finally, I had to get rid of all the numbers which were present fewer than four times! Indeed, only the integers repeated at least four times in that list seemed useful. The suitable S/R to do so is:
SEARCH ^(?!(\d{4})( \1){3}).+\R
REPLACE EMPTY
NOTE:
The general regex ^(?!(\d{4})( \1){N}).+\R deletes all the lines where the current number is present between 1 and N times. So:
- If N = 1, every number present ONCE in the list will be deleted
- If N = 2, every number present ONCE or TWICE in the list will be deleted
- If N = 3, every number present ONCE, TWICE or THREE times in the list will be deleted
- If N = 4, every number present between ONCE and FOUR times in the list will be deleted
- And so on…
After clicking ONCE, on the Replace All button, I got the final text, below :
0003 0003 0003 0003
0007 0007 0007 0007
0008 0008 0008 0008 0008
0011 0011 0011 0011
0022 0022 0022 0022 0022
0024 0024 0024 0024 0024
0027 0027 0027 0027 0027
0029 0029 0029 0029 0029
0037 0037 0037 0037
0044 0044 0044 0044
0046 0046 0046 0046
0047 0047 0047 0047 0047
Finally, from this text, it’s easy to see that the most frequent numbers in that random list of 105 numbers are the six integers 8, 22, 24, 27, 29 and 47, which are each present five times :-))
A second example :
I will not give details about it. I’ll just give the original random list of integers and the final list of the most frequent integers found
Let’s suppose a list of 300 integers, with values from 1 to 150, placed in 15 rows of 20 columns, each, below :
56 142 24 68 122 132 35 127 56 29 119 97 3 143 21 72 138 109 18 124 51 42 144 5 100 39 60 12 101 94 16 118 108 61 29 125 150 67 60 57 22 82 148 9 29 111 138 123 108 130 47 1 141 75 107 124 58 24 47 46 121 78 107 51 92 21 114 75 105 62 114 7 89 77 63 39 21 131 126 107 50 13 85 26 33 103 112 74 122 62 11 86 22 90 53 143 74 122 26 109 96 128 148 85 3 18 88 132 90 86 150 118 80 20 41 147 91 6 3 45 143 139 145 52 150 111 132 73 86 30 125 28 66 24 61 41 76 108 16 51 138 78 50 52 125 88 11 145 13 25 111 15 103 124 94 2 1 80 74 6 58 14 78 6 27 39 75 117 69 98 53 1 71 11 60 15 21 115 129 2 10 147 8 45 20 90 41 29 3 101 44 116 52 39 141 132 102 33 57 110 21 43 16 33 51 59 78 116 116 23 50 18 114 106 8 93 96 25 6 71 6 31 58 49 114 91 17 9 30 99 113 137 16 131 29 102 40 133 34 147 98 7 81 127 136 132 126 69 48 5 54 128 94 85 11 134 71 92 108 37 54 121 118 65 124 58 122 130 67 77 26 65 136 14 149 146 117 54 60 20 147 103 28 129 32 94 139 111 122 74 146 86 83 100 75 100 48 48 99 112
At the end, after the third regex S/R , you should get the final text, below :
0003 0003 0003 0003
0006 0006 0006 0006 0006
0011 0011 0011 0011
0016 0016 0016 0016
0021 0021 0021 0021 0021
0029 0029 0029 0029 0029
0039 0039 0039 0039
0051 0051 0051 0051
0058 0058 0058 0058
0060 0060 0060 0060
0074 0074 0074 0074
0075 0075 0075 0075
0078 0078 0078 0078
0086 0086 0086 0086
0094 0094 0094 0094
0108 0108 0108 0108
0111 0111 0111 0111
0114 0114 0114 0114
0122 0122 0122 0122 0122
0124 0124 0124 0124
0132 0132 0132 0132 0132
0147 0147 0147 0147
Now, it is not difficult to see that the most frequent numbers in that random list of 300 numbers, between 1 and 150, are the five integers 6, 21, 29, 122 and 132, which are each present five times :-))
A third example ( without explanations, just try ! )
Let’s suppose a list of 100 integers, with values from 1 to 999, placed in 10 rows of 10 columns, each, below :
591 132 551 647 337 570 610 427 281 868 266 424 760 306 46 262 239 178 11 752 236 97 50 415 237 198 444 63 77 602 189 562 36 334 822 704 759 242 651 306 39 998 172 606 973 846 854 687 759 304 865 50 5 583 685 888 510 468 742 144 612 948 538 802 531 657 300 779 817 392 227 231 984 466 670 203 852 879 164 775 362 211 981 675 889 273 86 184 485 643 180 390 690 292 906 902 245 933 679 931
The last S/R is, even, useless, because the numbers are, mostly, present ONCE, only !
=> The most frequent numbers, in that random list of 100 numbers, between 1 and 999, are the three integers 50, 306 and 759, which are present two times !
A final example :
Let’s suppose a list of 1000 integers, with values from 1 to 30, placed in 50 rows of 20 columns, each, below :
14 3 10 12 28 16 19 10 3 25 2 14 8 8 27 8 1 20 27 13 25 30 5 13 25 8 9 29 4 7 19 7 13 18 18 23 25 8 15 4 7 17 15 27 17 1 19 12 5 22 7 18 2 20 11 6 22 26 2 20 22 20 8 27 26 26 6 29 19 22 17 12 22 7 27 1 16 24 3 29 26 7 9 16 2 8 3 11 5 17 4 20 2 5 16 11 17 7 2 1 15 20 11 11 5 11 18 24 3 10 2 30 29 23 17 21 14 12 5 11 27 10 16 2 15 22 26 8 12 21 18 16 4 2 5 27 18 28 17 3 10 2 27 4 20 19 14 11 18 16 29 2 11 7 1 29 29 6 18 26 26 10 30 21 6 10 7 6 30 27 2 5 25 25 22 24 17 8 16 21 13 27 16 19 16 21 28 23 30 24 12 24 5 30 14 5 21 2 22 11 20 2 19 21 29 23 21 8 21 15 26 22 28 22 13 27 1 6 14 7 11 20 3 17 9 4 9 5 7 18 21 20 11 14 21 22 6 29 22 21 21 25 7 20 28 18 1 30 4 25 28 10 24 23 8 9 17 24 6 11 21 10 28 24 1 24 29 8 7 28 1 14 10 23 14 12 28 30 21 11 13 11 3 18 30 15 2 13 29 14 22 17 30 16 17 9 24 8 11 23 29 7 21 3 25 23 17 28 25 30 26 19 25 29 6 15 20 9 30 17 23 26 30 16 5 21 22 13 24 24 16 27 24 5 1 28 25 26 21 11 9 5 3 23 19 3 7 30 3 9 25 29 12 3 14 19 23 25 26 20 6 9 14 15 12 27 2 2 27 28 23 25 13 1 13 16 24 10 28 6 5 8 5 6 24 20 22 15 9 6 19 26 27 15 15 21 12 24 27 9 22 5 18 18 23 25 20 7 9 7 21 21 24 19 21 1 7 14 20 8 5 7 23 3 26 10 8 27 26 3 5 2 27 15 29 2 28 18 5 19 19 18 14 26 15 23 2 18 4 7 5 30 5 9 8 17 27 2 24 21 21 27 11 25 20 5 28 4 26 3 9 13 4 22 26 4 30 9 13 14 24 29 11 6 26 20 30 1 2 11 2 7 20 10 3 26 4 3 4 27 26 30 4 9 13 9 15 28 23 1 10 1 3 30 27 29 4 28 11 8 3 1 27 23 30 30 6 14 15 28 7 29 24 8 23 8 4 15 24 10 17 18 27 19 17 29 25 7 5 8 21 22 24 8 15 16 10 29 7 12 1 18 19 3 22 1 13 16 26 27 4 3 16 30 7 13 14 8 28 4 17 10 8 11 6 8 13 13 27 19 14 21 28 26 26 20 26 5 30 14 22 23 9 28 11 21 12 3 11 7 26 16 14 4 20 24 15 12 13 4 12 24 8 9 25 1 29 5 24 24 13 1 5 26 14 19 12 27 19 17 12 14 7 6 3 26 24 11 19 1 1 2 3 13 19 8 18 14 3 13 29 25 14 30 12 22 14 14 20 12 2 2 13 26 7 28 12 26 2 13 13 23 22 6 11 1 25 23 12 18 24 1 10 17 23 4 28 14 6 13 27 7 25 2 25 27 12 14 10 7 8 9 19 1 19 14 10 29 17 5 9 8 30 12 25 16 3 14 26 30 7 27 2 15 3 28 4 11 6 2 28 13 3 14 15 
18 22 11 18 30 19 6 24 30 22 14 8 29 2 13 27 2 1 8 23 24 5 1 1 24 23 17 6 25 17 2 16 26 19 13 18 22 21 27 10 13 7 27 4 8 30 15 11 3 27 26 22 22 5 17 14 28 27 14 11 2 14 8 26 4 2 28 4 25 29 10 16 23 6 10 21 23 4 19 25 13 4 26 8 3 27 2 19 2 30 8 25 1 1 4 8 15 19 19 25 4 7 7 21 13 24 21 26 13 14 22 6 9 10 26 7 29 25 17 11 4 8 30 26 6 5 8 23 16 13 23 17 2 21 4 24 4 13 25 12 12 13 16 19 11 19 11 30 6 19 7 12 10 18 14 1 7 20 19 28 1 28 6 7 9 21 7 11 9 10 7 1 16 27 20 27 16 30 21 23 25 25 5 22 13 15 27 26 22 4 28 13 25 18 29 7 5 25 19 28 19 20 18 10 1 30 24 13 13 29 16 8 8 15 25 7 20 12 18 9 9 17 13 19 18 29 9 14 3 20 29 28 18 21 19 18 21 4 15 20 7 20 24 6 27 3 10 27 14 15 7 4 22 7 17
For the last S/R, I chose N = 38, because there are, only, 30 possible values and most numbers are, therefore, present, very often !
Hence, the last regex S/R is :
SEARCH
^(?!(\d{4})( \1){38}).+\R
REPLACE
EMPTY
=> The most frequent numbers, in that random list of 1000 numbers, between 1 and 30, are the six integers, below :
7 ( present 45 times ), 8 and 13 ( present 40 times ), 14 and 26 ( present 39 times ) and 27 ( present 41 times ) !
Best Regards,
guy038
-
-
Hello guy038. I must say… I never thought of this method.
But, you are the best.
Thanks A LOT ! WORKS !
-
BUT the only problem is that it works on your examples, not on mine.
Can the \R in your regular expressions be replaced with something else?
-
@guy038 said:
SEARCH ^(\d(\d(\d(\d)?)?)?)(?:\t|\R)
REPLACE (?2:0)(?3:0)(?4:0)\1\r\n
This regex of yours
^(\d(\d(\d(\d)?)?)?)(?:\t|\R)
doesn’t work for me. It is the first one and the most important. The other regexes work fine.
But I found another way to do this. Suppose I have:
17 25 30 37 38 47
2 6 7 17 30 42
3 17 20 38 44 45
4 5 6 30 36 42
Search: (a single space)
Replace by: \r
then
Search: ^(a*)
This matches at the beginning of each line
Replace by: 00
and I will get something like this:
0017
0025
0030
0037
0038
0047
002
006
007
0017
0030
0042
003
0017
0020
0038
0044
0045
004
005
006
0030
0036
0042 -
@guy038 said:
SEARCH (\d{4})\R\1
REPLACE \1 \1 , with a space character, between the two back-references, \1
This, again, is not working for me:
(\d{4})\R\1
even though I pressed the “Replace All” button many times.
-
I know you are a regex fan, but just to give you an idea of how a Python script
would look to solve such a problem:
from collections import Counter
x = editor.getText().replace('\r\n',' ').split(' ')  # get the list of numbers
y = [y for y in x if y !='']  # get rid of the empty ones
counted_list = Counter(y)  # count how often each number occurs
for item in counted_list.most_common(4):  # iterate over the top 4
    console.write('{}\n'.format(item))  # and print it to the console
I used the list of 1000 integers @guy038 posted.
The result in the console would be
('7', 45)
('27', 41)
('8', 40)
('13', 40)
Meaning that number 7 occurred 45 times
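For trying the idea outside Notepad++, a standalone variant might look like the sketch below. It does not use the plugin's editor and console objects; the function name most_common_numbers is hypothetical, and the sample text is the four rows posted earlier in this thread.

```python
from collections import Counter


def most_common_numbers(text, top=4):
    """Return the `top` most frequent integers in `text` as (number, count) pairs."""
    # split() with no argument handles spaces, tabs and line breaks alike.
    counts = Counter(int(token) for token in text.split())
    return counts.most_common(top)


sample = "17 25 30 37 38 47\n2 6 7 17 30 42\n3 17 20 38 44 45\n4 5 6 30 36 42"
result = most_common_numbers(sample, top=2)
print(result)
```

In this sample, 17 and 30 each occur three times, so they are the two pairs reported.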
Cheers
Claudia -
@Claudia-Frank said:
an idea how a python script
Hello Claudia, I don’t know Python, so I really don’t know what to do with the Python script you wrote above.
-
Hello Claudia,
I’ve just tested your Python solution, changing it to report the six most common numbers with the
counted_list.most_common(6)
expression, and it returns exactly the numbers that I had previously found for the list of 1000 random integers :-)
How elegant a Python (or Lua, I suppose) script is, compared to my complicated regex cooking!!!
Cheers,
guy038
-
Claudia and guy038, please tell me how to use this python script !
-
a short tutorial for this example will be great !
-
What needs to be done first is described here.
In case you haven’t installed the Python Script plugin yet, I would suggest using the MSI package instead of the plugin manager.
Short version: once the Python Script plugin has been installed, go to
Plugins->Python Script->New Script
give it a name and press Save.
A new empty editor should appear.
Copy the content into it and save it.
Do NOT reformat the code, as Python is strict about whitespace.
Open the Python Script console by clicking on
Plugins->Python Script->Show Console
Open your file with the numbers and run the script by clicking on
Plugins->Python Script->Scripts->NAME_OF_YOUR_SCRIPT
Cheers
Claudia -
WORKS GREAT Claudia.
Thanks a lot !
-
By the way, Claudia, how can I use Python (like your script) to actually modify the .txt file? For now, Python only shows the results of the script in the console. How can I use a Python script to search and replace something in .txt files?
-
If you want to dive into Python, the first thing, of course, is to get some basic knowledge of the language itself.
Either use one of the YouTube videos or, if you prefer to read, https://www.python.org/about/gettingstarted/ .
Note, the plugin uses Python 2, NOT 3 (there are differences, nothing too critical, but they can be confusing
if you start learning the language and try something which works in py3 but not in py2).
Next, the help pages which come with the plugin itself:
Plugins->Python Script->Context-Help
And last but not least, Scintilla’s help at http://www.scintilla.org/ScintillaDoc.html to get a better
understanding of how the editor works.
The console is a good starting point to test things first.
To get all functions and attributes of a py object you can use the dir command.
So, if you do the following in the console you will get the list of functions of this object
dir(editor)
I prefer not to have to scroll sideways, so I use
print '\n'.join(dir(editor))
To see what the parameters of a function are, use the help command, like
help(editor.insertText)
Next, if you search the forum you will find many scripts that solve particular issues.
One of my first posts answered a question about unit conversion:
https://notepad-plus-plus.org/community/topic/10966/unit-conversion-plugin/13
And finally, ask here if you have a specific question.
Cheers
Claudia
Ahh… I would suggest making the following change in Notepad++:
Settings->Preferences->Language, check “Replace by space”, because
Python doesn’t like it if you mix tabs and spaces for indentation.
-
Regarding print '\n'.join(dir(editor))
I don’t think that ‘print’ outputs to the Pythonscript console window by default.
From the following in the original startup.py:
# This sets the stdout to be the currently active document, so print "hello world",
# will insert "hello world" at the current cursor position of the current document
sys.stdout = editor
This is of dubious value, especially since a ‘print’ used in this way inserts the text specified plus a UNIX-style line ending into your current file (which likely has Windows-style line endings!).
I, and likely also Claudia, have changed this line in startup.py to be:
sys.stdout = console
thus changing ‘print’ statements to output their data to the Pythonscript console (great for debugging your scripts!)
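The mechanism behind that one-line change can be illustrated outside Notepad++: print writes to whatever object sys.stdout currently points at, as long as it has a write method. The sketch below uses a made-up FakeConsole class as a stand-in for the plugin's console object; it is an assumption for illustration, not the plugin's actual implementation.

```python
import sys


class FakeConsole(object):
    """Hypothetical stand-in for the plugin's console: it only collects text."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

    def flush(self):
        pass


fake_console = FakeConsole()
original_stdout = sys.stdout
sys.stdout = fake_console      # same idea as "sys.stdout = console"
try:
    print("hello world")       # goes to fake_console.write, not the terminal
finally:
    sys.stdout = original_stdout

captured = "".join(fake_console.chunks)
```

The captured text ends with a bare "\n", which is the UNIX-style line ending that print adds.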
As alluded to above, the Pythonscript console seems to use UNIX-style line endings. I found this out in an odd way. If you copy-and-paste from the console to an editing window with Windows line endings, the line-endings on the source text will be changed at the time of the paste to match the destination file format, so all is good. HOWEVER, what I did one time was to paste via the “Clipboard History” window. This action seems to preserve the original UNIX-style line endings at the destination! I was quite confused as to why I had inconsistent line-endings in my document, until I figured it out.
-
Scott, you are absolutely correct, I’ve changed this in startup.py
and for me this is much more convenient than using console.write to
print chars to the console.
Just a side note, the command
print '\n'.join(dir(editor))
should have been executed in the console itself, and there it works,
but if someone used it in a script, then it would print to the editor unless
you make the change Scott mentioned.
Thx for the info about copy/paste - I do this a lot but luckily I didn’t use the history ;-)
Cheers
Claudia