# Regex: select/match the numbers that are repeated most often

• Hello, I have 15 rows of 7 numbers each, with values from 1 to 50. How can I match the 4 numbers that are repeated most often across all 15 rows?

I suppose I must first select all numbers `\d+`
then, I have to divide all 2-digit numbers `\b[1-9]{2}\b` by all 1-digit numbers `\b[1-9]{1}\b`
or, I should select all numbers from `1-10`, then all numbers from `10-20` …and from `40-50`

I don’t know exactly, there should be a mathematics formula. In Excel I can use filters for this, or Sort from lowest to highest, etc

But how can I do this with regex?

• @Vasile-Caraus
It isn’t entirely clear what you mean by “match”. Do you simply want to know which 4 numbers appear most often or something else altogether?

If you know how to do this in Excel, then if I were you, I would import the document into Excel and do whatever it is you want done.

Otherwise, you should probably use a scripting language (AWK, PERL, Python, etc.). This doesn’t sound like a task that regex is best suited to do.

• Hello, Vasile,

First, I tried to guess what you wanted to achieve. After getting some random numbers from the Net, I spent a few hours, on and off, working out a method ! And, luckily, I found a solution. I used the `Random.org` site to generate test data, and the method below finds the most frequent integers in a table of 10,000 integers maximum, with values between 1 and 9999 maximum

On the `Random.org` site, the values of the random numbers can lie in the range ±1,000,000,000 but, because of the regexes involved, I preferred to limit this range to between 1 and 9999

As your table of numbers contains 15 rows of 7 columns, the total number of integers, with value between 1 and 50 , is 105

So, go, first, to the `Random.org` site, from the address, below :

https://www.random.org/integers/

I typed ( in `red` colour ) the following answers :

• Generate `105` random integers (maximum 10,000).

• Each integer should have a value between `1` and `50` (both inclusive; limits ±1,000,000,000).

• Format in `7` column(s).

• Note: The numbers are generated left to right !

And I clicked on the Get Numbers button

I got a 15 x 7 table of 105 random integers, below, that I copied/pasted in a new tab, in N++

``````
2	27	7	11	32	6	7
8	45	50	19	37	40	47
21	11	50	46	50	27	49
41	13	36	3	37	29	23
25	22	47	3	37	2	29
8	48	29	46	24	18	9
46	8	24	19	5	22	27
29	26	44	47	22	22	5
22	25	35	47	48	24	3
10	20	28	49	7	24	3
37	27	4	40	44	45	14
4	44	15	43	46	32	7
47	15	11	17	16	42	8
28	44	43	24	17	8	5
32	27	11	1	35	28	29
``````

In that output list, the integers are separated by a single tab character. As I intended to sort these values, I first needed to put them all in a one-column table.

Moreover, a template with leading zeros was necessary, so that these integers would sort correctly later ! So :

• A one-digit integer was changed into the integer `000#`
• A two-digit integer was changed into the integer `00##`
• A three-digit integer was changed into the integer `0###`
• A four-digit integer was kept as the integer `####`

The regex S/R which achieves these two goals was :

SEARCH `^(\d(\d(\d(\d)?)?)?)(?:\t|\R)`

REPLACE `(?2:0)(?3:0)(?4:0)\1\r\n`

After clicking, ONCE, on the Replace All button, I got the list, of 105 integers, below :

``````
0002
0027
0007
0011
0032
0006
0007
0008
0045
0050
0019
0037
0040
0047
0021
0011
0050
0046
0050
0027
0049
0041
0013
0036
0003
0037
0029
0023
0025
0022
0047
0003
0037
0002
0029
0008
0048
0029
0046
0024
0018
0009
0046
0008
0024
0019
0005
0022
0027
0029
0026
0044
0047
0022
0022
0005
0022
0025
0035
0047
0048
0024
0003
0010
0020
0028
0049
0007
0024
0003
0037
0027
0004
0040
0044
0045
0014
0004
0044
0015
0043
0046
0032
0007
0047
0015
0011
0017
0016
0042
0008
0028
0044
0043
0024
0017
0008
0005
0032
0027
0011
0001
0035
0028
0029
``````
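The same flatten-and-pad step can be sketched outside the editor in plain Python. This is only an illustration, not part of the regex solution above: `to_padded_column` is a hypothetical helper name, and `str.zfill` stands in for the conditional replacement `(?2:0)(?3:0)(?4:0)\1`.

```python
def to_padded_column(table_text, width=4):
    """Return every integer of a whitespace-separated table as a list of
    strings, left-padded with zeros to `width` digits (hypothetical helper)."""
    # str.split() with no argument splits on tabs, spaces and line breaks alike
    return [token.zfill(width) for token in table_text.split()]


# First two rows of the 15 x 7 sample table, tab-separated
sample = "2\t27\t7\t11\t32\t6\t7\n8\t45\t50\t19\t37\t40\t47\n"
padded = to_padded_column(sample)
print("\n".join(padded))
```

Because `split()` treats spaces and tabs the same way, this sketch should also cope with lists whose columns are separated by spaces rather than tabs.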

Using the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending, I obtained the sorted text, below :

``````
0001
0002
0002
0003
0003
0003
0003
0004
0004
0005
0005
0005
0006
0007
0007
0007
0007
0008
0008
0008
0008
0008
0009
0010
0011
0011
0011
0011
0013
0014
0015
0015
0016
0017
0017
0018
0019
0019
0020
0021
0022
0022
0022
0022
0022
0023
0024
0024
0024
0024
0024
0025
0025
0026
0027
0027
0027
0027
0027
0028
0028
0028
0029
0029
0029
0029
0029
0032
0032
0032
0035
0035
0036
0037
0037
0037
0037
0040
0040
0041
0042
0043
0043
0044
0044
0044
0044
0045
0045
0046
0046
0046
0046
0047
0047
0047
0047
0047
0048
0048
0049
0049
0050
0050
0050
``````
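Why the leading zeros matter for a lexicographic sort can be shown with a tiny Python experiment (an illustration only, using four values from the sample table):

```python
raw = ["2", "27", "7", "11"]

# A lexicographic (string) sort compares character by character,
# so "11" lands before "2" and "27" before "7"
print(sorted(raw))

# With a fixed width of four digits, string order and numeric order coincide
padded = [number.zfill(4) for number in raw]
print(sorted(padded))
```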

Then, I found a regex to put all occurrences of the same number on a single line. For instance, the four numbers 0003, on four consecutive lines, were displayed, after replacement, on the single line 0003 0003 0003 0003. So :

SEARCH `(\d{4})\R\1`

REPLACE `\1 \1` , with a space character, between the two back-references, `\1`

IMPORTANT : You must click, TWICE, on the Replace All button, in order to end this S/R. The first pass only joins identical lines in pairs; the second pass joins those pairs together

REMARK :

• If each number occurs ONCE or TWICE, only, in the current random list, you may, already, get the message : Replace All: 0 occurrences were replaced, while clicking a second time, on the Replace All button !

Thus, after TWO clicks on the Replace All button, that list was changed into this new one, below :

``````
0001
0002 0002
0003 0003 0003 0003
0004 0004
0005 0005 0005
0006
0007 0007 0007 0007
0008 0008 0008 0008 0008
0009
0010
0011 0011 0011 0011
0013
0014
0015 0015
0016
0017 0017
0018
0019 0019
0020
0021
0022 0022 0022 0022 0022
0023
0024 0024 0024 0024 0024
0025 0025
0026
0027 0027 0027 0027 0027
0028 0028 0028
0029 0029 0029 0029 0029
0032 0032 0032
0035 0035
0036
0037 0037 0037 0037
0040 0040
0041
0042
0043 0043
0044 0044 0044 0044
0045 0045
0046 0046 0046 0046
0047 0047 0047 0047 0047
0048 0048
0049 0049
0050 0050 0050
``````
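The need for two passes can be reproduced with a small Python sketch. It assumes that one Replace All click behaves like a single left-to-right, non-overlapping substitution, and it uses `\r?\n` in place of `\R`, because Python's `re` module does not know `\R`. The name `join_once` is made up for this illustration.

```python
import re

# Same idea as (\d{4})\R\1, with \r?\n standing in for \R
PAIR = re.compile(r"(\d{4})\r?\n\1")


def join_once(text):
    """One simulated 'Replace All' pass: join each matched pair of equal numbers."""
    return PAIR.sub(r"\1 \1", text)


column = "0003\n0003\n0003\n0003\n0003\n0004\n"
first = join_once(column)    # equal lines are joined two by two
second = join_once(first)    # the pairs are now joined to each other
print(first)
print(second)
```

After the first pass the five `0003` values sit on three lines (two pairs and a single); the second pass brings them onto one line, and a third pass changes nothing.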

Finally, I had to get rid of all the numbers which were present fewer than four times ! Indeed, only the integers repeated at least four times in that list seemed useful. The suitable S/R to do so is :

SEARCH `^(?!(\d{4})( \1){3}).+\R`

REPLACE (leave this field empty)

NOTE :

• The general regex `^(?!(\d{4})( \1){N}).+\R` deletes all the lines where the current number is present between 1 and N times. So :

• If N = 1, every number, present ONCE, in the list, will be deleted
• If N = 2, every number, present ONCE or TWICE, in the list, will be deleted
• If N = 3, every number, present ONCE, TWICE or THREE times, in the list, will be deleted
• If N = 4, every number, present, between ONCE and FOUR times, in the list, will be deleted
• And so on…

After clicking ONCE, on the Replace All button, I got the final text, below :

``````
0003 0003 0003 0003
0007 0007 0007 0007
0008 0008 0008 0008 0008
0011 0011 0011 0011
0022 0022 0022 0022 0022
0024 0024 0024 0024 0024
0027 0027 0027 0027 0027
0029 0029 0029 0029 0029
0037 0037 0037 0037
0044 0044 0044 0044
0046 0046 0046 0046
0047 0047 0047 0047 0047
``````

Finally, from this text, it’s easy to see that the most frequent numbers, in that random list of 105 numbers, are the six integers 8, 22, 24, 27, 29 and 47, which are present five times :-))

A second example :

I will not give details about it. I’ll just give the original random list of integers and the final list of the most frequent integers found

Let’s suppose a list of 300 integers, with values from 1 to 150, placed in 15 rows of 20 columns, each, below :

``````
56	142	24	68	122	132	35	127	56	29	119	97	3	143	21	72	138	109	18	124
51	42	144	5	100	39	60	12	101	94	16	118	108	61	29	125	150	67	60	57
22	82	148	9	29	111	138	123	108	130	47	1	141	75	107	124	58	24	47	46
121	78	107	51	92	21	114	75	105	62	114	7	89	77	63	39	21	131	126	107
50	13	85	26	33	103	112	74	122	62	11	86	22	90	53	143	74	122	26	109
96	128	148	85	3	18	88	132	90	86	150	118	80	20	41	147	91	6	3	45
143	139	145	52	150	111	132	73	86	30	125	28	66	24	61	41	76	108	16	51
138	78	50	52	125	88	11	145	13	25	111	15	103	124	94	2	1	80	74	6
58	14	78	6	27	39	75	117	69	98	53	1	71	11	60	15	21	115	129	2
10	147	8	45	20	90	41	29	3	101	44	116	52	39	141	132	102	33	57	110
21	43	16	33	51	59	78	116	116	23	50	18	114	106	8	93	96	25	6	71
6	31	58	49	114	91	17	9	30	99	113	137	16	131	29	102	40	133	34	147
98	7	81	127	136	132	126	69	48	5	54	128	94	85	11	134	71	92	108	37
54	121	118	65	124	58	122	130	67	77	26	65	136	14	149	146	117	54	60	20
147	103	28	129	32	94	139	111	122	74	146	86	83	100	75	100	48	48	99	112
``````

At the end, after the third regex S/R , you should get the final text, below :

``````
0003 0003 0003 0003
0006 0006 0006 0006 0006
0011 0011 0011 0011
0016 0016 0016 0016
0021 0021 0021 0021 0021
0029 0029 0029 0029 0029
0039 0039 0039 0039
0051 0051 0051 0051
0058 0058 0058 0058
0060 0060 0060 0060
0074 0074 0074 0074
0075 0075 0075 0075
0078 0078 0078 0078
0086 0086 0086 0086
0094 0094 0094 0094
0108 0108 0108 0108
0111 0111 0111 0111
0114 0114 0114 0114
0122 0122 0122 0122 0122
0124 0124 0124 0124
0132 0132 0132 0132 0132
0147 0147 0147 0147
``````

Now, it is not difficult to see that the most frequent numbers, in that random list of 300 numbers, between 1 and 150, are the five integers 6, 21, 29, 122 and 132, which are present five times :-))

A third example ( without explanations, just try ! )

Let’s suppose a list of 100 integers, with values from 1 to 999, placed in 10 rows of 10 columns, each, below :

``````
591	132	551	647	337	570	610	427	281	868
266	424	760	306	46	262	239	178	11	752
236	97	50	415	237	198	444	63	77	602
189	562	36	334	822	704	759	242	651	306
39	998	172	606	973	846	854	687	759	304
865	50	5	583	685	888	510	468	742	144
612	948	538	802	531	657	300	779	817	392
227	231	984	466	670	203	852	879	164	775
362	211	981	675	889	273	86	184	485	643
180	390	690	292	906	902	245	933	679	931
``````

The last S/R is, even, useless, because the numbers are, mostly, present ONCE, only !

=> The most frequent numbers, in that random list of 100 numbers, between 1 and 999, are the three integers 50, 306 and 759, which are present two times !

A final example :

Let’s suppose a list of 1000 integers, with values from 1 to 30, placed in 50 rows of 20 columns, each, below :

``````
14	3	10	12	28	16	19	10	3	25	2	14	8	8	27	8	1	20	27	13
25	30	5	13	25	8	9	29	4	7	19	7	13	18	18	23	25	8	15	4
7	17	15	27	17	1	19	12	5	22	7	18	2	20	11	6	22	26	2	20
22	20	8	27	26	26	6	29	19	22	17	12	22	7	27	1	16	24	3	29
26	7	9	16	2	8	3	11	5	17	4	20	2	5	16	11	17	7	2	1
15	20	11	11	5	11	18	24	3	10	2	30	29	23	17	21	14	12	5	11
27	10	16	2	15	22	26	8	12	21	18	16	4	2	5	27	18	28	17	3
10	2	27	4	20	19	14	11	18	16	29	2	11	7	1	29	29	6	18	26
26	10	30	21	6	10	7	6	30	27	2	5	25	25	22	24	17	8	16	21
13	27	16	19	16	21	28	23	30	24	12	24	5	30	14	5	21	2	22	11
20	2	19	21	29	23	21	8	21	15	26	22	28	22	13	27	1	6	14	7
11	20	3	17	9	4	9	5	7	18	21	20	11	14	21	22	6	29	22	21
21	25	7	20	28	18	1	30	4	25	28	10	24	23	8	9	17	24	6	11
21	10	28	24	1	24	29	8	7	28	1	14	10	23	14	12	28	30	21	11
13	11	3	18	30	15	2	13	29	14	22	17	30	16	17	9	24	8	11	23
29	7	21	3	25	23	17	28	25	30	26	19	25	29	6	15	20	9	30	17
23	26	30	16	5	21	22	13	24	24	16	27	24	5	1	28	25	26	21	11
9	5	3	23	19	3	7	30	3	9	25	29	12	3	14	19	23	25	26	20
6	9	14	15	12	27	2	2	27	28	23	25	13	1	13	16	24	10	28	6
5	8	5	6	24	20	22	15	9	6	19	26	27	15	15	21	12	24	27	9
22	5	18	18	23	25	20	7	9	7	21	21	24	19	21	1	7	14	20	8
5	7	23	3	26	10	8	27	26	3	5	2	27	15	29	2	28	18	5	19
19	18	14	26	15	23	2	18	4	7	5	30	5	9	8	17	27	2	24	21
21	27	11	25	20	5	28	4	26	3	9	13	4	22	26	4	30	9	13	14
24	29	11	6	26	20	30	1	2	11	2	7	20	10	3	26	4	3	4	27
26	30	4	9	13	9	15	28	23	1	10	1	3	30	27	29	4	28	11	8
3	1	27	23	30	30	6	14	15	28	7	29	24	8	23	8	4	15	24	10
17	18	27	19	17	29	25	7	5	8	21	22	24	8	15	16	10	29	7	12
1	18	19	3	22	1	13	16	26	27	4	3	16	30	7	13	14	8	28	4
17	10	8	11	6	8	13	13	27	19	14	21	28	26	26	20	26	5	30	14
22	23	9	28	11	21	12	3	11	7	26	16	14	4	20	24	15	12	13	4
12	24	8	9	25	1	29	5	24	24	13	1	5	26	14	19	12	27	19	17
12	14	7	6	3	26	24	11	19	1	1	2	3	13	19	8	18	14	3	13
29	25	14	30	12	22	14	14	20	12	2	2	13	26	7	28	12	26	2	13
13	23	22	6	11	1	25	23	12	18	24	1	10	17	23	4	28	14	6	13
27	7	25	2	25	27	12	14	10	7	8	9	19	1	19	14	10	29	17	5
9	8	30	12	25	16	3	14	26	30	7	27	2	15	3	28	4	11	6	2
28	13	3	14	15	18	22	11	18	30	19	6	24	30	22	14	8	29	2	13
27	2	1	8	23	24	5	1	1	24	23	17	6	25	17	2	16	26	19	13
18	22	21	27	10	13	7	27	4	8	30	15	11	3	27	26	22	22	5	17
14	28	27	14	11	2	14	8	26	4	2	28	4	25	29	10	16	23	6	10
21	23	4	19	25	13	4	26	8	3	27	2	19	2	30	8	25	1	1	4
8	15	19	19	25	4	7	7	21	13	24	21	26	13	14	22	6	9	10	26
7	29	25	17	11	4	8	30	26	6	5	8	23	16	13	23	17	2	21	4
24	4	13	25	12	12	13	16	19	11	19	11	30	6	19	7	12	10	18	14
1	7	20	19	28	1	28	6	7	9	21	7	11	9	10	7	1	16	27	20
27	16	30	21	23	25	25	5	22	13	15	27	26	22	4	28	13	25	18	29
7	5	25	19	28	19	20	18	10	1	30	24	13	13	29	16	8	8	15	25
7	20	12	18	9	9	17	13	19	18	29	9	14	3	20	29	28	18	21	19
18	21	4	15	20	7	20	24	6	27	3	10	27	14	15	7	4	22	7	17
``````

For the last S/R, I chose N = 38, because there are, only, 30 possible values and most numbers are, therefore, present, very often !

Hence, the last regex S/R is :

SEARCH `^(?!(\d{4})( \1){38}).+\R`

REPLACE (leave this field empty)

=> The most frequent numbers, in that random list of 1000 numbers, between 1 and 30, are the six integers, below :

7 ( present 45 times ), 8 and 13 ( present 40 times ), 14 and 26 ( present 39 times ) and 27 ( present 41 times ) !

Best Regards,

guy038

But, you are the best.

Thanks A LOT ! WORKS !

• BUT, the only problem is that it works on your examples, not on mine.

Can the `\R` in your regular expressions be replaced with something else?
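On the `\R` question: in Notepad++ regular expressions, `\R` matches a line break of any kind, so for ordinary text files the explicit alternation `(?:\r\n|\n|\r)` should be a workable substitute. The Python sketch below only illustrates that alternation; Python's `re` module has no `\R`, which is why the alternation is spelled out, and the name `LINE_BREAK` is made up here.

```python
import re

# Explicit stand-in for \R: Windows (\r\n), UNIX (\n) and old Mac (\r) breaks.
# \r\n comes first so that a Windows line break is consumed as one unit.
LINE_BREAK = r"(?:\r\n|\n|\r)"

mixed = "0003\r\n0003\n0003\r0004"

# Split on any kind of line break
fields = re.split(LINE_BREAK, mixed)
print(fields)

# One pass of the "join equal numbers" search with the explicit alternation
joined = re.sub(r"(\d{4})" + LINE_BREAK + r"\1", r"\1 \1", mixed)
print(repr(joined))
```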

• @guy038 said:

SEARCH ^(\d(\d(\d(\d)?)?)?)(?:\t|\R)
REPLACE (?2:0)(?3:0)(?4:0)\1\r\n

This regex of yours, `^(\d(\d(\d(\d)?)?)?)(?:\t|\R)`, doesn’t work for me. It is the first one and the most important. The other regex works fine.

But I found another way to do this. Suppose I have:

17 25 30 37 38 47
2 6 7 17 30 42
3 17 20 38 44 45
4 5 6 30 36 42

Search: a single space
Replace by: `\r`

then

Search: `^(a*)` (this matches at the beginning of each line)
Replace by: `00`

and I will get something like this:

0017
0025
0030
0037
0038
0047
002
006
007
0017
0030
0042
003
0017
0020
0038
0044
0045
004
005
006
0030
0036
0042

• @guy038 said:

SEARCH (\d{4})\R\1

REPLACE \1 \1 , with a space character, between the two back-references, \1

This one, `(\d{4})\R\1`, is not working for me either, even though I pressed the “Replace All” button many times.

• @Vasile-Caraus

I know you are a regex fan, but just to give you an idea of how a Python script
that solves such a problem could look:

``````
from collections import Counter

# `editor` and `console` are provided by the Notepad++ Python Script plugin
x = editor.getText().split()                         # get the list of numbers (split on any whitespace)
counted_list = Counter(x)                            # count how often each number occurs
for item in counted_list.most_common(4):             # iterate over the top 4
    console.write('{}\n'.format(item))               # and print it to the console
``````

I used the list of 1000 integers @guy038 posted.
The result in the console would be

('7', 45)
('27', 41)
('8', 40)
('13', 40)

Meaning that number 7 occurred 45 times
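One thing to keep in mind with a fixed top 4 is ties. In the first 15 x 7 example of this thread, six numbers share the highest count of five, so a plain top 4 would show only four of them. The standalone sketch below (plain Python, no Notepad++ objects; `most_frequent_with_ties` is a hypothetical helper) keeps every number whose count reaches that of the k-th entry.

```python
from collections import Counter

# The 15 x 7 sample table from earlier in the thread
TABLE = """
2 27 7 11 32 6 7
8 45 50 19 37 40 47
21 11 50 46 50 27 49
41 13 36 3 37 29 23
25 22 47 3 37 2 29
8 48 29 46 24 18 9
46 8 24 19 5 22 27
29 26 44 47 22 22 5
22 25 35 47 48 24 3
10 20 28 49 7 24 3
37 27 4 40 44 45 14
4 44 15 43 46 32 7
47 15 11 17 16 42 8
28 44 43 24 17 8 5
32 27 11 1 35 28 29
"""


def most_frequent_with_ties(text, k):
    """Return (number, count) pairs for the top k counts, keeping all ties."""
    ranked = Counter(text.split()).most_common()   # sorted by descending count
    if len(ranked) <= k:
        return ranked
    cutoff = ranked[k - 1][1]                      # count of the k-th entry
    return [(number, count) for number, count in ranked if count >= cutoff]


winners = most_frequent_with_ties(TABLE, 4)
print(sorted(winners, key=lambda pair: int(pair[0])))
```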

Cheers
Claudia

• @Claudia-Frank said:

an idea how a python script

Hello Claudia, I don’t know Python, so I really don’t know what to do with the Python script you wrote above.

• Hello Claudia,

I’ve just tested your Python solution, changing it to get the six most common numbers with the `counted_list.most_common(6)` expression, and it returns exactly the numbers that I had previously found for the list of 1000 random integers :-)

How elegant a Python ( or Lua, I suppose ) script is, compared to my complicated regex’s cooking !!!

Cheers,

guy038

• Claudia and guy038, please tell me how to use this python script !

• a short tutorial for this example will be great !

• @Vasile-Caraus

What needs to be done first is described here.

Just in case you haven’t installed the Python Script plugin yet, I would suggest using the MSI package instead of the plugin manager.

Short version: once the Python Script plugin has been installed, go to
Plugins->Python Script->New Script
give it a name and press save.
A new empty editor should appear.
Copy the content into it and save it.
Do NOT reformat the code, as Python is strict about whitespace.

Open the python script console by clicking on
Plugins->Python Script->Show Console

Open your file with the numbers and run the script by clicking on
Plugins->Python Script->Scripts->NAME_OF_YOUR_SCRIPT
Cheers
Claudia

• WORKS GREAT Claudia.

Thanks a lot !

• By the way, Claudia, how can I use Python (like your script) to actually modify the .txt file? For now, the script only shows its results in the console. How can I use a Python script to search and replace something in .txt files?

• @Vasile-Caraus

If you want to dive into Python, the first thing, of course, is to get some basic knowledge of the language itself.
Note, the plugin uses Python 2, NOT 3 (there are differences, nothing too critical, but they can be confusing
if you start learning the language and try something which works in py3 but not in py2).

Next the help pages which come with the plugin itself.
Plugins->Python Script->Context-Help

And last but not least, Scintilla’s help at http://www.scintilla.org/ScintillaDoc.html to get a better
understanding of how the editor works.

The console is a good starting point to test things first.
In order to get all functions, attributes of a py object you can use the dir command.
So, if you do the following in the console you will get the list of functions of this object

``````
dir(editor)
``````

I prefer not to have to scroll sideways, so I use

``````
print '\n'.join(dir(editor))
``````

In order to see what the parameters of a function are use the help command like

``````
help(editor.insertText)
``````

Next, if you search the forum you will find many scripts that solve particular issues;
one of my first posts answered a question about unit conversion.

And finally, ask here if you have a specific question.

Cheers
Claudia

Ahh… I would suggest making the following change in Notepad++:
Settings->Preferences->Language, check “replace by space”, because
Python doesn’t like it if you mix tabs and spaces for indentation.
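As a rough illustration of the search-and-replace question above, here is a standalone sketch in plain Python that rewrites a text file on disk. It deliberately avoids the plugin's `editor` object (the plugin's Context-Help describes the editor's own methods), and `replace_in_file` is a hypothetical helper name, not something the plugin provides.

```python
import io
import re


def replace_in_file(path, pattern, replacement):
    """Rewrite the text file at `path`, replacing every regex match.
    Returns the number of replacements made (hypothetical helper)."""
    # newline="" leaves the file's own line endings untouched on read and write
    with io.open(path, "r", encoding="utf-8", newline="") as handle:
        text = handle.read()
    new_text, count = re.subn(pattern, replacement, text)
    with io.open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(new_text)
    return count


# Example call (assumes a file named numbers.txt exists next to the script):
# replace_in_file("numbers.txt", r"\t", "\r\n")   # one number per line
```

`io.open` is used because it behaves the same way in Python 2 and Python 3; the file is assumed to be UTF-8 encoded.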

• @Claudia-Frank

Regarding `print '\n'.join(dir(editor))`

I don’t think that ‘print’ outputs to the Pythonscript console window by default.

From the following in the original startup.py:

`# This sets the stdout to be the currently active document, so print "hello world",`
`# will insert "hello world" at the current cursor position of the current document`
`sys.stdout = editor`

This is of dubious value, especially since a ‘print’ used in this way inserts the text specified plus a UNIX-style line ending into your current file (which likely has Windows-style line endings!).

I, and likely also Claudia, have changed this line in startup.py to be:

sys.stdout = console

thus changing ‘print’ statements to output their data to the Pythonscript console (great for debugging your scripts!)
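The mechanism behind that one-line change can be sketched in plain Python: `print` simply calls `write()` on whatever object `sys.stdout` points to. The `Collector` class below is a made-up stand-in for the plugin's `editor` and `console` objects, and the `print(...)` call form is used so that the sketch also runs under Python 3.

```python
import sys


class Collector(object):
    """Made-up stand-in for `editor` / `console`: any object with a
    write() method can be assigned to sys.stdout."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

    def flush(self):
        pass


collector = Collector()
original_stdout = sys.stdout
sys.stdout = collector            # same idea as: sys.stdout = console
try:
    print("hello world")
finally:
    sys.stdout = original_stdout  # always restore the real stdout

captured = "".join(collector.chunks)
print(repr(captured))             # the text ends with "\n", a UNIX-style line ending
```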

As alluded to above, the Pythonscript console seems to use UNIX-style line endings. I found this out in an odd way. If you copy-and-paste from the console to an editing window with Windows line endings, the line-endings on the source text will be changed at the time of the paste to match the destination file format, so all is good. HOWEVER, what I did one time was to paste via the “Clipboard History” window. This action seems to preserve the original UNIX-style line endings at the destination! I was quite confused as to why I had inconsistent line-endings in my document, until I figured it out.

• @Scott-Sumner

Scott, you are absolutely correct, I’ve changed this in startup.py
and for me this is much more convenient than using console.write to
print chars to the console.
Just a side note: the command
`print '\n'.join(dir(editor))`
should be executed in the console itself, and there it works,
but if someone used it in a script, then it would print to the editor unless
you make the change Scott mentioned.

Thx for the info about copy/paste - I do this a lot but luckily I didn’t use the history ;-)

Cheers
Claudia