Regex: select/match the numbers that are repeated most often

Jim Dailey

@Vasile-Caraus
It isn’t entirely clear what you mean by “match”. Do you simply want to know which 4 numbers appear most often or something else altogether?

If you know how to do this in Excel, then if I were you, I would import the document into Excel and do whatever it is you want done.

Otherwise, you should probably use a scripting language (AWK, Perl, Python, etc.). This doesn’t sound like a task that regex is well suited to.

guy038

Hello, Vasile,

I first tried to guess what you wanted to achieve and, after getting random numbers from the Net, I spent a few hours, on and off, working out a method ! Luckily, I found one. With test data from the Random.org site, it lets you find the most frequent integers in a table of at most 10,000 integers, with values between 1 and 9999

On the Random.org site, the random numbers can lie anywhere in the range ±1,000,000,000 but, because the regexes below handle four digits at most, I preferred to limit the range to values between 1 and 9999

As your table of numbers contains 15 rows of 7 columns, the total number of integers, with values between 1 and 50, is 105

So, go, first, to the Random.org site, from the address, below :

https://www.random.org/integers/

I typed ( in red colour ) the following answers :

Generate 105 random integers (maximum 10,000).
Each integer should have a value between 1 and 50 (both inclusive; limits ±1,000,000,000).
Format in 7 column(s).
Note: The numbers are generated left to right !

And I clicked on the Get Numbers button

I got a 15 x 7 table of 105 random integers, below, that I copied/pasted in a new tab, in N++

2	27	7	11	32	6	7
8	45	50	19	37	40	47
21	11	50	46	50	27	49
41	13	36	3	37	29	23
25	22	47	3	37	2	29
8	48	29	46	24	18	9
46	8	24	19	5	22	27
29	26	44	47	22	22	5
22	25	35	47	48	24	3
10	20	28	49	7	24	3
37	27	4	40	44	45	14
4	44	15	43	46	32	7
47	15	11	17	16	42	8
28	44	43	24	17	8	5
32	27	11	1	35	28	29

In that output list, the integers are separated by a single tab character. As I intended to sort these values, I first needed to put them all in a one-column table.

Moreover, it was necessary to use a template with leading zeros, in order to sort these integers correctly later ! So :

A one-digit integer was changed into 000#
A two-digit integer was changed into 00##
A three-digit integer was changed into 0###
A four-digit integer was left as ####

The regex S/R which achieves these two goals was :

SEARCH ^(\d(\d(\d(\d)?)?)?)(?:\t|\R)

REPLACE (?2:0)(?3:0)(?4:0)\1\r\n
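For readers who prefer a script, the effect of this S/R (one number per line, padded to four digits) can be approximated in Python. This is only a sketch, not a translation of the regex: Python's `re` module has no `\R` and no conditional replacement syntax, so it uses `re.findall` and `str.zfill` instead. The function name `to_padded_column` is hypothetical.

```python
import re

def to_padded_column(text, width=4):
    """Return every integer in `text` on its own line, zero-padded to `width`.

    Any separator works (tabs, spaces, line breaks), because the digits
    are picked out directly instead of matching the separators.
    """
    numbers = re.findall(r"\d+", text)
    if not numbers:
        return ""
    too_long = [n for n in numbers if len(n) > width]
    if too_long:
        raise ValueError("numbers longer than %d digits: %r" % (width, too_long))
    return "\r\n".join(n.zfill(width) for n in numbers) + "\r\n"

# First two rows of the table above, tab-separated
sample = "2\t27\t7\t11\t32\t6\t7\r\n8\t45\t50\t19\t37\t40\t47\r\n"
print(to_padded_column(sample))
```

Because the separators are never matched, the same function should also cope with space-separated input.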

After clicking ONCE on the Replace All button, I got a one-column list of 105 four-digit integers ( 0002, 0027, 0007 and so on )

Then, using the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending, I sorted that list
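Why the leading zeros matter for this sort can be checked with a minimal Python sketch (the helper name `pad` is hypothetical). A lexicographic sort compares character by character, so unpadded numbers do not come out in numeric order, while padded ones do:

```python
def pad(number_text, width=4):
    """Left-pad a string of digits with zeros up to `width` characters."""
    return number_text.zfill(width)

raw = ["2", "27", "7", "11", "32", "6", "7"]   # first row of the table above

# Lexicographic sort of the raw strings: "11" sorts before "2"
unpadded_sorted = sorted(raw)

# Lexicographic sort of the padded strings agrees with numeric order
padded_sorted = sorted(pad(n) for n in raw)

print(unpadded_sorted)
print(padded_sorted)
```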

Then, I found a regex to put all identical numbers on a single line. For instance, the four numbers 0003, on four consecutive lines, became, after replacement, the single line 0003 0003 0003 0003. So :

SEARCH (\d{4})\R\1

REPLACE \1 \1 , with a space character, between the two back-references, \1

IMPORTANT : You must click TWICE on the Replace All button, in order to complete this S/R

REMARK :

If each number occurs ONCE or TWICE only, in the current random list, you may already get the message Replace All: 0 occurrences were replaced, when clicking a second time on the Replace All button !
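Why two clicks are enough can be checked outside Notepad++. The sketch below imitates Replace All with Python's `re.sub`, which also replaces non-overlapping matches from left to right. It assumes `\n` as a stand-in for `\R`, and the helper name `replace_all_once` is hypothetical. The first pass joins the lines in pairs, the second pass joins the pairs, and a third pass finds nothing left to do.

```python
import re

PAIR = re.compile(r"(\d{4})\n\1")

def replace_all_once(text):
    """One 'Replace All' click: join each matched pair of equal numbers."""
    return PAIR.sub(r"\1 \1", text)

five_copies = "0003\n0003\n0003\n0003\n0003"
first_click = replace_all_once(five_copies)
second_click = replace_all_once(first_click)
third_click = replace_all_once(second_click)

print(repr(first_click))
print(repr(second_click))
```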

Thus, after TWO clicks on the Replace All button, that list was changed into this new one, below :

0001
0002 0002
0003 0003 0003 0003
0004 0004
0005 0005 0005
0006
0007 0007 0007 0007
0008 0008 0008 0008 0008
0009
0010
0011 0011 0011 0011
0013
0014
0015 0015
0016
0017 0017
0018
0019 0019
0020
0021
0022 0022 0022 0022 0022
0023
0024 0024 0024 0024 0024
0025 0025
0026
0027 0027 0027 0027 0027
0028 0028 0028
0029 0029 0029 0029 0029
0032 0032 0032
0035 0035
0036
0037 0037 0037 0037
0040 0040
0041
0042
0043 0043
0044 0044 0044 0044
0045 0045
0046 0046 0046 0046
0047 0047 0047 0047 0047
0048 0048
0049 0049
0050 0050 0050
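The sort and the grouping can also be done in one step with a short script. A minimal sketch using `itertools.groupby` from the Python standard library; the function name `group_equal_lines` is hypothetical and the input is assumed to be the padded numbers, one per list item:

```python
from itertools import groupby

def group_equal_lines(padded_numbers):
    """Sort the padded numbers and put all equal ones on a single line."""
    lines = []
    for _value, run in groupby(sorted(padded_numbers)):
        lines.append(" ".join(run))
    return lines

# First row of the table, already padded
sample = ["0002", "0027", "0007", "0011", "0032", "0006", "0007"]
for line in group_equal_lines(sample):
    print(line)
```

`groupby` only merges neighbouring equal items, which is why the list is sorted first, just as in the manual steps.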

Finally, I had to get rid of all the numbers which were present fewer than four times ! Indeed, only the integers repeated at least four times in that list seemed useful. The suitable S/R to do so is :

SEARCH ^(?!(\d{4})( \1){3}).+\R

REPLACE EMPTY

NOTE :

The general regex ^(?!(\d{4})( \1){N}).+\R deletes all the lines where the current number is present between 1 and N times. So :
- If N = 1, every number, present ONCE, in the list, will be deleted
- If N = 2, every number, present ONCE or TWICE, in the list, will be deleted
- If N = 3, every number, present ONCE, TWICE or THREE times, in the list, will be deleted
- If N = 4, every number, present, between ONCE and FOUR times, in the list, will be deleted
- And so on…
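The effect of N can be tried on a small sample with Python's `re` module. This is a sketch under two assumptions: `\n` replaces `\R`, and every line, including the last one, ends with a line break (otherwise `.+\n` cannot match the last line). The function name `drop_rare` is hypothetical.

```python
import re

def drop_rare(grouped_text, n):
    """Delete every line whose number is present n times or fewer."""
    pattern = r"^(?!(\d{4})( \1){%d}).+\n" % n
    return re.sub(pattern, "", grouped_text, flags=re.MULTILINE)

grouped = (
    "0001\n"
    "0003 0003 0003 0003\n"
    "0005 0005 0005\n"
    "0008 0008 0008 0008 0008\n"
)
print(drop_rare(grouped, 3))   # keeps the lines with at least four copies
```

The negative look-ahead fails, and the line is therefore kept, only when the first number is followed by at least N further copies of itself.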

After clicking ONCE, on the Replace All button, I got the final text, below :

0003 0003 0003 0003
0007 0007 0007 0007
0008 0008 0008 0008 0008
0011 0011 0011 0011
0022 0022 0022 0022 0022
0024 0024 0024 0024 0024
0027 0027 0027 0027 0027
0029 0029 0029 0029 0029
0037 0037 0037 0037
0044 0044 0044 0044
0046 0046 0046 0046
0047 0047 0047 0047 0047

Finally, from this text, it’s easy to see that the most frequent numbers, in that random list of 105 numbers, are the six integers 8, 22, 24, 27, 29 and 47, which are each present five times :-))
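Reading off the longest lines by eye works for a short list. As a cross-check, the same kind of answer can be computed with `collections.Counter`. A minimal sketch (the function name `most_frequent` is hypothetical) that returns every value tied for the highest count, shown here on the first three rows of the table only:

```python
from collections import Counter

def most_frequent(values):
    """Return (top_count, sorted list of all values that reach it)."""
    counts = Counter(values)
    if not counts:
        return 0, []
    top_count = max(counts.values())
    winners = sorted(v for v, c in counts.items() if c == top_count)
    return top_count, winners

rows = """2 27 7 11 32 6 7
8 45 50 19 37 40 47
21 11 50 46 50 27 49"""
values = [int(token) for token in rows.split()]
print(most_frequent(values))
```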

A second example :

I will not give details about it. I’ll just give the original random list of integers and the final list of the most frequent integers found

Let’s suppose a list of 300 integers, with values from 1 to 150, placed in 15 rows of 20 columns, each, below :

56	142	24	68	122	132	35	127	56	29	119	97	3	143	21	72	138	109	18	124
51	42	144	5	100	39	60	12	101	94	16	118	108	61	29	125	150	67	60	57
22	82	148	9	29	111	138	123	108	130	47	1	141	75	107	124	58	24	47	46
121	78	107	51	92	21	114	75	105	62	114	7	89	77	63	39	21	131	126	107
50	13	85	26	33	103	112	74	122	62	11	86	22	90	53	143	74	122	26	109
96	128	148	85	3	18	88	132	90	86	150	118	80	20	41	147	91	6	3	45
143	139	145	52	150	111	132	73	86	30	125	28	66	24	61	41	76	108	16	51
138	78	50	52	125	88	11	145	13	25	111	15	103	124	94	2	1	80	74	6
58	14	78	6	27	39	75	117	69	98	53	1	71	11	60	15	21	115	129	2
10	147	8	45	20	90	41	29	3	101	44	116	52	39	141	132	102	33	57	110
21	43	16	33	51	59	78	116	116	23	50	18	114	106	8	93	96	25	6	71
6	31	58	49	114	91	17	9	30	99	113	137	16	131	29	102	40	133	34	147
98	7	81	127	136	132	126	69	48	5	54	128	94	85	11	134	71	92	108	37
54	121	118	65	124	58	122	130	67	77	26	65	136	14	149	146	117	54	60	20
147	103	28	129	32	94	139	111	122	74	146	86	83	100	75	100	48	48	99	112

At the end, after the third regex S/R , you should get the final text, below :

0003 0003 0003 0003
0006 0006 0006 0006 0006
0011 0011 0011 0011
0016 0016 0016 0016
0021 0021 0021 0021 0021
0029 0029 0029 0029 0029
0039 0039 0039 0039
0051 0051 0051 0051
0058 0058 0058 0058
0060 0060 0060 0060
0074 0074 0074 0074
0075 0075 0075 0075
0078 0078 0078 0078
0086 0086 0086 0086
0094 0094 0094 0094
0108 0108 0108 0108
0111 0111 0111 0111
0114 0114 0114 0114
0122 0122 0122 0122 0122
0124 0124 0124 0124
0132 0132 0132 0132 0132
0147 0147 0147 0147

Now, it is not difficult to see that the most frequent numbers, in that random list of 300 numbers, between 1 and 150, are the five integers 6, 21, 29, 122 and 132, which are each present five times :-))

A third example ( without explanations, just try ! )

Let’s suppose a list of 100 integers, with values from 1 to 999, placed in 10 rows of 10 columns, each, below :

591	132	551	647	337	570	610	427	281	868
266	424	760	306	46	262	239	178	11	752
236	97	50	415	237	198	444	63	77	602
189	562	36	334	822	704	759	242	651	306
39	998	172	606	973	846	854	687	759	304
865	50	5	583	685	888	510	468	742	144
612	948	538	802	531	657	300	779	817	392
227	231	984	466	670	203	852	879	164	775
362	211	981	675	889	273	86	184	485	643
180	390	690	292	906	902	245	933	679	931

The last S/R is, even, useless, because the numbers are, mostly, present ONCE, only !

=> The most frequent numbers, in that random list of 100 numbers, between 1 and 999, are the three integers 50, 306 and 759, which are present two times !

A final example :

Let’s suppose a list of 1000 integers, with values from 1 to 30, placed in 50 rows of 20 columns, each, below :

14	3	10	12	28	16	19	10	3	25	2	14	8	8	27	8	1	20	27	13
25	30	5	13	25	8	9	29	4	7	19	7	13	18	18	23	25	8	15	4
7	17	15	27	17	1	19	12	5	22	7	18	2	20	11	6	22	26	2	20
22	20	8	27	26	26	6	29	19	22	17	12	22	7	27	1	16	24	3	29
26	7	9	16	2	8	3	11	5	17	4	20	2	5	16	11	17	7	2	1
15	20	11	11	5	11	18	24	3	10	2	30	29	23	17	21	14	12	5	11
27	10	16	2	15	22	26	8	12	21	18	16	4	2	5	27	18	28	17	3
10	2	27	4	20	19	14	11	18	16	29	2	11	7	1	29	29	6	18	26
26	10	30	21	6	10	7	6	30	27	2	5	25	25	22	24	17	8	16	21
13	27	16	19	16	21	28	23	30	24	12	24	5	30	14	5	21	2	22	11
20	2	19	21	29	23	21	8	21	15	26	22	28	22	13	27	1	6	14	7
11	20	3	17	9	4	9	5	7	18	21	20	11	14	21	22	6	29	22	21
21	25	7	20	28	18	1	30	4	25	28	10	24	23	8	9	17	24	6	11
21	10	28	24	1	24	29	8	7	28	1	14	10	23	14	12	28	30	21	11
13	11	3	18	30	15	2	13	29	14	22	17	30	16	17	9	24	8	11	23
29	7	21	3	25	23	17	28	25	30	26	19	25	29	6	15	20	9	30	17
23	26	30	16	5	21	22	13	24	24	16	27	24	5	1	28	25	26	21	11
9	5	3	23	19	3	7	30	3	9	25	29	12	3	14	19	23	25	26	20
6	9	14	15	12	27	2	2	27	28	23	25	13	1	13	16	24	10	28	6
5	8	5	6	24	20	22	15	9	6	19	26	27	15	15	21	12	24	27	9
22	5	18	18	23	25	20	7	9	7	21	21	24	19	21	1	7	14	20	8
5	7	23	3	26	10	8	27	26	3	5	2	27	15	29	2	28	18	5	19
19	18	14	26	15	23	2	18	4	7	5	30	5	9	8	17	27	2	24	21
21	27	11	25	20	5	28	4	26	3	9	13	4	22	26	4	30	9	13	14
24	29	11	6	26	20	30	1	2	11	2	7	20	10	3	26	4	3	4	27
26	30	4	9	13	9	15	28	23	1	10	1	3	30	27	29	4	28	11	8
3	1	27	23	30	30	6	14	15	28	7	29	24	8	23	8	4	15	24	10
17	18	27	19	17	29	25	7	5	8	21	22	24	8	15	16	10	29	7	12
1	18	19	3	22	1	13	16	26	27	4	3	16	30	7	13	14	8	28	4
17	10	8	11	6	8	13	13	27	19	14	21	28	26	26	20	26	5	30	14
22	23	9	28	11	21	12	3	11	7	26	16	14	4	20	24	15	12	13	4
12	24	8	9	25	1	29	5	24	24	13	1	5	26	14	19	12	27	19	17
12	14	7	6	3	26	24	11	19	1	1	2	3	13	19	8	18	14	3	13
29	25	14	30	12	22	14	14	20	12	2	2	13	26	7	28	12	26	2	13
13	23	22	6	11	1	25	23	12	18	24	1	10	17	23	4	28	14	6	13
27	7	25	2	25	27	12	14	10	7	8	9	19	1	19	14	10	29	17	5
9	8	30	12	25	16	3	14	26	30	7	27	2	15	3	28	4	11	6	2
28	13	3	14	15	18	22	11	18	30	19	6	24	30	22	14	8	29	2	13
27	2	1	8	23	24	5	1	1	24	23	17	6	25	17	2	16	26	19	13
18	22	21	27	10	13	7	27	4	8	30	15	11	3	27	26	22	22	5	17
14	28	27	14	11	2	14	8	26	4	2	28	4	25	29	10	16	23	6	10
21	23	4	19	25	13	4	26	8	3	27	2	19	2	30	8	25	1	1	4
8	15	19	19	25	4	7	7	21	13	24	21	26	13	14	22	6	9	10	26
7	29	25	17	11	4	8	30	26	6	5	8	23	16	13	23	17	2	21	4
24	4	13	25	12	12	13	16	19	11	19	11	30	6	19	7	12	10	18	14
1	7	20	19	28	1	28	6	7	9	21	7	11	9	10	7	1	16	27	20
27	16	30	21	23	25	25	5	22	13	15	27	26	22	4	28	13	25	18	29
7	5	25	19	28	19	20	18	10	1	30	24	13	13	29	16	8	8	15	25
7	20	12	18	9	9	17	13	19	18	29	9	14	3	20	29	28	18	21	19
18	21	4	15	20	7	20	24	6	27	3	10	27	14	15	7	4	22	7	17

For the last S/R, I chose N = 38, because there are, only, 30 possible values and most numbers are, therefore, present, very often !

Hence, the last regex S/R is :

SEARCH ^(?!(\d{4})( \1){38}).+\R

REPLACE EMPTY

=> The most frequent numbers, in that random list of 1000 numbers, between 1 and 30, are the six integers, below :

7 ( present 45 times ), 8 and 13 ( present 40 times ), 14 and 26 ( present 39 times ) and 27 ( present 41 times ) !
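Choosing N by judgment works, but a value can also be derived from the counts. A sketch, assuming you want at most `keep` distinct numbers to survive the last S/R; the function name `suggest_n` and the sample data are hypothetical. It returns the count of the next most frequent value, so that everything at or below that count is deleted. With ties, fewer than `keep` numbers may remain.

```python
from collections import Counter

def suggest_n(values, keep):
    """Suggest N for the last S/R so that at most `keep` distinct values remain.

    Returns 0 (nothing gets deleted) when there are no more than `keep`
    distinct values in the first place.
    """
    if keep < 0:
        raise ValueError("keep must not be negative")
    counts = sorted(Counter(values).values(), reverse=True)
    if keep >= len(counts):
        return 0
    return counts[keep]

values = [7, 7, 7, 8, 8, 8, 13, 13, 14]   # illustrative data, not taken from the lists above
print(suggest_n(values, 2))
```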

Best Regards,

guy038

Vasile Caraus

Hello guy038. I must say… I never thought of this method.

But, you are the best.

Thanks A LOT ! WORKS !

Vasile Caraus

BUT the only problem is that it works on your examples, not on mine.

Can the \R in your regular expressions be replaced with something else?

Vasile Caraus

@guy038 said:

SEARCH ^(\d(\d(\d(\d)?)?)?)(?:\t|\R)
REPLACE (?2:0)(?3:0)(?4:0)\1\r\n

This regex of yours, ^(\d(\d(\d(\d)?)?)?)(?:\t|\R), doesn’t work for me. It is the first one and the most important. The other regex works fine.

But I found another way to do this. Suppose I have:

17 25 30 37 38 47
2 6 7 17 30 42
3 17 20 38 44 45
4 5 6 30 36 42

Search: (Leave a single space)
Replace by: \r

then

Search: ^(a*) (this matches at the beginning of each line)
Replace by: 00

and I will get something like this:

0017
0025
0030
0037
0038
0047
002
006
007
0017
0030
0042
003
0017
0020
0038
0044
0045
004
005
006
0030
0036
0042

Vasile Caraus

@guy038 said:

SEARCH (\d{4})\R\1

REPLACE \1 \1 , with a space character, between the two back-references, \1

This one, (\d{4})\R\1, is not working for me either, even though I pressed the “Replace All” button many times.

Claudia Frank

@Vasile-Caraus

I know you are a regex fan, but just to give you an idea of what a Python script
to solve such a problem would look like:

from collections import Counter

# 'editor' and 'console' are objects provided by the Python Script plugin
numbers = editor.getText().split()                   # split on any whitespace, which also drops empty items
counted_list = Counter(numbers)                      # map each number to how often it occurs
for item in counted_list.most_common(4):             # iterate over the top 4 (number, count) pairs
    console.write('{}\n'.format(item))               # and print each pair to the console

I used the list of 1000 integers @guy038 posted.
The result in the console would be

('7', 45)
('27', 41)
('8', 40)
('13', 40)

Meaning that number 7 occurred 45 times
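The same counting also works outside Notepad++. A sketch of a standalone variant, assuming the numbers are available as a plain text string (for example read from a file); it uses no plugin objects, and the function name `top_numbers` is hypothetical:

```python
from collections import Counter

def top_numbers(text, how_many=4):
    """Return the `how_many` most common whitespace-separated tokens
    in `text` as a list of (token, count) pairs."""
    return Counter(text.split()).most_common(how_many)

# First three rows of the 105-integer table, tab-separated
sample = (
    "2\t27\t7\t11\t32\t6\t7\r\n"
    "8\t45\t50\t19\t37\t40\t47\r\n"
    "21\t11\t50\t46\t50\t27\t49\r\n"
)
for token, count in top_numbers(sample, 1):
    print("{} occurred {} times".format(token, count))
```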

Cheers
Claudia

Vasile Caraus

@Claudia-Frank said:

an idea how a python script

Hello Claudia, I don’t know Python, so I really don’t know what to do with the Python script you wrote above.

guy038

Hello Claudia,

I’ve just tested your Python solution, changing it to show the six most common numbers with the counted_list.most_common(6) expression, and it returned all the numbers that I had previously found for the list of 1000 random integers :-)

How elegant a Python ( or Lua, I suppose ) script is, compared to my complicated regex cooking !!!

Cheers,

guy038

Vasile Caraus

Claudia and guy038, please tell me how to use this python script !

Vasile Caraus

a short tutorial for this example will be great !

Claudia Frank

@Vasile-Caraus

What needs to be done first is described here.

Just in case you haven’t installed the Python Script plugin yet, I would suggest using the MSI package instead of the plugin manager.

Short version: once the Python Script plugin has been installed, go to
Plugins->Python Script->New Script
give it a name and press save.
A new empty editor should appear.
Copy the content into it and save it.
Do NOT reformat the code, as Python is strict about whitespace.

Open the python script console by clicking on
Plugins->Python Script->Show Console

Open your file with the numbers and run the script by clicking on
Plugins->Python Script->Scripts->NAME_OF_YOUR_SCRIPT
Cheers
Claudia

Vasile Caraus

WORKS GREAT Claudia.

Thanks a lot !

Vasile Caraus

by the way, Claudia, how can I use Python (like your script) to actually modify the .txt file. Because, for now, Python only show in the console the results of some function from the script. But how can I use Python script to search and replace something in the .txt files?

Claudia Frank

@Vasile-Caraus

If you want to dive into Python, the first thing, of course, is to get some basic knowledge of the language itself.
Either use one of the YouTube videos or, if you prefer to read, https://www.python.org/about/gettingstarted/.
Note, the plugin uses Python 2, NOT 3 (there are differences, nothing too critical, but they can be confusing
if you start learning the language and try to do something which works in py3 but not in py2).

Next, the help pages which come with the plugin itself.
Plugins->Python Script->Context-Help

And last but not least, Scintilla’s help at http://www.scintilla.org/ScintillaDoc.html to get a better
understanding of how the editor works.

The console is a good place to test things first.
To list all functions and attributes of a Python object, you can use the dir command.
So, if you do the following in the console, you will get the list of functions of this object

dir(editor)

I prefer not to have to scroll sideways, so I use

print '\n'.join(dir(editor))

In order to see what the parameters of a function are use the help command like

help(editor.insertText)
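The dir output is long, so filtering it by name can help when looking for a particular function. A small sketch in plain Python; the helper `find_members` is hypothetical, and `str` is used as the example object because the plugin's `editor` object only exists inside Notepad++:

```python
def find_members(obj, fragment):
    """Return the names from dir(obj) that contain `fragment`, ignoring case."""
    fragment = fragment.lower()
    return [name for name in dir(obj) if fragment in name.lower()]

print('\n'.join(find_members(str, 'replace')))
```

In the plugin console the same call could be tried with `editor` in place of `str`.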

Next, if you search the forum, you will find many scripts that solve particular issues.
One of my first posts answered a question about unit conversion
https://notepad-plus-plus.org/community/topic/10966/unit-conversion-plugin/13

And finally, ask here if you have a specific question.
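For the search-and-replace part of the question, one possible starting point is plain Python working on the file itself, outside the plugin. A minimal sketch, assuming a UTF-8 text file; the function name `replace_in_file` and the file name in the usage line are hypothetical:

```python
import io
import re

def replace_in_file(path, pattern, replacement, encoding="utf-8"):
    """Apply a regex replacement to a whole file and write it back.

    Returns the number of replacements made. newline="" leaves the
    file's existing line endings untouched on both read and write.
    """
    with io.open(path, "r", encoding=encoding, newline="") as handle:
        text = handle.read()
    new_text, count = re.subn(pattern, replacement, text)
    if count:
        with io.open(path, "w", encoding=encoding, newline="") as handle:
            handle.write(new_text)
    return count

# Usage (hypothetical file name): turn every tab into a single space
# replace_in_file("numbers.txt", r"\t", " ")
```

The file is only rewritten when something actually matched, so a pattern with no matches leaves it as it was.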

Cheers
Claudia

Ahh… I would suggest making the following change in Notepad++:
Settings->Preferences->Language, check “replace by space”, because
Python doesn’t like it if you mix tabs and spaces for indentation.

Scott Sumner

@Claudia-Frank

Regarding print '\n'.join(dir(editor))

I don’t think that ‘print’ outputs to the Pythonscript console window by default.

From the following in the original startup.py:

# This sets the stdout to be the currently active document, so print "hello world",
# will insert "hello world" at the current cursor position of the current document
sys.stdout = editor

This is of dubious value, especially since a ‘print’ used in this way inserts the text specified plus a UNIX-style line ending into your current file (which likely has Windows-style line endings!).

I, and likely also Claudia, have changed this line in startup.py to be:

sys.stdout = console

thus changing ‘print’ statements to output their data to the Pythonscript console (great for debugging your scripts!)
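The mechanism behind that one-line change is that print writes to whatever object sys.stdout refers to, as long as that object has a write method. A sketch in plain Python, with a hypothetical `FakeConsole` class standing in for the plugin's `console` object:

```python
import sys

class FakeConsole(object):
    """Stand-in for a console: collects everything written to it."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

    def flush(self):
        pass

fake_console = FakeConsole()
original_stdout = sys.stdout
sys.stdout = fake_console          # same idea as: sys.stdout = console
try:
    print("hello world")           # goes to fake_console, not to the terminal
finally:
    sys.stdout = original_stdout   # always restore the real stdout

captured = "".join(fake_console.chunks)
```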

As alluded to above, the Pythonscript console seems to use UNIX-style line endings. I found this out in an odd way. If you copy-and-paste from the console to an editing window with Windows line endings, the line-endings on the source text will be changed at the time of the paste to match the destination file format, so all is good. HOWEVER, what I did one time was to paste via the “Clipboard History” window. This action seems to preserve the original UNIX-style line endings at the destination! I was quite confused as to why I had inconsistent line-endings in my document, until I figured it out.
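Mixed line endings like the ones described here can be detected and repaired with a few lines of Python. A sketch with hypothetical function names; it treats `\r\n`, a lone `\r` and a lone `\n` as line breaks:

```python
import re

EOL = re.compile(r"\r\n|\r|\n")

def eol_kinds(text):
    """Return the set of line-ending styles that occur in `text`."""
    return set(EOL.findall(text))

def normalize_eol(text, eol="\r\n"):
    """Rewrite every line ending in `text` to `eol` (Windows style by default)."""
    # A function is used as the replacement so `eol` is inserted literally
    return EOL.sub(lambda match: eol, text)

mixed = "first\r\nsecond\nthird\r\n"
print(sorted(eol_kinds(mixed)))
print(repr(normalize_eol(mixed)))
```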

Claudia Frank

@Scott-Sumner

Scott, you are absolutely correct, I’ve changed this in startup.py
and for me this is much more convenient than using console.write to
print chars to the console.
Just a side note, the command
print '\n'.join(dir(editor))
should have been executed in the console itself, and there it works,
but if someone used it in a script, then it would print to the editor unless
you make the change Scott mentioned.

Thx for the info about copy/paste - I do this a lot but luckily I didn’t use the history ;-)

Cheers
Claudia