Regex: select/match the numbers that are repeated most often



  • Hello, Vasile,

    I first tried to guess what you wanted to achieve and, after getting random numbers from the Net, I spent some hours, on and off, imagining a method ! Luckily, I found a solution, using random numbers from the Random.org site, that finds the most frequent integers in a table of at most 10,000 integers, with values between 1 and 9999

    On the Random.org site, the random numbers can lie anywhere in the range ±1,000,000,000 but, because of the regexes involved, I preferred to limit this range to values between 1 and 9999

    As your table of numbers contains 15 rows of 7 columns, it holds 105 integers in total, with values between 1 and 50

    So, go, first, to the Random.org site, from the address, below :

    https://www.random.org/integers/

    I typed ( in red colour ) the following answers :

    • Generate 105 random integers (maximum 10,000).

    • Each integer should have a value between 1 and 50 (both inclusive; limits ±1,000,000,000).

    • Format in 7 column(s).

    • Note: The numbers are generated left to right !

    And I clicked on the Get Numbers button


    I got a 15 x 7 table of 105 random integers, below, that I copied/pasted in a new tab, in N++

    2	27	7	11	32	6	7
    8	45	50	19	37	40	47
    21	11	50	46	50	27	49
    41	13	36	3	37	29	23
    25	22	47	3	37	2	29
    8	48	29	46	24	18	9
    46	8	24	19	5	22	27
    29	26	44	47	22	22	5
    22	25	35	47	48	24	3
    10	20	28	49	7	24	3
    37	27	4	40	44	45	14
    4	44	15	43	46	32	7
    47	15	11	17	16	42	8
    28	44	43	24	17	8	5
    32	27	11	1	35	28	29
    

    In that output list, the integers are separated by a single tab character. As I intended to sort these values, I first needed to put them all in a one-column table.

    Moreover, it was necessary to use a template, with possible leading zeros, in order to sort, later, these integers, correctly ! So :

    • A one-digit integer was changed into the integer 000#
    • A two-digit integer was changed into the integer 00##
    • A three-digit integer was changed into the integer 0###
    • A four-digit integer was left as the integer ####

    The regex S/R which achieves these two goals is :

    SEARCH ^(\d(\d(\d(\d)?)?)?)(?:\t|\R)

    REPLACE (?2:0)(?3:0)(?4:0)\1\r\n

    After clicking, ONCE, on the Replace All button, I got the list, of 105 integers, below :

    0002
    0027
    0007
    0011
    0032
    0006
    0007
    0008
    0045
    0050
    0019
    0037
    0040
    0047
    0021
    0011
    0050
    0046
    0050
    0027
    0049
    0041
    0013
    0036
    0003
    0037
    0029
    0023
    0025
    0022
    0047
    0003
    0037
    0002
    0029
    0008
    0048
    0029
    0046
    0024
    0018
    0009
    0046
    0008
    0024
    0019
    0005
    0022
    0027
    0029
    0026
    0044
    0047
    0022
    0022
    0005
    0022
    0025
    0035
    0047
    0048
    0024
    0003
    0010
    0020
    0028
    0049
    0007
    0024
    0003
    0037
    0027
    0004
    0040
    0044
    0045
    0014
    0004
    0044
    0015
    0043
    0046
    0032
    0007
    0047
    0015
    0011
    0017
    0016
    0042
    0008
    0028
    0044
    0043
    0024
    0017
    0008
    0005
    0032
    0027
    0011
    0001
    0035
    0028
    0029
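
    For readers who want to check this first step outside Notepad++, here is a minimal sketch in standalone Python 3 (not PythonScript). It assumes the input is a whitespace-separated table of non-negative integers; the function name to_padded_column and the three-column sample are made up for illustration.

```python
import re

def to_padded_column(table_text, width=4):
    """Return one zero-padded number per line from a whitespace-separated table."""
    numbers = re.findall(r"\d+", table_text)   # every run of digits, in reading order
    return "\n".join(n.zfill(width) for n in numbers)

# Two short tab-separated rows, taken from the start of the table above
sample = "2\t27\t7\n8\t45\t50\n"
padded = to_padded_column(sample)
print(padded)
```

    str.zfill plays the role of the conditional replacement (?2:0)(?3:0)(?4:0): it adds as many leading zeros as needed to reach four characters.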
    

    Using the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending, I obtained the sorted text, below :

    0001
    0002
    0002
    0003
    0003
    0003
    0003
    0004
    0004
    0005
    0005
    0005
    0006
    0007
    0007
    0007
    0007
    0008
    0008
    0008
    0008
    0008
    0009
    0010
    0011
    0011
    0011
    0011
    0013
    0014
    0015
    0015
    0016
    0017
    0017
    0018
    0019
    0019
    0020
    0021
    0022
    0022
    0022
    0022
    0022
    0023
    0024
    0024
    0024
    0024
    0024
    0025
    0025
    0026
    0027
    0027
    0027
    0027
    0027
    0028
    0028
    0028
    0029
    0029
    0029
    0029
    0029
    0032
    0032
    0032
    0035
    0035
    0036
    0037
    0037
    0037
    0037
    0040
    0040
    0041
    0042
    0043
    0043
    0044
    0044
    0044
    0044
    0045
    0045
    0046
    0046
    0046
    0046
    0047
    0047
    0047
    0047
    0047
    0048
    0048
    0049
    0049
    0050
    0050
    0050
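
    Why the leading zeros matter for a lexicographic sort can be seen in a few lines of standalone Python 3. This is only an illustration with made-up sample values, not part of the Notepad++ procedure.

```python
unpadded = ["11", "2", "50", "7"]
padded = ["0011", "0002", "0050", "0007"]

# Strings are compared character by character, so "11" sorts before "2"
print(sorted(unpadded))   # → ['11', '2', '50', '7']

# With a fixed width, character order and numeric order agree
print(sorted(padded))     # → ['0002', '0007', '0011', '0050']
```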
    

    Then, I found a regex to put all identical numbers on a single line. For instance, the four numbers 0003, on four consecutive lines, were displayed, after replacement, on the single line 0003 0003 0003 0003. So :

    SEARCH (\d{4})\R\1

    REPLACE \1 \1 , with a space character, between the two back-references, \1

    IMPORTANT : You must click, TWICE, on the Replace All button, in order to end this S/R

    REMARK :

    • If each number occurs ONCE or TWICE, only, in the current random list, you may, already, get the message : Replace All: 0 occurrences were replaced, while clicking a second time, on the Replace All button !

    Thus, after TWO clicks on the Replace All button, that list was changed into this new one, below :

    0001
    0002 0002
    0003 0003 0003 0003
    0004 0004
    0005 0005 0005
    0006
    0007 0007 0007 0007
    0008 0008 0008 0008 0008
    0009
    0010
    0011 0011 0011 0011
    0013
    0014
    0015 0015
    0016
    0017 0017
    0018
    0019 0019
    0020
    0021
    0022 0022 0022 0022 0022
    0023
    0024 0024 0024 0024 0024
    0025 0025
    0026
    0027 0027 0027 0027 0027
    0028 0028 0028
    0029 0029 0029 0029 0029
    0032 0032 0032
    0035 0035
    0036
    0037 0037 0037 0037
    0040 0040
    0041
    0042
    0043 0043
    0044 0044 0044 0044
    0045 0045
    0046 0046 0046 0046
    0047 0047 0047 0047 0047
    0048 0048
    0049 0049
    0050 0050 0050
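
    The two passes can be reproduced with a minimal standalone Python 3 sketch. Assumptions: Python's re module has no \R, so an explicit line-break alternation stands in for it; the name join_equal_lines and the sample are made up for illustration.

```python
import re

# Same idea as (\d{4})\R\1, with \R spelled out as an explicit alternation
PAIR = re.compile(r"(\d{4})(?:\r\n|\n|\r)\1")

def join_equal_lines(sorted_text):
    """Apply the replacement twice, like two clicks on Replace All."""
    first_pass = PAIR.sub(r"\1 \1", sorted_text)   # joins identical lines pairwise
    second_pass = PAIR.sub(r"\1 \1", first_pass)   # joins the pairs (and any leftover line)
    return second_pass

sample = "0001\n0002\n0002\n0003\n0003\n0003\n0003\n0003\n"
grouped = join_equal_lines(sample)
print(grouped)
```

    After the first pass every line holds one or two numbers, so the second pass can join the end of each line with the start of the next identical one. This is why two clicks are enough in the example above.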
    

    Finally, I had to get rid of all the numbers which were present fewer than four times ! Indeed, only the integers repeated at least four times in that list seemed useful. The suitable S/R to do so is :

    SEARCH ^(?!(\d{4})( \1){3}).+\R

    REPLACE EMPTY

    NOTE :

    • The general regex ^(?!(\d{4})( \1){N}).+\R deletes all the lines where the current number is present between 1 and N times. So :

      • If N = 1, every number, present ONCE, in the list, will be deleted
      • If N = 2, every number, present ONCE or TWICE, in the list, will be deleted
      • If N = 3, every number, present ONCE, TWICE or THREE times, in the list, will be deleted
      • If N = 4, every number, present, between ONCE and FOUR times, in the list, will be deleted
      • And so on…

    After clicking ONCE, on the Replace All button, I got the final text, below :

    0003 0003 0003 0003
    0007 0007 0007 0007
    0008 0008 0008 0008 0008
    0011 0011 0011 0011
    0022 0022 0022 0022 0022
    0024 0024 0024 0024 0024
    0027 0027 0027 0027 0027
    0029 0029 0029 0029 0029
    0037 0037 0037 0037
    0044 0044 0044 0044
    0046 0046 0046 0046
    0047 0047 0047 0047 0047
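
    The effect of N can be tried out with a minimal standalone Python 3 sketch of the same negative look-ahead. Assumptions: an explicit line-break alternation replaces \R, which Python's re module does not support; the name keep_frequent and the sample are made up for illustration.

```python
import re

def keep_frequent(grouped_text, n):
    """Delete every line whose number occurs at most n times (each line needs a trailing line break)."""
    pattern = re.compile(
        r"^(?!(\d{4})( \1){%d}).+(?:\r\n|\n|\r)" % n,
        re.MULTILINE,   # let ^ match at the start of every line
    )
    return pattern.sub("", grouped_text)

sample = "0006\n0007 0007 0007 0007\n0008 0008 0008 0008 0008\n0009\n"
print(keep_frequent(sample, 3))   # keeps numbers present at least 4 times
print(keep_frequent(sample, 4))   # keeps numbers present at least 5 times
```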
    

    Finally, from this text, it’s easy to deduce that the most frequent numbers, in that random list of 105 numbers, are the six integers 8, 22, 24, 27, 29 and 47, which are present five times :-))


    A second example :

    I will not give details about it. I’ll just give the original random list of integers and the final list of the most frequent integers found

    Let’s suppose a list of 300 integers, with values from 1 to 150, placed in 15 rows of 20 columns, each, below :

    56	142	24	68	122	132	35	127	56	29	119	97	3	143	21	72	138	109	18	124
    51	42	144	5	100	39	60	12	101	94	16	118	108	61	29	125	150	67	60	57
    22	82	148	9	29	111	138	123	108	130	47	1	141	75	107	124	58	24	47	46
    121	78	107	51	92	21	114	75	105	62	114	7	89	77	63	39	21	131	126	107
    50	13	85	26	33	103	112	74	122	62	11	86	22	90	53	143	74	122	26	109
    96	128	148	85	3	18	88	132	90	86	150	118	80	20	41	147	91	6	3	45
    143	139	145	52	150	111	132	73	86	30	125	28	66	24	61	41	76	108	16	51
    138	78	50	52	125	88	11	145	13	25	111	15	103	124	94	2	1	80	74	6
    58	14	78	6	27	39	75	117	69	98	53	1	71	11	60	15	21	115	129	2
    10	147	8	45	20	90	41	29	3	101	44	116	52	39	141	132	102	33	57	110
    21	43	16	33	51	59	78	116	116	23	50	18	114	106	8	93	96	25	6	71
    6	31	58	49	114	91	17	9	30	99	113	137	16	131	29	102	40	133	34	147
    98	7	81	127	136	132	126	69	48	5	54	128	94	85	11	134	71	92	108	37
    54	121	118	65	124	58	122	130	67	77	26	65	136	14	149	146	117	54	60	20
    147	103	28	129	32	94	139	111	122	74	146	86	83	100	75	100	48	48	99	112
    

    At the end, after the third regex S/R , you should get the final text, below :

    0003 0003 0003 0003
    0006 0006 0006 0006 0006
    0011 0011 0011 0011
    0016 0016 0016 0016
    0021 0021 0021 0021 0021
    0029 0029 0029 0029 0029
    0039 0039 0039 0039
    0051 0051 0051 0051
    0058 0058 0058 0058
    0060 0060 0060 0060
    0074 0074 0074 0074
    0075 0075 0075 0075
    0078 0078 0078 0078
    0086 0086 0086 0086
    0094 0094 0094 0094
    0108 0108 0108 0108
    0111 0111 0111 0111
    0114 0114 0114 0114
    0122 0122 0122 0122 0122
    0124 0124 0124 0124
    0132 0132 0132 0132 0132
    0147 0147 0147 0147
    

    Now, it is not difficult to see that the most frequent numbers, in that random list of 300 numbers, between 1 and 150, are the five integers 6, 21, 29, 122 and 132, which are present five times :-))


    A third example ( without explanations, just try ! )

    Let’s suppose a list of 100 integers, with values from 1 to 999, placed in 10 rows of 10 columns, each, below :

    591	132	551	647	337	570	610	427	281	868
    266	424	760	306	46	262	239	178	11	752
    236	97	50	415	237	198	444	63	77	602
    189	562	36	334	822	704	759	242	651	306
    39	998	172	606	973	846	854	687	759	304
    865	50	5	583	685	888	510	468	742	144
    612	948	538	802	531	657	300	779	817	392
    227	231	984	466	670	203	852	879	164	775
    362	211	981	675	889	273	86	184	485	643
    180	390	690	292	906	902	245	933	679	931
    

    The last S/R is not even needed, because most of the numbers are present ONCE only !

    => The most frequent numbers, in that random list of 100 numbers, between 1 and 999, are the three integers 50, 306 and 759, which are present two times !


    A final example :

    Let’s suppose a list of 1000 integers, with values from 1 to 30, placed in 50 rows of 20 columns, each, below :

    14	3	10	12	28	16	19	10	3	25	2	14	8	8	27	8	1	20	27	13
    25	30	5	13	25	8	9	29	4	7	19	7	13	18	18	23	25	8	15	4
    7	17	15	27	17	1	19	12	5	22	7	18	2	20	11	6	22	26	2	20
    22	20	8	27	26	26	6	29	19	22	17	12	22	7	27	1	16	24	3	29
    26	7	9	16	2	8	3	11	5	17	4	20	2	5	16	11	17	7	2	1
    15	20	11	11	5	11	18	24	3	10	2	30	29	23	17	21	14	12	5	11
    27	10	16	2	15	22	26	8	12	21	18	16	4	2	5	27	18	28	17	3
    10	2	27	4	20	19	14	11	18	16	29	2	11	7	1	29	29	6	18	26
    26	10	30	21	6	10	7	6	30	27	2	5	25	25	22	24	17	8	16	21
    13	27	16	19	16	21	28	23	30	24	12	24	5	30	14	5	21	2	22	11
    20	2	19	21	29	23	21	8	21	15	26	22	28	22	13	27	1	6	14	7
    11	20	3	17	9	4	9	5	7	18	21	20	11	14	21	22	6	29	22	21
    21	25	7	20	28	18	1	30	4	25	28	10	24	23	8	9	17	24	6	11
    21	10	28	24	1	24	29	8	7	28	1	14	10	23	14	12	28	30	21	11
    13	11	3	18	30	15	2	13	29	14	22	17	30	16	17	9	24	8	11	23
    29	7	21	3	25	23	17	28	25	30	26	19	25	29	6	15	20	9	30	17
    23	26	30	16	5	21	22	13	24	24	16	27	24	5	1	28	25	26	21	11
    9	5	3	23	19	3	7	30	3	9	25	29	12	3	14	19	23	25	26	20
    6	9	14	15	12	27	2	2	27	28	23	25	13	1	13	16	24	10	28	6
    5	8	5	6	24	20	22	15	9	6	19	26	27	15	15	21	12	24	27	9
    22	5	18	18	23	25	20	7	9	7	21	21	24	19	21	1	7	14	20	8
    5	7	23	3	26	10	8	27	26	3	5	2	27	15	29	2	28	18	5	19
    19	18	14	26	15	23	2	18	4	7	5	30	5	9	8	17	27	2	24	21
    21	27	11	25	20	5	28	4	26	3	9	13	4	22	26	4	30	9	13	14
    24	29	11	6	26	20	30	1	2	11	2	7	20	10	3	26	4	3	4	27
    26	30	4	9	13	9	15	28	23	1	10	1	3	30	27	29	4	28	11	8
    3	1	27	23	30	30	6	14	15	28	7	29	24	8	23	8	4	15	24	10
    17	18	27	19	17	29	25	7	5	8	21	22	24	8	15	16	10	29	7	12
    1	18	19	3	22	1	13	16	26	27	4	3	16	30	7	13	14	8	28	4
    17	10	8	11	6	8	13	13	27	19	14	21	28	26	26	20	26	5	30	14
    22	23	9	28	11	21	12	3	11	7	26	16	14	4	20	24	15	12	13	4
    12	24	8	9	25	1	29	5	24	24	13	1	5	26	14	19	12	27	19	17
    12	14	7	6	3	26	24	11	19	1	1	2	3	13	19	8	18	14	3	13
    29	25	14	30	12	22	14	14	20	12	2	2	13	26	7	28	12	26	2	13
    13	23	22	6	11	1	25	23	12	18	24	1	10	17	23	4	28	14	6	13
    27	7	25	2	25	27	12	14	10	7	8	9	19	1	19	14	10	29	17	5
    9	8	30	12	25	16	3	14	26	30	7	27	2	15	3	28	4	11	6	2
    28	13	3	14	15	18	22	11	18	30	19	6	24	30	22	14	8	29	2	13
    27	2	1	8	23	24	5	1	1	24	23	17	6	25	17	2	16	26	19	13
    18	22	21	27	10	13	7	27	4	8	30	15	11	3	27	26	22	22	5	17
    14	28	27	14	11	2	14	8	26	4	2	28	4	25	29	10	16	23	6	10
    21	23	4	19	25	13	4	26	8	3	27	2	19	2	30	8	25	1	1	4
    8	15	19	19	25	4	7	7	21	13	24	21	26	13	14	22	6	9	10	26
    7	29	25	17	11	4	8	30	26	6	5	8	23	16	13	23	17	2	21	4
    24	4	13	25	12	12	13	16	19	11	19	11	30	6	19	7	12	10	18	14
    1	7	20	19	28	1	28	6	7	9	21	7	11	9	10	7	1	16	27	20
    27	16	30	21	23	25	25	5	22	13	15	27	26	22	4	28	13	25	18	29
    7	5	25	19	28	19	20	18	10	1	30	24	13	13	29	16	8	8	15	25
    7	20	12	18	9	9	17	13	19	18	29	9	14	3	20	29	28	18	21	19
    18	21	4	15	20	7	20	24	6	27	3	10	27	14	15	7	4	22	7	17
    

    For the last S/R, I chose N = 38, because there are, only, 30 possible values and most numbers are, therefore, present, very often !

    Hence, the last regex S/R is :

    SEARCH ^(?!(\d{4})( \1){38}).+\R

    REPLACE EMPTY

    => The most frequent numbers, in that random list of 1000 numbers, between 1 and 30, are the six integers, below :

    7 ( present 45 times ), 8 and 13 ( present 40 times ), 14 and 26 ( present 39 times ) and 27 ( present 41 times ) !

    Best Regards,

    guy038



  • Hello guy038. I must say… I never thought of this method.

    But, you are the best.

    Thanks A LOT ! WORKS !



  • BUT the only problem is that it works on your examples, not on mine.

    Can the \R in your regular expressions be replaced with something else?



  • @guy038 said:

    SEARCH ^(\d(\d(\d(\d)?)?)?)(?:\t|\R)
    REPLACE (?2:0)(?3:0)(?4:0)\1\r\n

    This regex of yours, ^(\d(\d(\d(\d)?)?)?)(?:\t|\R), doesn’t work for me. It is the first one and the most important. The other regexes work fine.

    But I found another way to do this. Suppose I have:

    17 25 30 37 38 47
    2 6 7 17 30 42
    3 17 20 38 44 45
    4 5 6 30 36 42

    Search: (a single space)
    Replace by: \r

    then

    Search: ^(a*) (this matches the empty string at the beginning of each line)
    Replace by: 00

    and I will get something like this:

    0017
    0025
    0030
    0037
    0038
    0047
    002
    006
    007
    0017
    0030
    0042
    003
    0017
    0020
    0038
    0044
    0045
    004
    005
    006
    0030
    0036
    0042
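
    One possible reason why a later (\d{4}) pattern finds nothing on such output is that it expects exactly four digits, while the listing above mixes three-digit and four-digit lines. The small standalone Python 3 check below illustrates this. It is only a sketch: the sample lines and the name pair are made up, and \n stands in for \R, which Python's re module does not support.

```python
import re

pair = re.compile(r"(\d{4})\n\1")   # two identical four-digit lines in a row

# Sorted sample with the uneven widths seen in the listing above
uneven = "0017\n0017\n002\n002\n"
# Same values, padded to a uniform width of four digits
even = "\n".join(token.zfill(4) for token in uneven.split()) + "\n"

print(pair.findall(uneven))   # → ['0017']          (the 002 pair is missed)
print(pair.findall(even))     # → ['0017', '0002']  (both pairs are found)
```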



  • @guy038 said:

    SEARCH (\d{4})\R\1

    REPLACE \1 \1 , with a space character, between the two back-references, \1

    This one, (\d{4})\R\1, is not working for me either, even though I pressed the “Replace All” button many times.



  • @Vasile-Caraus

    I know you are a regex fan but just to give you an idea how a python script
    would look like to solve such a problem

    from collections import Counter
    
    words = editor.getText().split()                     # split on any whitespace (spaces, tabs, line breaks); yields no empty strings
    counted_list = Counter(words)                        # Counter maps each number (as a string) to its occurrence count
    for item in counted_list.most_common(4):             # iterate over the 4 most common (number, count) pairs
        console.write('{}\n'.format(item))               # and print each pair to the console
    

    I used the list of 1000 integers @guy038 posted.
    The result in the console would be

    ('7', 45)
    ('27', 41)
    ('8', 40)
    ('13', 40)

    Meaning that number 7 occurred 45 times
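
    most_common(4) stops at a fixed number of entries, so numbers tied on the same count can be cut off. As a variation, here is a minimal sketch in standalone Python 3 (not PythonScript, so it takes a plain string instead of editor.getText()) that returns every number sharing the highest count. The name most_frequent and the sample are made up for illustration.

```python
from collections import Counter

def most_frequent(text):
    """Return (numbers sharing the highest count, that count)."""
    counts = Counter(text.split())   # split on any whitespace
    if not counts:
        return [], 0
    top = max(counts.values())
    winners = sorted((n for n, c in counts.items() if c == top), key=int)
    return winners, top

# First three rows of the 15 x 7 table from the first example
sample = "2 27 7 11 32 6 7\n8 45 50 19 37 40 47\n21 11 50 46 50 27 49\n"
print(most_frequent(sample))   # → (['50'], 3)
```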

    Cheers
    Claudia



  • @Claudia-Frank said:

    an idea how a python script

    Hello Claudia, I don’t know Python, so I really don’t know what to do with the Python script you wrote above.



  • Hello Claudia,

    I’ve just tested your Python solution, asking for the six most common numbers with the counted_list.most_common(6) expression, and it returns exactly the numbers that I had previously found for the list of 1000 random integers :-)

    How elegant a Python ( or Lua, I suppose ) script is, compared to my complicated regex’s cooking !!!

    Cheers,

    guy038



  • Claudia and guy038, please tell me how to use this python script !



  • A short tutorial for this example would be great !



  • @Vasile-Caraus

    What needs to be done first is described here.

    Just in case that you haven’t installed python script plugin yet, I would propose to use the MSI package instead of using the plugin manager.

    Short version: once the Python Script plugin has been installed, go to
    Plugins->Python Script->New Script,
    give it a name and press Save.
    A new empty editor tab should appear.
    Copy the script into it and save it.
    Do NOT reformat the code, as Python is strict about whitespace.

    Open the python script console by clicking on
    Plugins->Python Script->Show Console

    Open your file with the numbers and run the script by clicking on
    Plugins->Python Script->Scripts->NAME_OF_YOUR_SCRIPT
    Cheers
    Claudia



  • WORKS GREAT Claudia.

    Thanks a lot !



  • By the way, Claudia, how can I use Python (like your script) to actually modify the .txt file? For now, the script only shows its results in the console. How can I use a Python script to search and replace something in .txt files?



  • @Vasile-Caraus

    If you want to dive into Python, the first thing, of course, is to get some basic knowledge of the language itself.
    Either use one of the YouTube videos or, if you prefer to read, https://www.python.org/about/gettingstarted/.
    Note that the plugin uses Python 2, NOT 3 (there are differences, nothing too critical, but they can be confusing
    if you are learning the language and try something which works in py3 but not in py2).

    Next the help pages which come with the plugin itself.
    Plugins->Python Script->Context-Help

    And last but not least, Scintilla’s documentation at http://www.scintilla.org/ScintillaDoc.html, to get a better
    understanding of how the editor works.

    The console is a good starting point to test things first.
    In order to get all functions, attributes of a py object you can use the dir command.
    So, if you do the following in the console you will get the list of functions of this object

    dir(editor)
    

    I prefer not to have to scroll sideways, so I use

    print '\n'.join(dir(editor))
    

    In order to see what the parameters of a function are use the help command like

    help(editor.insertText)   
    

    Next, if you search the forum you will find many scripts which solve particular issues.
    One of my first posts answered a question about unit conversion:
    https://notepad-plus-plus.org/community/topic/10966/unit-conversion-plugin/13

    And finally, ask here if you have a specific question.

    Cheers
    Claudia

    Ahh… I would suggest making the following change in Notepad++:
    in Settings->Preferences->Language, check “replace by space”, because
    Python doesn’t like it if you mix tabs and spaces for indentation.
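
    As a starting point for the search-and-replace question, here is a minimal sketch in standalone Python 3, run outside Notepad++ (it will not run unchanged in the plugin's Python 2). The function name replace_in_file, the temporary demo file and the pattern are made up for illustration. Inside the plugin, the editor object is documented as offering its own replace methods (editor.replace and editor.rereplace, as far as I know), so check Context-Help for the exact signatures.

```python
import os
import re
import tempfile

def replace_in_file(path, pattern, replacement):
    """Apply a regex replacement to a text file in place; return the number of substitutions."""
    # newline="" keeps the file's own line endings untouched when reading and writing
    with open(path, "r", encoding="utf-8", newline="") as handle:
        text = handle.read()
    new_text, count = re.subn(pattern, replacement, text)
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(new_text)
    return count

# Demo on a throw-away file: pad every one-digit number to four digits
demo_path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(demo_path, "w", encoding="utf-8", newline="") as handle:
    handle.write("2 27 7\r\n8 45 50\r\n")

changed = replace_in_file(demo_path, r"\b(\d)\b", r"000\1")
print(changed)
```

    Work on a copy of your file first, since the function overwrites the file it is given.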



  • @Claudia-Frank

    Regarding print '\n'.join(dir(editor))

    I don’t think that ‘print’ outputs to the Pythonscript console window by default.

    From the following in the original startup.py:

    # This sets the stdout to be the currently active document, so print "hello world",
    # will insert "hello world" at the current cursor position of the current document
    sys.stdout = editor

    This is of dubious value, especially since a ‘print’ used in this way inserts the text specified plus a UNIX-style line ending into your current file (which likely has Windows-style line endings!).

    I, and likely also Claudia, have changed this line in startup.py to be:

    sys.stdout = console

    thus changing ‘print’ statements to output their data to the Pythonscript console (great for debugging your scripts!)

    As alluded to above, the Pythonscript console seems to use UNIX-style line endings. I found this out in an odd way. If you copy-and-paste from the console to an editing window with Windows line endings, the line-endings on the source text will be changed at the time of the paste to match the destination file format, so all is good. HOWEVER, what I did one time was to paste via the “Clipboard History” window. This action seems to preserve the original UNIX-style line endings at the destination! I was quite confused as to why I had inconsistent line-endings in my document, until I figured it out.



  • @Scott-Sumner

    Scott, you are absolutely correct, I’ve changed this in startup.py
    and for me this is much more convenient than using console.write to
    print to the console.
    Just a side note: the command
    print '\n'.join(dir(editor))
    should be executed in the console itself, and there it works.
    But if someone used it in a script, then it would print to the editor unless
    the change Scott mentioned has been made.

    Thx for the info about copy/paste - I do this a lot but luckily I didn’t use the history ;-)

    Cheers
    Claudia

