    Searching random duplicate numbers/values in Notepad++

      MariusGHub
      last edited by MariusGHub

      Dear reader,

      There’s an awesome application for Football Manager that lets you use AI-generated images for computer-generated players.

      However, this program comes with a bug, and I found a nice workaround for it that generates new faces for an existing config.ini with all the text.

      I have been using this for a while. However, I noticed it also produces duplicate IDs in the text. These IDs are the numbers/values that players, staff, etc. have in the game.
      Now let’s say the code is

      <record from="African/African1642" to="graphics/pictures/person/2001438633/portrait"/>	
      <record from="Scandinavian/Scandinavian1363" to="graphics/pictures/person/2000650194/portrait"/>
      <record from="South American/South American2712" to="graphics/pictures/person/2000677450/portrait"/>
      <record from="African/African384" to="graphics/pictures/person/2000679316/portrait"/>
      <record from="African/African8860" to="graphics/pictures/person/2000679751/portrait"/>
      <record from="Scandinavian/Scandinavian3147" to="graphics/pictures/person/2000679938/portrait"/>
      <record from="Scandinavian/Scandinavian2945" to="graphics/pictures/person/2000680076/portrait"/>
      <record from="Scandinavian/Scandinavian1778" to="graphics/pictures/person/2000680138/portrait"/>
      <record from="Asian/Asian1230" to="graphics/pictures/person/2000683034/portrait"/>
      <record from="Italmed/ItalMed909" to="graphics/pictures/person/2000683836/portrait"/>
      

      For example, 2000686319 is an ID. I now have 1033239 lines, and probably a quarter or fewer of them have duplicate IDs, each with a different “from=”. The game picks whichever line comes first and assigns that generated image to the player, so it just works in-game. However, what I want is to remove the duplicates and tidy up the file, and also help other users tidy up their config.ini.

      Thank you for reading and I hope to learn a lot about this. I’ve been trying for 2 hours now, haha. I am stubborn.

      I tried everything I could think of: regex stuff I found on the internet, the normal Notepad++ stuff to find duplicates, etc. Nothing works.

      Thank you for reading,
      Marius

      PS: If this question is not allowed here or is posted in the wrong subforum, please notify me.

        Mark Olson @MariusGHub
        last edited by Mark Olson

        @MariusGHub
        If you had a smaller file, regular expressions would work for this. Because your file is so big, a regex-based solution would be abysmally slow (in technical terms, because duplicate-finding regexes scale as O(N^2), where N is the length of the file).

        But it’s not hard to make a PythonScript script (follow link for help on installation, running) that would solve your issues.

        # https://community.notepad-plus-plus.org/topic/24942/searching-random-duplicate-numbers-values-in-notepad
        # what the script does: extracts an integer id from a record of the form <record from="Scandinavian/Scandinavian1363" to="graphics/pictures/person/2000650194/portrait"/>
        # and eliminates every line that does not have the first instance of its id.
        # EXAMPLE INPUT:
        # <record from="blah1" to="graphics/pictures/person/2/portrait"/>
        # <record from="blah2" to="graphics/pictures/person/1/portrait"/>
        # <record from="blah3" to="graphics/pictures/person/1/portrait"/>
        # <record from="blah4" to="graphics/pictures/person/3/portrait"/>
        # <record from="blah5" to="graphics/pictures/person/7/portrait"/>
        # <record from="blah6" to="graphics/pictures/person/3/portrait"/>
        # OUTPUT FOR THAT INPUT:
        # <record from="blah1" to="graphics/pictures/person/2/portrait"/>
        # <record from="blah2" to="graphics/pictures/person/1/portrait"/>
        # <record from="blah4" to="graphics/pictures/person/3/portrait"/>
        # <record from="blah5" to="graphics/pictures/person/7/portrait"/>
        
        def getFirstLineForEachDistinctId(text):
            lines = text.split('\r\n')
            first_line_distinct_id = []
            distinct_ids = set()
            for ii, line in enumerate(lines):
                # extract integer if it is followed by '"/portrait"/>' and then end of line
                id_ = re.findall('(\d+)/portrait"/>$', line)[0]
                if id_ not in distinct_ids:
                    distinct_ids.add(id_)
                    first_line_distinct_id.append(line)
            newtext = '\r\n'.join(first_line_distinct_id)
            editor.setText(newtext)
            
        if __name__ == '__main__':
            getFirstLineForEachDistinctId(editor.getText())
        

        I tested this on a file with 1 million lines and 200,000 distinct values and it executed in about a second.

          MariusGHub @Mark Olson
          last edited by

          Thank you so much. I will take a look at it within half an hour because I need to complete a race in an online league.
          Curious how it works! I know Python a bit from back in the days when I was working on Honorbuddy profiles.

            MariusGHub @Mark Olson
            last edited by MariusGHub

            I sadly get this error. I looked up the FAQ a couple of times, and retried it a couple of times too. I assume lines 31 and 23 are the problem.
            Edit: I did read on Stack Overflow that I have to ‘import re’?

                id_ = re.findall('(\d+)/portrait"/>$', line)[0]
            NameError: global name 're' is not defined
            Traceback (most recent call last):
              File "C:\Users\name\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\getFirstLineForEachDistinctId.py", line 31, in <module>
                getFirstLineForEachDistinctId(editor.getText())
              File "C:\Users\name\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\getFirstLineForEachDistinctId.py", line 23, in getFirstLineForEachDistinctId
                id_ = re.findall('(\d+)/portrait"/>$', line)[0]
            NameError: global name 're' is not defined
            

            Another error I got after removing some text that isn’t portrait-line stuff:

                id_ = re.findall('(\d+)/portrait"/>$', line)[0]
            IndexError: list index out of range
            Traceback (most recent call last):
              File "C:\Users\name\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\getFirstLineForEachDistinctId.py", line 31, in <module>
                getFirstLineForEachDistinctId(editor.getText())
              File "C:\Users\name\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\getFirstLineForEachDistinctId.py", line 23, in getFirstLineForEachDistinctId
                id_ = re.findall('(\d+)/portrait"/>$', line)[0]
            IndexError: list index out of range
            

            Looking up on the internet, hehe. So much to understand and it all feels like alien-language a bit. :D

              Mark Olson @MariusGHub
              last edited by Mark Olson

              @MariusGHub
              Sorry about that!
              Here’s a revised version.

              # https://community.notepad-plus-plus.org/topic/24942/searching-random-duplicate-numbers-values-in-notepad
              # what the script does: extracts an integer id from a record of the form <record from="Scandinavian/Scandinavian1363" to="graphics/pictures/person/2000650194/portrait"/>
              # and eliminates every line that does not have the first instance of its id.
              # EXAMPLE INPUT:
              # <record from="blah1" to="graphics/pictures/person/2/portrait"/>
              # <record from="blah2" to="graphics/pictures/person/1/portrait"/>
              # <record from="blah3" to="graphics/pictures/person/1/portrait"/>
              # <record from="blah4" to="graphics/pictures/person/3/portrait"/>
              # <record from="blah5" to="graphics/pictures/person/7/portrait"/>
              # <record from="blah6" to="graphics/pictures/person/3/portrait"/>
              # OUTPUT FOR THAT INPUT:
              # <record from="blah1" to="graphics/pictures/person/2/portrait"/>
              # <record from="blah2" to="graphics/pictures/person/1/portrait"/>
              # <record from="blah4" to="graphics/pictures/person/3/portrait"/>
              # <record from="blah5" to="graphics/pictures/person/7/portrait"/>
              from Npp import *
              import re
              
              def getFirstLineForEachDistinctId(text):
                  lines = text.split('\r\n')
                  first_line_distinct_id = []
                  distinct_ids = set()
                  for ii, line in enumerate(lines):
                      try:
                          id_ = re.findall('(\d+)/portrait"/>$', line)[0]
                          if id_ not in distinct_ids:
                              distinct_ids.add(id_)
                              first_line_distinct_id.append(line)
                      except IndexError:  # findall returned no match for this line
                          first_line_distinct_id.append(line) # keep non-matching lines
                  newtext = '\r\n'.join(first_line_distinct_id)
                  editor.setText(newtext)
                  
              if __name__ == '__main__':
                  getFirstLineForEachDistinctId(editor.getText())
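
For readers who hit the same two tracebacks, here is a minimal sketch of what went wrong. The NameError goes away once `re` is imported, and the IndexError comes from indexing the empty list that `re.findall` returns for a line that has no portrait ID. The helper name `extract_id` and the sample lines are illustrative only; they are not part of the script above.

```python
import re  # without this import, any use of re.findall raises NameError

# Same pattern as the script, written as a raw string
ID_PATTERN = r'(\d+)/portrait"/>$'

def extract_id(line):
    """Return the ID captured from a record line, or None if the line has no ID."""
    matches = re.findall(ID_PATTERN, line)
    # findall returns [] for a non-matching line, so matches[0] would raise IndexError
    return matches[0] if matches else None

record = '<record from="Asian/Asian1230" to="graphics/pictures/person/2000683034/portrait"/>'
print(extract_id(record))    # the ID, as a string
print(extract_id('<list>'))  # None instead of an exception
```

Returning None for non-matching lines is one possible design; the revised script above reaches the same outcome by catching the exception and keeping the line.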
              
                MariusGHub @Mark Olson
                last edited by

                Thank you so very much. It worked within 1 second. I had 170 000 duplicates. I will immediately notify other users about my workaround and then your Python script. I will credit you, of course. I hope it will help lots of Football Manager users tidy up their Newgan config.ini!

                Thank you again so much. Do you accept donations?

                  guy038
                  last edited by

                  Hello, @mariusghub, @mark-olson and All,

                  Of course, Mark, in your script, we can change the line :

                              id_ = re.findall('(\d+)/portrait"/>$', line)[0]
                  

                  with this one, for the same results :

                              id_ = re.findall('/(\d+)/', line)[0]
                  

                  On the other hand, if you want to delete any duplicate line, this simple syntax seems to work :

                              id_ = re.findall('.+', line)[0]
                  

                  Now, I wanted to get some explanations on how to manage the Python flags, in order to restrict the search, but I could not get any on-line help on the re.findall command or even findall. Could you enlighten me, somehow, about it ? TIA !

                  Best Regards,

                  guy038

                    Alan Kilborn @guy038
                    last edited by Alan Kilborn

                    @guy038 said in Searching random duplicate numbers/values in Notepad++:

                    but I could not get any on-line help on the re.findall command

                    Try: https://docs.python.org/3/library/re.html and then press Ctrl+f and search for findall. Or, since you are interested in the flags, maybe search for Flag constants are now instances.

                    But…I’m reluctant to provide this information as this really isn’t on-topic for this forum. Unless…you’re going to launch into some comparison of these flags versus some attribute of the Boost engine. Not sure where you’re going – not a mind reader – but hopefully you will be staying on-topic.

                    Actually…I’d say the script provided isn’t all that on-topic either. Ok, well, it DOES use a couple of editor functions, but really this is a thin veil (to read the entire text, and to write out a revised entire text) just to keep it a PythonScript. Really it is just a plain old Python program. It could just as well have been written in another language outside of Notepad++, and I don’t think we want to get into writing specific-purpose non-Notepad++ related programs for people on this forum. If I were responding to the OP, I’d have said – short and sweet – “Notepad++ can’t help you with this; you’re going to have to do some programming”, and left it at that.

                      MariusGHub @Alan Kilborn
                      last edited by

                      Dear Alan,

                      I am very grateful Mark helped me with this issue. I had already been trying for nearly 6 hours with regex commands. However, Mark pointed out my file was so big I had to use Python. I wouldn’t have had a clue I could use Python and would have kept trying regex until I gave up.

                      I also believe that, with the commands guy038 provided, I learned that the code can be even shorter. And I also learned how this code works and what it does, which will likely help me in the future too. In the end I think this can benefit other future Notepad++ users who have similarly long files and want to look up or delete duplicates.

                      For Football Manager I believe this tidying up even helped with the performance. Which is awesome of course!

                      It’s that we all live in this world and try to help each other. Give that push in the back and then start to learn marvelous things?

                      Kind regards,
                      Marius

                        guy038
                        last edited by

                        Hello, @alan-kilborn, @mark-olson, @mariusghub and All,

                        Alan, thanks for your explanations. If I understand you correctly, this means that the @mark-olson’s Python text is a true Python program, all of which can still be interpreted as a Python Script from within Notepad++ ?

                        This certainly explains why I could not find any documentation on the findall function, using only the Plugins > Python Script > Context-Help option !


                        Now, changing the line :

                                    id_ = re.findall('.+', line)[0]
                        

                        by this one :

                                    id_ = re.findall('.+', line, re.I)[0]
                        

                        on this text :

                        123
                        Test
                        test
                        TEST
                        Test
                        Test
                        test
                        Test
                        789
                        

                        Still wrongly returns :

                        
                        123
                        Test
                        test
                        TEST
                        789
                        

                        Instead of the right result :

                        123
                        Test
                        789
                        

                        Am I missing something obvious ?

                        BR

                        guy038

                          Alan Kilborn @guy038
                          last edited by

                          @guy038 said in Searching random duplicate numbers/values in Notepad++:

                          this means that the @mark-olson’s Python text is a true Python program

                          No, not exactly. Mark’s script uses two editor object functions, which are specific to PythonScript and would generate errors if trying to run as a standalone Python program.

                          But as these editor object functions are merely grabbing all of the text from a Notepad++ tab, and later setting all of the text in that tab, they are stand-ins for the simple file reading/writing operations of standard Python.
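
To illustrate the stand-in point, here is a rough sketch of the same deduplication as a standalone Python 3 program, with ordinary file reading and writing in place of `editor.getText()` and `editor.setText()`. The function names, the UTF-8 encoding, and the tolerance for trailing whitespace after `/>` are assumptions made for this example.

```python
import re

# ID followed by /portrait"/> and optional trailing whitespace or line ending
ID_PATTERN = re.compile(r'(\d+)/portrait"/>\s*$')

def dedupe_lines(lines):
    """Keep the first line for each ID; lines without an ID are kept unchanged."""
    seen = set()
    kept = []
    for line in lines:
        match = ID_PATTERN.search(line)
        if match is None:
            kept.append(line)          # not a record line
        elif match.group(1) not in seen:
            seen.add(match.group(1))   # first time this ID appears
            kept.append(line)
    return kept

def dedupe_file(src_path, dst_path):
    """Read src_path, drop later duplicates, and write the result to dst_path."""
    # newline='' leaves the original line endings untouched on read and write
    with open(src_path, encoding='utf-8', newline='') as src:
        lines = src.readlines()
    with open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        dst.writelines(dedupe_lines(lines))
```

A call such as `dedupe_file('config_in.txt', 'config_out.txt')` (hypothetical file names) would write the deduplicated copy next to the original instead of overwriting it.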

                          I’m not going to harp on it too much, but it would be better to see PythonScript programs presented here focus on innovative tasks that PS can do for you in Notepad++.

                            Alan Kilborn @guy038
                            last edited by

                            @guy038 said in Searching random duplicate numbers/values in Notepad++:

                            Am I missing something obvious ?

                            I’m being dragged kicking and screaming into off-topic land (but of course it is my choice to reply)… :-)

                            I could let Mark reply but this one is too easy:

                            The code in question is searching a single line for all matches, in a loop. Since one line doesn’t know about another, from one call of findall to the next, of course the output can contain duplicates (with and without regard to case).
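
A small sketch of that point: `re.I` only changes how the pattern matches within one line, while the duplicate check compares the returned strings exactly. One possible way to get the result guy038 expected is to normalize the key (here with `lower()`) before looking it up in the set. The function name is made up for illustration.

```python
def dedupe_ignoring_case(lines):
    """Keep the first occurrence of each line, comparing lines case-insensitively."""
    seen = set()
    kept = []
    for line in lines:
        key = line.lower()  # 'Test', 'test' and 'TEST' all map to 'test'
        if key not in seen:
            seen.add(key)
            kept.append(line)  # keep the original spelling of the first occurrence
    return kept

sample = ['123', 'Test', 'test', 'TEST', 'Test', 'Test', 'test', 'Test', '789']
print(dedupe_ignoring_case(sample))  # ['123', 'Test', '789']
```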

                              PeterJones @Mark Olson
                              last edited by PeterJones

                              Trying to bring it back to more PythonScript specifics:

                              lines = text.split('\r\n')
                              first_line_distinct_id = []
                              distinct_ids = set()
                              for ii, line in enumerate(lines):
                              

                              If the function is just iterating through each line, why not use the editor.forEachLine(callback) syntax? It allows running a callback on each line in the file; the callback would then look at the line, and decide whether that line contains the first instance of an ID or not; you could then use editor.deleteLine() as shown in the example in the PythonScript documentation for forEachLine.
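
A minimal sketch of the forEachLine idea, assuming the callback receives the line contents, the line number and the total line count, and that returning 0 after a deletion keeps the iteration on the same line number (this follows the forEachLine example in the PythonScript documentation). The factory function `make_line_callback` is a hypothetical helper, and the `ImportError` fallback is there only so the callback logic can be exercised outside Notepad++.

```python
import re

try:
    from Npp import editor  # only importable inside Notepad++ with PythonScript
except ImportError:
    editor = None           # allows running the callback logic elsewhere

ID_PATTERN = re.compile(r'(\d+)/portrait"/>\s*$')

def make_line_callback(ed):
    """Build a forEachLine callback that deletes lines whose ID was already seen."""
    seen = set()

    def on_line(contents, line_number, total_lines):
        match = ID_PATTERN.search(contents)
        if match is None:
            return None                 # not a record line: keep it and move on
        if match.group(1) in seen:
            ed.deleteLine(line_number)  # later duplicate: remove the whole line
            return 0                    # the next line moved into this slot
        seen.add(match.group(1))        # first occurrence of this ID
        return None

    return on_line

if editor is not None:
    editor.forEachLine(make_line_callback(editor))
```

Deleting lines one at a time may be noticeably slower than rebuilding the text in a single pass on a file with a million lines; that has not been measured here.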

                              Or you could use editor.rereplace(regex, callback, ...), and have the regex find the full line containing the ID (and separate the ID into group 1); the callback can then track whether it has found that particular ID yet: if it hasn’t, mark it as found and return the original string (so the first instance stays intact, because it is replaced with the same text); if it has, return an empty string, so that the duplicate line is replaced with nothing and thereby removed.
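
A rough sketch of the replace-with-a-callback idea. As far as the PythonScript documentation indicates, the method that accepts a function returning the replacement text is `editor.rereplace` (`editor.research` only calls the function for each match). The pattern, the constant name `RECORD_LINE` and the factory `make_replacer` are assumptions for this example, and the `ImportError` fallback exists only so the replacer can be tried outside Notepad++.

```python
try:
    from Npp import editor  # only importable inside Notepad++ with PythonScript
except ImportError:
    editor = None

# One whole record line, including its line ending; group 1 is the ID.
# [^\r\n] is used instead of '.' so the pattern never crosses a line break.
RECORD_LINE = r'^[^\r\n]*/(\d+)/portrait"/>[^\r\n]*(?:\r\n|\r|\n)?'

def make_replacer():
    """Build a replacement function that blanks out lines with an already-seen ID."""
    seen = set()

    def replacer(match):
        record_id = match.group(1)
        if record_id in seen:
            return ''             # duplicate: replace the line with nothing
        seen.add(record_id)
        return match.group(0)     # first occurrence: put the same text back

    return replacer

if editor is not None:
    editor.rereplace(RECORD_LINE, make_replacer())
```

Because the pattern includes the line ending, a removed duplicate leaves no empty line behind, and lines that are not records are never matched, so they stay as they are.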

                              Both of those implementation ideas would at least stay focused on features that are specific to PythonScript & Notepad++, rather than on generic python code.

                                Mark Olson @PeterJones
                                last edited by Mark Olson

                                @PeterJones
                                I actually didn’t know about the editor.forEachLine callback. Had I known, my code could have been simplified.

                                I don’t particularly appreciate people (cf. AlanKilborn) nitpicking my solution, which was perfectly adequate for this problem.

                                But I figured I would also share another script I made for this general task, using an arbitrary regex instead of the specific one I made for this task:

                                # https://community.notepad-plus-plus.org/topic/24942/searching-random-duplicate-numbers-values-in-notepad/14?_=1695248025881
                                # finds all lines that match a certain regex (and have a subset of the line
                                # as capture group 1)
                                # and then:
                                # * removes lines that don't match the user-supplied regex
                                # * removes all lines except the first one for each value of the first capture group
                                # EXAMPLE INPUT:
                                # <record from="blah1" to="graphics/pictures/person/2/portrait"/>
                                # <record from="blah2" to="graphics/pictures/person/1/portrait"/>
                                # <record from="blah3" to="graphics/pictures/person/1/portrait"/>
                                # <record from="blah4" to="graphics/pictures/person/3/portrait"/>
                                # <record from="blah5" to="graphics/pictures/person/7/portrait"/>
                                # <record from="blah6" to="graphics/pictures/person/3/portrait"/>
                                # OUTPUT FOR THAT INPUT (WITH user-supplied regex "(\d+)/portrait"):
                                # <record from="blah1" to="graphics/pictures/person/2/portrait"/>
                                # <record from="blah2" to="graphics/pictures/person/1/portrait"/>
                                # <record from="blah4" to="graphics/pictures/person/3/portrait"/>
                                # <record from="blah5" to="graphics/pictures/person/7/portrait"/>
                                from Npp import *
                                
                                __version__ = '0.0.1'
                                
                                class REMDUPREGMATCH(object):
                                    def __init__(self):
                                        self.first_line_distinct_match = []
                                        self.distinct_matches = set()
                                        self.lines_matched = 0
                                        self.this_script_name = 'remove duplicate regex matches ' + __version__
                                        user_regex_with_group_1 = ''
                                        keep_non_matching_lines = True
                                        self.start_line_count = editor.getLineCount()
                                        def on_empty_line_match(mtch):
                                            self.start_line_count -= 1
                                        editor.research('^$', on_empty_line_match) # don't count empty lines
                                        eol = [ '\r\n', '\r', '\n' ][ editor.getEOLMode() ]
                                        while True:
                                            user_regex_with_group_1 = self.prompt('Enter regex with group 1 being the sort key.', user_regex_with_group_1)
                                            if user_regex_with_group_1 is None: return  # user cancel
                                            if not user_regex_with_group_1.strip():
                                                self.mb('Cannot specify an empty regex!  Try again.')
                                                continue
                                            if '(' not in user_regex_with_group_1:
                                                self.mb('Need to specify capture group 1 (as the sort key) in the regex!  Try again.')
                                                continue
                                            regex_err_msg = self.search_regex_is_invalid_error_msg(user_regex_with_group_1)
                                            if regex_err_msg:
                                                self.mb('Bad regular expression!\r\n\r\n{e}\r\n\r\nTry again.'.format(e=regex_err_msg))
                                                continue
                                            break
                                        # only get FIRST match on each line
                                        # note: lines that do NOT match the regex are not found by this search,
                                        # so they are dropped when the text is rewritten below
                                        user_regex_with_group_1 = '(?-s)^.*?%s.*$' % user_regex_with_group_1
                                        print('user_regex_with_group_1 =', user_regex_with_group_1)
                                        editor.research(user_regex_with_group_1, self.on_match)
                                        newtext = eol.join(self.first_line_distinct_match)
                                        editor.setText(newtext)
                                        if self.lines_matched != self.start_line_count:
                                            self.mb('Warning: %d matches were found but the document originally had %d non-empty lines' % (self.lines_matched, self.start_line_count))
                                
                                    def on_match(self, mtch):
                                        self.lines_matched += 1
                                        # print(mtch.groups())
                                        matchval = mtch.group(1)
                                        if matchval not in self.distinct_matches:
                                            self.distinct_matches.add(matchval)
                                            self.first_line_distinct_match.append(mtch.group(0))
                                
                                    def search_regex_is_invalid_error_msg(self, test_regex):
                                        try:
                                            # test test_regex for validity on a small subset of the document
                                            editor.research(test_regex, lambda _: None, 0, 0, 1000)
                                        except RuntimeError as r:
                                            return str(r)
                                        return ''
                                
                                    def mb(self, msg, flags=0, title=''):  # a message-box function
                                        return notepad.messageBox(msg, title if title else self.this_script_name, flags)
                                
                                    def prompt(self, prompt_text, default_text=''):
                                        if '\n' not in prompt_text: prompt_text = '\r\n' + prompt_text
                                        prompt_text += ':'
                                        return notepad.prompt(prompt_text, self.this_script_name, default_text)
                                
                                
                                if __name__ == '__main__':
                                    REMDUPREGMATCH()
                                
                                  Alan Kilborn @Mark Olson
                                  last edited by Alan Kilborn

                                  @Mark-Olson said in Searching random duplicate numbers/values in Notepad++:

                                  I don’t particularly appreciate people (cf. AlanKilborn) nitpicking my solution, which was perfectly adequate for this problem.

                                  Too bad? We try to keep things “on track” here. If something devolves into “I’ll write what is effectively not a Notepad++ solution” here, it shouldn’t be here.
