    How to bookmark only the first occurrences of multiple search results?

    • Viktoria OntapadoV
      Viktoria Ontapado
      last edited by

      Dear guy038, thank you for your reply.

      “Is it possible that some lines of your text contain more than one keyword?”
      Yes, there are lots of lines containing more than one keyword.

      “What is the approximative proportion”

      Although I wrote that my text contains approx. 100k lines and that I have more than 100 keywords, to be perfectly honest I have multiple text files (each one in a different language), and I hope I can use a regex in each of them.

      The kicker is that these files don’t have the same number of keywords and vary in length.

      So,
      I have a German file with 150k lines and 100 keywords.
      I have a Turkish one with 350k lines and 5000(!) keywords.
      An Italian one with 250k lines and 600 keywords,
      and so on…

      I’m not sure how I can calculate this approximate proportion for these files. Do you want me to search for each keyword in one of the text files? That way I can make a guess for the German file (because it has fewer keywords), but for the Turkish one it would take much more time, and the proportion will be different in every file anyway.

      Is there a way to calculate it without doing it manually? Or, if it’s allowed and possible, I can send you e.g. the Turkish file with all the lines and the 5000 keywords, if you wish and if it helps with the solution.

      I hope I managed to express my observation regarding your second question. What do you suggest?

      Kind Regards and have a nice weekend,
      Viktória

      • guy038G
        guy038
        last edited by guy038

        Hi, @viktoria-ontapado,

        Oh no ! Don’t bother about that proportion value. You gave me some nice additional information, about your files, in your last post !

        But, it’s about midnight, in France, and… I’ll surely have clearer ideas by tomorrow !

        Have a nice week-end, too !

        See you later,

        Cheers,

        guy038

        • guy038G
          guy038
          last edited by guy038

          Hello, @viktoria-ontapado,

          Given your example, below, in a new N++ tab :

          Apples are delicious.
          I like turtles.
          He is tall.
          She is beautiful.
          Go to hell!
          Turtles are smart.
          These are the world’s most beautiful buildings.
          Apples are good for your health.
          The Hungarian flag is a horizontal tricolour of red, white and green.
          Turtles are reptiles.
          You are very clever.
          Hungarian is a difficult language.
          Bananas and apples are usually cheap.
          

          Just add, at the end, your list of keywords to search for ( apple, turtle and Hungarian ), inside the regex template (?i)KeyWord(?s).*, as below :

          (?i)turtle(?s).*
          (?i)apple(?s).*
          (?i)Hungarian(?s).*
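          Why each of these regexes marks one line only: a minimal sketch, using Python's re module as a stand-in for Notepad++'s regex engine (an assumption; the flags are passed as arguments here instead of the inline (?i) and (?s)). Once the keyword is found, the dot-matches-newline .* swallows everything up to the end of the text, so nothing is left for a second match.

```python
import re

# A few of the sample sentences, one per line
text = "\n".join([
    "Apples are delicious.",
    "I like turtles.",
    "He is tall.",
    "Turtles are smart.",
    "Turtles are reptiles.",
])

# Plain keyword search: one match per occurrence
plain = re.compile(r"turtle", re.IGNORECASE)
plain_count = len(plain.findall(text))

# Keyword followed by "everything up to the end of the text": a single match
greedy = re.compile(r"turtle.*", re.IGNORECASE | re.DOTALL)
greedy_matches = list(greedy.finditer(text))

# 0-based index of the line on which that single match starts
first_line = text.count("\n", 0, greedy_matches[0].start())

print(plain_count, len(greedy_matches), first_line)
```

          The plain search hits every turtle line, while the template-style search yields exactly one match, starting on the first line that contains the keyword.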
          

          Now :

          • Go, towards the end, to the first keyword of your list

          • Select the entire regex (?i)turtle(?s).*

          • Open the Mark dialog ( Search > Mark… )

          • Check the Bookmark line and Wrap around options

          • Uncheck the Purge for each search option

          • Click on the Mark All button

          • Hit the Esc key to close the Mark dialog

          • Now, select the next keyword, by selecting the appropriate regex (?i)Keyword(?s).*

          • Open the Mark dialog ( Search > Mark… )

          • Click on the Mark All button

          • Hit the Esc key, to close the Mark dialog

          • Go on, till the last keyword, by repeating from the step, above, that begins with Now, select the next keyword

          …

          • At the end, simply, select the menu option Search > Bookmark > Remove Unmarked lines

          • Finally, delete any mark, with the option Search > Mark… > Clear all marks


          Obviously, this manipulation is doable for a small number of keywords only ( I would say : not more than 50, at most ! ). Unfortunately, Viktoria, as you spoke about files containing 5000 keywords or so, you cannot do it by hand :-((

          But, this problem could be easily solved by using a Python or Lua script ! And I’m quite sure that some people, such as @scott-sumner, @claudia-frank or @dail, will find the right script, very soon ;-)).


          So, the goal is :

          • Given a first file, containing the text in which the keywords occur

          • Given a second file, containing a list of regexes, built with the template (?i)Keyword(?s).*

          • Bookmark all occurrences of EACH regex of the second file, in the first file

          • Possibly, at the end, delete all the unmarked lines of the first file ( option Search > Bookmark > Remove Unmarked lines )

          Cheers,

          guy038

          P.S. :

          Viktoria, I suppose that the template (?i)\bKeyword\b(?s).* is a better regex, as it avoids matching keywords that are glued inside a longer word !

          => The three regexes of your example, become :

          (?i)\bturtles?\b(?s).*
          (?i)\bapples?\b(?s).*
          (?i)\bHungarian\b(?s).*
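          Wrapping the template around thousands of keywords by hand would be tedious. A minimal sketch of generating the regex list, assuming plain-word keywords without regex metacharacters (the function name to_mark_regex is made up for this illustration):

```python
def to_mark_regex(keyword):
    """Wrap one keyword in the whole-word template (?i)\\bKeyword\\b(?s).*"""
    # Assumption: the keyword contains no regex metacharacters such as . * ( )
    return r"(?i)\b" + keyword + r"\b(?s).*"

keywords = ["turtle", "apple", "Hungarian"]
regex_lines = [to_mark_regex(keyword) for keyword in keywords]

# One regex per line, ready to be pasted at the end of the text
print("\n".join(regex_lines))
```

          Unlike the hand-written turtles? above, the generated regexes do not allow for a plural s, so plural forms would need their own entries in the keyword list.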

          • Claudia FrankC
            Claudia Frank
            last edited by

            @viktoria-ontapado

            In case you want to solve it with a script: what do you do
            with the bookmarked lines? Maybe this can be included as well.

            Regarding the files to check and the keywords,
            do their names have some relation we could use?
            For example german_file1_to_check.txt and german_keywords.txt or similar?

            Cheers
            Claudia

            • Viktoria OntapadoV
              Viktoria Ontapado @guy038
              last edited by Viktoria Ontapado

              Dear @guy038,

              Thank you very much for your detailed response, I can’t describe how grateful I am. Before I made this thread I
              ran through some topics and encountered your enormous knowledge and you delivered again.

              I’m much obliged, you’ve already done my work a great deal easier, thank you so much.

              • Viktoria OntapadoV
                Viktoria Ontapado @Claudia Frank
                last edited by

                @Claudia-Frank

                Hello Claudia,
                Thank you for your assistance.

                I just want to copy the bookmarked lines from the TXTs and paste the content into other, new TXTs.

                As for your second point, I hope I don’t misunderstand you: do you mean whether there is a correlation between the names of the files?

                The TXTs with the sentences have the following names.
                Czech_sentences
                Danish_sentences
                Dutch_sentences
                and so on…

                The ones with the keywords are:
                Czech_keywords
                Danish_keywords
                Dutch_keywords
                and so on…

                With regard to the first point, I’d like to name the new textfiles (with the copied bookmarked lines) as:
                Czech_keytences
                Danish_keytences
                Dutch_keytences
                and so on…

                I hope I answered your question.
                It’d be incredible to automate the solution by guy038. I don’t know the chances of that because I have absolutely no knowledge of scripts and coding. Would it basically be a copy-paste solution from my point of view?

                Since I don’t know anything about the process, should I give you the names of all the files (I have more than 50, I think), or is the above-mentioned pattern already enough?

                Thanks,
                Viktória

                • Claudia FrankC
                  Claudia Frank @Viktoria Ontapado
                  last edited by Claudia Frank

                  @Viktoria-Ontapado

                  Yes, those are the answers I need.
                  One additional question: what does the keyword file look like?
                  I mean, how are the words separated?
                  Maybe you want to post 2-3 lines, if possible?
                  Preferably from the German or English version ;-)

                  And yes, once I have the script it is basically a copy and paste action.
                  I will describe in detail, step by step, what you need to do.

                  In the meantime you could install the Python Script plugin by downloading
                  and installing the msi package from here.
                  It is also available via the plugin manager, but it has been reported that it too often
                  doesn’t install correctly, therefore I would recommend using the msi installer instead.

                  Cheers
                  Claudia

                  • Viktoria OntapadoV
                    Viktoria Ontapado @Claudia Frank
                    last edited by

                    @Claudia-Frank

                    The words in the keyword files are separated exactly like the sentences in the sentence files, so there’s a linebreak (CR LF) after every item as in:
                    stecken
                    besuchen
                    die Antwort
                    fertig
                    zuletzt
                    die Polizei
                    das Glück
                    auch
                    .
                    .
                    .
                    Meanwhile I installed the plugin, thank you for the guidance.

                    One additional thing if you don’t mind me asking about this.
                    I don’t know at all how much work is needed for this kind of magic, so if it’s inappropriate or excessive, just forget about it.

                    I opened this topic because at the moment I’m working with these files where I only need the first occurrence of bookmarked lines.
                    But in the future I’m planning to do a similar thing - with exactly the same sources - where I need every result/bookmark for the keywords (where, as I wrote in my initial post, the simple (apple|turtle|hungarian) regex will solve my marking problem).

                    Is it possible that you can construct a script for this other option as well? So the new txt in the original example would contain all sentences with the word apple then all sentences with the word turtle then all sentences with the word hungarian etc.

                    If it’s too time-consuming or complicated, please don’t waste your time. I just thought I’d ask about this as well because it’d be a tremendous help, and as a layman my suspicion is that there’s not much difference between the two scripts, but maybe I’m entirely wrong.

                    It’s 3 a.m. here so any further reply will be a bit later.

                    Thank you for your help and good night,
                    Viktória

                    • Claudia FrankC
                      Claudia Frank @Viktoria Ontapado
                      last edited by

                      @Viktoria-Ontapado

                      Here is the first version, but first let’s check that the Python Script plugin is correctly installed.

                      Go to Plugins->Python Script->Show Console

                      A new window with the content similar to

                      Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)]
                      Initialisation took 231ms
                      Ready.

                      should have opened.
                      Type

                      notepad.new() 
                      

                      into the textbox at the bottom of the newly opened window and press the run button next to it.

                      A new document should have opened and the console should show an additional message

                      >>> notepad.new()

                      If this is the case, go on, otherwise post the error from the console at the forum.

                      Next step, create the script.

                      Go to Plugins->Python Script->New Script

                      A window opens and asks for a name to be used to save the new file.
                      Just give it a meaningful name and save it - don’t change path etc…
                      Now a new document with that name has been opened.
                      Copy the following script into it and save the file.

                      # lines starting with the hash char are comments, like this line
                      # needed to be able to read the file as utf-8 encoded content
                      import codecs
                      
                      
                      def read_keyword_file(_file):
                          # create the keyword file name path
                          keyword_file_to_read = _file.replace('_sentences','_keywords')
                          # reset _keywords variable to prevent search with wrong keywords
                          _keywords = []
                          # open the keyword file as utf-8 encoded file
                          with codecs.open(keyword_file_to_read, 'r', 'utf-8') as f:
                              # and read line by line and create the keyword list
                              _keywords = [line.strip() for line in f]
                              
                          # return the new created keyword list
                          return _keywords
                      
                      
                      # get the complete file path of current document
                      _file =  notepad.getCurrentFilename()
                      
                      # read and create the keyword word list from the proper keyword file
                      keywords = read_keyword_file(_file)
                      
                      # variable to store the content for the new file
                      new_file_content = ''
                      
                      # loop over the keywords
                      for word in keywords:
                          # and find each first position
                          position = editor.findText(0,0,editor.getTextLength(), word)
                          # if we found the position of the keyword
                          if position is not None:
                              # append it to the new file content
                              new_file_content += editor.getLine(editor.lineFromPosition(position[0]))
                      
                      # open a new document
                      notepad.new()
                      # and add the new content
                      editor.addText(new_file_content)
                      # save it in the same directory as the original sentences file but with keytences in it.
                      notepad.saveAs(_file.replace('_sentences','_keytences'))
                      
                      
                      console.write('{}\nScript finished!!\n'.format(_file))
                      

                      Done - that’s it for the script part.

                      Now how does it work?
                      It is assumed that the sentences files and the keyword files are in the same directory.
                      You open a sentences file, go to Plugins->Python Script->Scripts and click the name of the script you saved.
                      If everything works as expected, a new file should open, containing the sentences you are looking for.
                      If you still have the console open you should also see an additional message:
                      the name of the file and the statement that the script has finished.
                      Depending on the file size and the number of keywords it might take a couple of seconds.
                      Check if the result is ok; if so, go on with the next sentences file.
                      If not, let us know.
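
                      The script derives both the keyword file name and the output file name by replacing _sentences in the complete path. A slightly more defensive sketch, under the assumption that only the file name should change (sibling_path is a hypothetical helper, not part of the script above): it leaves folder names untouched and complains if the open file is not a sentences file.

```python
import os

def sibling_path(sentences_path, suffix):
    """Return the path of the matching '_keywords' or '_keytences' file."""
    folder, name = os.path.split(sentences_path)
    if '_sentences' not in name:
        raise ValueError('not a sentences file: ' + name)
    # replace only in the file name, never in the folder part
    return os.path.join(folder, name.replace('_sentences', suffix, 1))

# hypothetical folder name that happens to contain '_sentences' as well
source = os.path.join('all_sentences', 'German_sentences.txt')
keyword_file = sibling_path(source, '_keywords')
output_file = sibling_path(source, '_keytences')
print(keyword_file)
print(output_file)
```

                      Inside the script this would replace the two _file.replace(...) calls; the early error makes it obvious when the script is started from the wrong tab.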

                      A few words of warning

                      1. if you run the script a second time on the same sentences file, the resulting new file will be overwritten.
                      2. It is assumed that the keyword file is utf-8 encoded. This is important if special chars like Ä or Ö (German Umlaute)
                        are used.
                      3. Always expect the unexpected - meaning save your work.

                      If you find the script useful, you can make one further little addition:
                      assign a shortcut/icon to execute the script.
                      To do so, go to Plugins->Python Script->Configuration and add your script to the menu and/or the toolbar - restart npp.
                      If you chose to add it to the toolbar, a new icon should have appeared.
                      If you added it to the menu you can open the shortcut mapper (Settings->Shortcut Mapper), select the plugins tab,
                      search for your script and assign a shortcut.

                      I guess that’s it. Questions? Feel free to ask.

                      Concerning the second version, to find all occurrences of a keyword,
                      there is no problem to either adapt the same script with a question at
                      the beginning to ask which version you want to run or creating a second
                      standalone script. Let me know what you want to do.
                      A question which just came to my mind - what if multiple keywords are in the same line? Like you have a sentence

                      Das Glück war das die Polizei die Antwort kannte.
                      

                      Should this result in 3 new lines?

                      Good night.
                      Claudia

                      • Viktoria OntapadoV
                        Viktoria Ontapado @Claudia Frank
                        last edited by

                        @Claudia-Frank

                        Well, I couldn’t sleep yet so I can make another reply.
                        I really appreciate that you took the time and wrote down everything in a simple and detailed manner, it was immensely helpful.

                        I did everything as you wrote (I used the German sentence base with the above-mentioned 8 samplewords) and it’s working very well, thank you so much.


                        A question and an issue though:

                        1.
                        Should the first line of a keyword file be empty?
                        I noticed that if the first line of the txt isn’t empty, the relevant sentence is missing. So, regarding our sample, if the first line immediately starts with stecken, then I only get 7 keytences and no sentence with the match “stecken”. If the first line is empty, then I get all 8 keytences. It’s a totally unimportant issue for me, but if the first line should be empty, I will edit the relevant keyword files accordingly.

                        2.
                        The 8th result is the following keytence: Ich brauche Eis.
                        If I understood the explanation of guy038 correctly , the template (?i)\bKeyword\b(?s).* should mean that the above result is wrong because we should get something with the word “auch”.


                        “Concerning the second version, to find all occurrences of a keyword,
                        there is no problem to either adapt the same script with a question at
                        the beginning to ask which version you want to run or creating a second
                        standalone script.”

                        I’d prefer a standalone script if it’s possible.

                        “A question which just came to my mind - what if multiple keywords are in the same line?”

                        Thank you so much for this question, really, because I didn’t think about it earlier. Now I’ve had time to think this through and, to be perfectly honest, both options would be quite useful in the future.

                        So all in all, is it doable to make 4 standalone scripts by any chance?

                        1st
                        Results with only first occurrences + in the case of “Das Glück war das die Polizei die Antwort kannte.” your example should be displayed in the keytence file 3 times. So if it’s the first bookmarked instance regarding the word das Glück, it’s displayed. If it’s also the first occurrence of die Polizei, it should be displayed again, etc.
                        So we get a result like this one:

                        Sie stecken in Schwierigkeiten.
                        Komm mich besuchen.
                        Das Glück war das die Polizei die Antwort kannte.
                        Ich bin jetzt fertig.
                        Er lachte zuletzt.
                        Das Glück war das die Polizei die Antwort kannte.
                        Das Glück war das die Polizei die Antwort kannte.
                        Ich brauche Eis. -> Sentence with the exact word auch.

                        2nd
                        Results with only first occurrences + in the case of “Das Glück war das die Polizei die Antwort kannte.” your example should be displayed in the keytence file only once.

                        Sie stecken in Schwierigkeiten.
                        Komm mich besuchen.
                        Das Glück war das die Polizei die Antwort kannte.
                        Ich bin jetzt fertig.
                        Er lachte zuletzt.
                        [The first occurrence of die Polizei is already displayed so it displays the second occurrence of “die Polizei”]
                        [The first and second occurrence of das Glück is already displayed so it displays the third occurrence of “das Glück”]
                        Ich brauche Eis. -> Sentence with the exact word auch.

                        3rd
                        Results with all bookmarked lines + “Das Glück war das die Polizei die Antwort kannte.” results in 3 new lines.

                        4th
                        Results with all bookmarked lines + “Das Glück war das die Polizei die Antwort kannte.” results in only 1 new line.


                        I hope I managed to write down everything carefully and without errors despite my tiredness.
                        Again, thank you for the invaluable help you are providing.

                        Now it’s time to get some sleep,
                        Viktória

                        • Viktoria OntapadoV
                          Viktoria Ontapado
                          last edited by

                          I knew I’d get something wrong.

                          This line in my previous post (regarding the 2nd script)
                          [The first and second occurrence of das Glück is already displayed so it displays the third occurrence of “das Glück”]
                          should be:

                          [The first occurrence of das Glück is already displayed so it displays the second occurrence of “das Glück”]

                          • Claudia FrankC
                            Claudia Frank
                            last edited by

                            Just a quick response as I’m in a hurry - still have to do a lot of work for my nephew’s birthday.

                            Issue with empty line - no, it wasn’t the intention, most probably a bug in the script - will check.
                            Issue auch and brauche - correct, this is a bug. To be honest, I didn’t read the conversation
                            between guy and you completely, so I missed that pitfall - will be fixed.
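
                            The difference between the two kinds of search, as a minimal sketch in plain Python outside Notepad++ (the re module with \b stands in for the editor's whole-word option, which is an assumption about equivalent behaviour; contains_word is a name made up here):

```python
import re

def contains_word(word, text):
    """True if word occurs in text as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE | re.UNICODE) is not None

sentence = u"Ich brauche Eis."

# substring search: 'auch' is found inside 'brauche'
substring_hit = u"auch" in sentence

# whole-word search: a word boundary is required on both sides
whole_word_hit = contains_word(u"auch", sentence)

print(substring_hit, whole_word_hit)
```

                            The substring test reports a hit for this sentence, the whole-word test does not.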

                            So all in all, is it doable to make 4 standalone scripts by any chance?

                            Yes - I will follow up on this either this evening or tomorrow morning, depending on
                            what happens at the party ;-)

                            Cheers
                            Claudia

                            • Viktoria OntapadoV
                              Viktoria Ontapado @Claudia Frank
                              last edited by

                              @Claudia-Frank

                              Thank you, Claudia and have a good time!

                              • Claudia FrankC
                                Claudia Frank @Viktoria Ontapado
                                last edited by

                                @Viktoria-Ontapado

                                I don’t see the reported issue about the first line being empty or not,
                                therefore I have added two debug statements that will print the information
                                to the Python Script console. Hopefully we will find out what’s going on.
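
                                One possible cause, offered as a guess and not a confirmed diagnosis: if the keyword file was saved as UTF-8 with BOM, the invisible BOM character stays glued to whatever is on the first line when the file is decoded as plain utf-8, so the first keyword never matches; with an empty first line, the BOM sits on that empty line instead. A minimal sketch outside Notepad++ (read_keywords is a hypothetical replacement for read_keyword_file) that decodes with utf-8-sig, which drops a leading BOM, and also skips blank lines:

```python
import io
import os
import tempfile

def read_keywords(path):
    """Read one keyword per line, ignoring a leading BOM and blank lines."""
    with io.open(path, 'r', encoding='utf-8-sig') as f:
        return [line.strip() for line in f if line.strip()]

# build a small test file that starts with a BOM, as some editors write it
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'wb') as f:
    f.write(u'\ufeffstecken\r\nbesuchen\r\n\r\n'.encode('utf-8'))

# plain utf-8 decoding keeps the BOM attached to the first keyword
with io.open(path, 'r', encoding='utf-8') as f:
    naive = [line.strip() for line in f]

fixed = read_keywords(path)
os.remove(path)

print(repr(naive[0]), fixed)
```

                                If the debug output of the script shows the first word with a strange leading character, this would be the explanation; if not, the cause lies elsewhere.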

                                Here is the new script with the fix (whole word only).
                                Use it as a template for the three other versions and replace only the for loop,
                                as shown below.

                                Script 1 (complete code)
                                find first occurrence of each keyword even if same line has been already found

                                # needed to be able to read the file as utf-8 encoded content
                                import codecs
                                
                                def read_keyword_file(_file):
                                    # create the keyword file name path
                                    keyword_file_to_read = _file.replace('_sentences','_keywords')
                                    # reset _keywords variable to prevent search with wrong keywords
                                    _keywords = []
                                    # open the keyword file as utf-8 encoded file
                                    with codecs.open(keyword_file_to_read, 'r', 'utf-8') as f:
                                        # and read line by line and create the keyword list
                                        _keywords = [line.strip() for line in f]
                                        
                                    # return the new created keyword list
                                    return _keywords
                                
                                
                                # get the complete file path of current document
                                _file = notepad.getCurrentFilename()
                                
                                # calculate end position (=text length)
                                end_position = editor.getTextLength()
                                
                                # read and create the keyword word list from the proper keyword file
                                keywords = read_keyword_file(_file)
                                
                                # variable to store the content for the new file
                                new_file_content = ''
                                
                                #replace code starting from here
                                # loop over the keywords
                                for word in keywords:
                                    console.write('word:{}\n'.format(word.encode('utf-8')))
                                    # and find each first position
                                    position = editor.findText(FINDOPTION.WHOLEWORD, 0, end_position, word)
                                    # if we found the position of the keyword
                                    if position is not None:
                                        console.write('{}\n'.format(editor.getLine(editor.lineFromPosition(position[0]))))
                                        # append it to the new file content
                                        new_file_content += editor.getLine(editor.lineFromPosition(position[0]))
                                #to here
                                
                                # open a new document
                                notepad.new()
                                # and add the new content
                                editor.addText(new_file_content)
                                # save it in the same directory as the original sentences file but with keytences in it.
                                notepad.saveAs(_file.replace('_sentences','_keytences'))
                                
                                
                                console.write('{}\nScript finished!!\n'.format(_file))
                                

                                As you see it contains two comments
                                #replace code starting from here
                                and
                                #to here

                                because, instead of posting the whole script 4 times, I post the changes of the three
                                other scripts only.

                                Script 2 (only code in loop for word in keywords changed)
                                find first occurrence of each keyword but do not count same line more than once

                                # remember the lines already added (exact lines, not substrings)
                                seen_lines = set()
                                # loop over the keywords
                                for word in keywords:
                                    # and find each first position
                                    position = editor.findText(FINDOPTION.WHOLEWORD, 0, end_position, word)
                                    # if we found the position of the keyword
                                    if position is not None:
                                        # get new line
                                        _new_line = editor.getLine(editor.lineFromPosition(position[0]))
                                        # check if the line hasn't been added yet
                                        if _new_line not in seen_lines:
                                            seen_lines.add(_new_line)
                                            # append it to the new file content
                                            new_file_content += _new_line
                                

                                Script 3 (only code in loop for word in keywords changed)
                                find every occurrence of each keyword even if same line has been already found

                                # loop over the keywords
                                for word in keywords:
                                    # find the first position of the keyword
                                    position = editor.findText(FINDOPTION.WHOLEWORD, 0, end_position, word)
                                    # as long as an occurrence of the keyword is found
                                    while position is not None:
                                        # append its line to the new file content
                                        new_file_content += editor.getLine(editor.lineFromPosition(position[0]))
                                        # find the next position
                                        position = editor.findText(FINDOPTION.WHOLEWORD, position[1]+1, end_position, word)
                                

                                Script 4 (only the code in the loop for word in keywords changed)
                                Finds every occurrence of each keyword, but does not count the same line more than once.

                                # loop over the keywords
                                for word in keywords:
                                    # find the first position of the keyword
                                    position = editor.findText(FINDOPTION.WHOLEWORD, 0, end_position, word)
                                    # as long as an occurrence of the keyword is found
                                    while position is not None:
                                        # get the line containing the match
                                        _new_line = editor.getLine(editor.lineFromPosition(position[0]))
                                        # check if the line hasn't been added yet
                                        if _new_line not in new_file_content:
                                            # append it to the new file content
                                            new_file_content += _new_line
                                        # find the next position
                                        position = editor.findText(FINDOPTION.WHOLEWORD, position[1]+1, end_position, word)
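
To compare Script 3 and Script 4 without running them inside Notepad++, here is a rough plain-Python model of the two loops. It is only a sketch under assumptions: whole-word, case-insensitive matching is imitated with a regular expression (the real scripts use editor.findText, not regex), the sentence file is a list of strings, and the function name and the unique flag are invented for this illustration.

```python
import re

def lines_for_every_occurrence(lines, keywords, unique=False):
    """Model of Script 3 (unique=False) and Script 4 (unique=True)."""
    result = []
    seen = set()  # indexes of lines that were already added
    for word in keywords:
        # Assumed stand-in for FINDOPTION.WHOLEWORD: whole word, ignoring case.
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        for index, line in enumerate(lines):
            # One pass per occurrence, like the while loop in the scripts.
            for _ in pattern.finditer(line):
                if unique and index in seen:
                    continue
                seen.add(index)
                result.append(line)
    return result

sample_lines = [
    "Er lachte zuletzt.",
    "Ich bin auch siebzehn.",
    "Zuletzt lachte er auch.",
]
sample_keywords = ["zuletzt", "auch"]

every_hit = lines_for_every_occurrence(sample_lines, sample_keywords)
unique_hits = lines_for_every_occurrence(sample_lines, sample_keywords, unique=True)
print(len(every_hit))    # 4: the third sentence is listed once per keyword
print(len(unique_hits))  # 3: each sentence is listed only once
```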
                                

                                Python is fussy about indentation,
                                so make sure that you keep the formatting when copying and replacing the code.

                                Cheers
                                Claudia

                                • Viktoria OntapadoV
                                  Viktoria Ontapado
                                  last edited by Viktoria Ontapado

                                  @Claudia-Frank

                                  Thank you for your efforts so far. I went through the scripts you provided before getting some sleep.


                                  Two observations:

                                  1.
                                  The first-line issue I mentioned earlier is still present.
                                  With our sample, if the keyword file doesn’t start with an empty line, I don’t get the result for the word ‘stecken’. This issue applies to all 4 scripts.
                                  Thanks to your debug statements I noticed a difference in the console output, so I hope we have some solid leads on this problem.

                                  Without empty line:

                                  word:stecken
                                  word:besuchen
                                  Come and see me. Komm mich besuchen.

                                  word:die Antwort
                                  The answer is 42. Die Antwort lautet zweiundvierzig.

                                  word:fertig
                                  That’s me. Ich bin jetzt fertig.

                                  word:zuletzt
                                  He had the last laugh. Er lachte zuletzt.

                                  word:die Polizei
                                  Call the police! Rufen Sie die Polizei!

                                  word:das Glück
                                  Luck is against me. Das Glück hat sich gegen mich gewendet.

                                  word:auch
                                  I’m 17, too. Ich bin auch siebzehn.

                                  C:\Users\user\Documents\mati\German_sentences.txt
                                  Script finished!!

                                  With empty line:

                                  word:
                                  word:stecken
                                  You’re in trouble. Sie stecken in Schwierigkeiten.

                                  word:besuchen
                                  Come and see me. Komm mich besuchen.

                                  word:die Antwort
                                  The answer is 42. Die Antwort lautet zweiundvierzig.

                                  word:fertig
                                  That’s me. Ich bin jetzt fertig.

                                  word:zuletzt
                                  He had the last laugh. Er lachte zuletzt.

                                  word:die Polizei
                                  Call the police! Rufen Sie die Polizei!

                                  word:das Glück
                                  Luck is against me. Das Glück hat sich gegen mich gewendet.

                                  word:auch
                                  I’m 17, too. Ich bin auch siebzehn.

                                  C:\Users\user\Documents\mati\German_sentences.txt
                                  Script finished!!

                                  2.
                                  The second script has a specific problem regarding our intent.
                                  At the moment, with our sample base of 8 keywords, I get 5 keytences without an empty first line and 6 with one.

                                  Working with your example sentence (Das Glück war das die Polizei die Antwort kannte.)
                                  that means when the script arrives at the keyword “die Polizei”, it simply omits the result, just like later with “das Glück”. (Because their first occurrences are already displayed as a match for ‘die Antwort’)

                                  I’ll try to expand on my idea again in case there is any misunderstanding.

                                  In the case of this 2nd script, we need the same number of keytences as in the 1st script. So the 2 missing keytences shouldn’t be omitted; instead, as I wrote in my previous reply:

                                  The second script’s behaviour should be:
                                  [The first occurrence of die Polizei is already displayed in a result for the ‘die Antwort’ keyword, so it displays the second occurrence of ‘die Polizei’]
                                  [The first occurrence of das Glück is already displayed in a result for the ‘die Antwort’ keyword, so it displays the second occurrence of ‘das Glück’]

                                  Or, going further with this thought: if the sentence which has the second occurrence of ‘die Polizei’ [which is the first “pure next occurrence” in the sense that it contains ‘die Polizei’ without containing ‘die Antwort’] contains ‘das Glück’ as well (Example: Das Glück hat ihn verlassen, die Polizei verfolgt ihn.), then for the subsequent keyword “das Glück” the following rule should be applied:

                                  [The first occurrence of das Glück is already displayed in a result for the ‘die Antwort’ keyword; the second occurrence of ‘das Glück’ is already displayed in a result for the ‘die Polizei’ keyword; so now it displays the third occurrence of ‘das Glück’]

                                  I hope I managed to express my idea a bit more clearly and that I didn’t mess up the detailed explanation.



                                  Apart from these two points, I didn’t notice any other issue during testing.

                                  Thank you,
                                  Viktória

                                  • Claudia FrankC
                                    Claudia Frank @Viktoria Ontapado
                                    last edited by

                                    @Viktoria-Ontapado

                                    Hello Viktória,

                                    Concerning the issue with the empty line, I do see that this causes a problem,
                                    but my observation is a little different from yours: it looks like
                                    an empty line in the keywords file always causes the first line to be found as a match.
                                    Meaning, the code line

                                    position = editor.findText(FINDOPTION.WHOLEWORD, 0, end_position, word)
                                    

                                    returns the position (0,0). (I thought it would return None, as it does when it doesn’t find anything.)

                                    Could you please replace the first console.write statement with the version below
                                    and redo the test? I expect the result to be the same as before, but I hope to see
                                    a difference in the console output.

                                    console.write('word:{}\nlength:{}\n'.format(word.encode('utf-8'),len(word.encode('utf-8'))))
                                    

                                    In order to get rid of the empty line issue I see, replace the
                                    keyword list creation code with this one

                                    _keywords = [line.strip() for line in f if len(line.strip()) > 0]
                                    

                                    Which basically means that an empty line doesn’t get added to the keyword list.
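
To show what that list comprehension does, here is a small standalone sketch. It runs outside Notepad++ and uses an in-memory text stream in place of the real keywords file; the sample content and the helper name load_keywords are assumptions made for this example only.

```python
import io

def load_keywords(handle):
    """Return the stripped, non-empty lines of an open text file object."""
    return [line.strip() for line in handle if line.strip()]

# Stand-in for the keywords file: an empty first line, a blank line,
# and a line containing only spaces.
sample_file = io.StringIO(u"\nstecken\nbesuchen\n\n   \ndie Antwort\n")
keywords = load_keywords(sample_file)
print(keywords)
```

Only the three real keywords should remain; the empty line and the whitespace-only line are dropped before the search loop ever sees them.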

                                    Regarding script version 2 I’m confused as it sounds you expect the output
                                    to be the same as in script version 1. In that case there is no need for another version
                                    if it should result in the same output.

                                    If this isn’t your intention, could you do me a favor and create a sample
                                    sentences file together with a keywords file and the 4 expected keytences files
                                    and post it here?

                                    Cheers
                                    Claudia

                                    • Viktoria OntapadoV
                                      Viktoria Ontapado @Claudia Frank
                                      last edited by Viktoria Ontapado

                                      @Claudia-Frank

                                      Hello,

                                      I modified the lines you noted; here are the results:

                                      Without empty line:

                                      word:stecken
                                      length:10
                                      word:besuchen
                                      length:8
                                      Komm mich besuchen.

                                      word:die Antwort
                                      length:11
                                      Die Antwort lautet zweiundvierzig.

                                      word:fertig
                                      length:6
                                      Ich bin jetzt fertig.

                                      word:zuletzt
                                      length:7
                                      Er lachte zuletzt.

                                      word:die Polizei
                                      length:11
                                      Rufen Sie die Polizei!

                                      word:das Glück
                                      length:10
                                      Das Glück hat sich gegen mich gewendet.

                                      word:auch
                                      length:4
                                      Ich bin auch siebzehn.

                                      C:\Users\user\Documents\mati\German_sentences.txt
                                      Script finished!!

                                      With empty line:

                                      word:
                                      length:3
                                      word:stecken
                                      length:7
                                      Sie stecken in Schwierigkeiten.

                                      word:besuchen
                                      length:8
                                      Komm mich besuchen.

                                      word:die Antwort
                                      length:11
                                      Die Antwort lautet zweiundvierzig.

                                      word:fertig
                                      length:6
                                      Ich bin jetzt fertig.

                                      word:zuletzt
                                      length:7
                                      Er lachte zuletzt.

                                      word:die Polizei
                                      length:11
                                      Rufen Sie die Polizei!

                                      word:das Glück
                                      length:10
                                      Das Glück hat sich gegen mich gewendet.

                                      word:auch
                                      length:4
                                      Ich bin auch siebzehn.

                                      C:\Users\user\Documents\mati\German_sentences.txt
                                      Script finished!!


                                      Script1, script3 and script4 are working superbly as far as I can see, apart from our empty line issue.
                                      Regarding the 2nd script, it works as well, but not exactly the way I’d like, so this time I’ll try to illustrate my idea with a concrete example, as you asked. I don’t know whether you can modify it accordingly, but it would be awesome.

                                      As usual, I’m using our sample keyword list with 8 items.
                                      (To ease my explanation, I assign a consecutive number to every keyword):

                                      stecken (1)
                                      besuchen (2)
                                      die Antwort (3)
                                      fertig (4)
                                      zuletzt (5)
                                      die Polizei (6)
                                      das Glück (7)
                                      auch (8)


                                      I made a sample sentencelist with 20 short examples. So this is our German_sentences.txt now:

                                      Sie stecken in Schwierigkeiten.
                                      Komm mich besuchen.
                                      Stecken Sie Ihre Waffe ins Halfter!
                                      Das Glück war das die Polizei die Antwort kannte.
                                      Die Antwort gefällt mir.
                                      Ich bin jetzt fertig.
                                      Er lachte zuletzt.
                                      Das Glück hat ihn verlassen, die Polizei verfolgt ihn.
                                      Wann hast du sie zuletzt gesehen?
                                      Ich liebe dich.
                                      Ich bin auch siebzehn.
                                      Ich bin auch achtzehn.
                                      Rufen Sie die Polizei!
                                      Ich bin auch zwanzig.
                                      Wann kann ich dich besuchen?
                                      Das Glück war ihm hold.
                                      Mach das fertig.
                                      Das Glück ist nicht so launenhaft.
                                      Steht mir dieses Kleid?
                                      Ich esse.


                                      Your 2nd script will produce the following keytences file:

                                      Sie stecken in Schwierigkeiten. (1)
                                      Komm mich besuchen. (2)
                                      Das Glück war das die Polizei die Antwort kannte. (3)
                                      Ich bin jetzt fertig. (4)
                                      Er lachte zuletzt. (5)
                                      Ich bin auch siebzehn. (8)

                                      So the script found the 1st sentence with stecken; the 1st sentence with besuchen; the 1st sentence with die Antwort; the 1st sentence with fertig and the 1st sentence with zuletzt. Then it omitted the result for die Polizei because its first occurrence was already displayed (3); then it omitted the result for das Glück because its first occurrence was already displayed (3); then it found the 1st result with auch.

                                      If it’s somehow achievable, what I’d like to get based on our sample sentencelist is this:

                                      Sie stecken in Schwierigkeiten. (1)
                                      Komm mich besuchen. (2)
                                      Das Glück war das die Polizei die Antwort kannte. (3)
                                      Ich bin jetzt fertig. (4)
                                      Er lachte zuletzt. (5)
                                      Das Glück hat ihn verlassen, die Polizei verfolgt ihn. (6)
                                      Das Glück war ihm hold. (7)
                                      Ich bin auch siebzehn. (8)

                                      So the script finds the 1st sentence with stecken; the 1st sentence with besuchen; the 1st sentence with die Antwort; the 1st sentence with fertig and the 1st sentence with zuletzt.

                                      • It skips the first result with die Polizei because that first occurrence was already displayed (3), but instead of returning 0 results it jumps to the next occurrence, so we now have a sentence with die Polizei as well.

                                      • Then the script skips the first result with das Glück because its first occurrence was already displayed (3).
                                        The next occurrence in our example would be (6), but it should skip that as well, because that sentence now belongs to the keyword die Polizei and a given sentence/line shouldn’t belong to more than one keyword.
                                        So it jumps to the next occurrence, in this case the third instance of das Glück, so we get its sentence (7).

                                      (You noted that “I’m confused as it sounds you expect the output
                                      to be the same as in script version 1.”)

                                      As you see, that’s not the case, because for these three keywords [die Antwort, die Polizei, das Glück]:

                                      1. Script1 will give these results:

                                      Das Glück war das die Polizei die Antwort kannte.
                                      Das Glück war das die Polizei die Antwort kannte.
                                      Das Glück war das die Polizei die Antwort kannte.

                                      2. Script2 currently gives these results:

                                      Das Glück war das die Polizei die Antwort kannte.
                                      NO RESULT
                                      NO RESULT

                                      3. I wish Script2 would give these results:

                                      Das Glück war das die Polizei die Antwort kannte.
                                      Das Glück hat ihn verlassen, die Polizei verfolgt ihn.
                                      Das Glück war ihm hold.

                                      I’m quite sure that I’m the reason for the misunderstanding and that there was a misleading, ambiguous part in my earlier posts in this regard, sorry for that.

                                      I described my 2nd script earlier as:
                                      “Results with only first occurences + in the case of “Das Glück war das die Polizei die Antwort kannte.” your example should be displayed in the keytence only 1 time.”
                                      What I meant by this is that each keyword should get its first occurrence that has not been displayed yet. So every keyword needs its own keytence, one that does not already belong to a former keyword.

                                      Is the problem perhaps that the script is based on the template (?i)\bKeyword\b(?s).* so that from the start it only selects the first bookmarked occurrences and never has a chance to jump to the next instance?

                                      Did I manage to explain it properly?

                                      • Claudia FrankC
                                        Claudia Frank @Viktoria Ontapado
                                        last edited by

                                        @Viktoria-Ontapado

                                        A quick reply regarding the empty line.
                                        As you can see, there seem to be additional characters in front of the word stecken.
                                        The correct length is 7, but it is reported as 10.
                                        So I assume your keyword file is not plain UTF-8 but maybe UTF-8 with BOM?
                                        What does npp report in the status bar (bottom line)?
                                        I mean the second field from the right.
                                        The rightmost field should be either INS or OVR.
                                        I’m interested in the field to the left of INS/OVR.
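
The arithmetic behind that guess can be checked with a few lines of plain Python, independent of Notepad++. This is only a sketch of the suspected cause: the UTF-8 byte order mark is 3 bytes long, which would explain both the length of 10 for stecken and the length of 3 for the seemingly empty first keyword. The variable names are chosen for this example.

```python
import codecs

keyword = u"stecken"
plain_bytes = keyword.encode("utf-8")
bom_bytes = codecs.BOM_UTF8 + plain_bytes  # how a UTF-8-BOM file would start

print(len(codecs.BOM_UTF8))  # 3
print(len(plain_bytes))      # 7
print(len(bom_bytes))        # 10

# Decoding with "utf-8-sig" drops a leading BOM if one is present.
decoded = bom_bytes.decode("utf-8-sig")
print(len(decoded))          # 7
```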

                                        Thx for the samples - I will have a look and come back on this.

                                        Cheers
                                        Claudia

                                        • Claudia FrankC
                                          Claudia Frank
                                          last edited by

                                          Hello Viktória,

                                          First, let me make clear that this script does not use regular expressions at all.
                                          It just takes the keywords as strings and tries to find them in the text.

                                          Concerning the 2nd script, I think I’ve got it now:
                                          It should search for each keyword, in a loop, until a sentence is returned which
                                          has not been returned yet.

                                          Which is this:

                                          # loop over the keywords
                                          for word in keywords:
                                              # find the first position of the keyword
                                              position = editor.findText(FINDOPTION.WHOLEWORD, 0, end_position, word)
                                              # as long as an occurrence of the keyword is found
                                              while position is not None:
                                                  # get the line containing the match
                                                  _new_line = editor.getLine(editor.lineFromPosition(position[0]))
                                                  # check if the line hasn't been added yet
                                                  if _new_line not in new_file_content:
                                                      # append it to the new file content
                                                      new_file_content += _new_line
                                                      # one line per keyword is enough, so stop searching
                                                      break
                                                  # otherwise try the next occurrence of the keyword
                                                  position = editor.findText(FINDOPTION.WHOLEWORD, position[1]+1, end_position, word)
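
The logic of this loop can be tried against the sample from the previous post without Notepad++. The following is a plain-Python model, not the script itself: it assumes the sentence file is a list of lines, imitates whole-word, case-insensitive matching with a regular expression (the real script uses editor.findText), and remembers used lines by index. The function name is invented for the illustration.

```python
import re

KEYWORDS = ["stecken", "besuchen", "die Antwort", "fertig",
            "zuletzt", "die Polizei", "das Glück", "auch"]

SENTENCES = [
    "Sie stecken in Schwierigkeiten.",
    "Komm mich besuchen.",
    "Stecken Sie Ihre Waffe ins Halfter!",
    "Das Glück war das die Polizei die Antwort kannte.",
    "Die Antwort gefällt mir.",
    "Ich bin jetzt fertig.",
    "Er lachte zuletzt.",
    "Das Glück hat ihn verlassen, die Polizei verfolgt ihn.",
    "Wann hast du sie zuletzt gesehen?",
    "Ich liebe dich.",
    "Ich bin auch siebzehn.",
    "Ich bin auch achtzehn.",
    "Rufen Sie die Polizei!",
    "Ich bin auch zwanzig.",
    "Wann kann ich dich besuchen?",
    "Das Glück war ihm hold.",
    "Mach das fertig.",
    "Das Glück ist nicht so launenhaft.",
    "Steht mir dieses Kleid?",
    "Ich esse.",
]

def first_unused_line_per_keyword(lines, keywords):
    """For each keyword, return the first matching line not already taken."""
    taken = set()  # indexes of lines already assigned to a keyword
    result = []
    for word in keywords:
        # Assumed stand-in for FINDOPTION.WHOLEWORD: whole word, ignoring case.
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        for index, line in enumerate(lines):
            if index not in taken and pattern.search(line):
                taken.add(index)
                result.append(line)
                break  # one line per keyword, like the break in the loop above
    return result

keytences = first_unused_line_per_keyword(SENTENCES, KEYWORDS)
for line in keytences:
    print(line)
```

With this sample the model prints the eight sentences of the wished-for keytences list, in keyword order (1) to (8).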
                                          

                                          Sometimes I can’t see the wood for the trees. :-)

                                          Cheers
                                          Claudia

                                          • Viktoria OntapadoV
                                            Viktoria Ontapado @Claudia Frank
                                            last edited by

                                            @Claudia-Frank

                                            Bingo!
                                            My default encoding is UTF-8 without BOM, so I’m not sure what caused this change, but your guess was right: the keywords file was in UTF-8-BOM.

                                            I converted it to UTF-8 without BOM and that solved the issue; an empty line is no longer needed for the process to work flawlessly, thank you!

                                            My explanation worked as well, because your modified 2nd script does exactly what I was looking for, I’m obliged. (I’m happy about that initial misunderstanding anyway, because this way I have 5 scripts with different tasks. :-)


                                            Finally, to clear up something:

                                            We changed some lines due to this UTF-stuff like:
                                            console.write('word:{}\nlength:{}\n'.format(word.encode('utf-8'),len(word.encode('utf-8')))) and
                                            _keywords = [line.strip() for line in f if len(line.strip()) > 0]

                                            in the scriptbase.

                                            Now that we figured out the BOM-issue, can I return to the initial script and its variants or for safety’s sake should I rather keep the version with these modified lines? What do you suggest?

                                            The Community of users of the Notepad++ text editor.