How to bookmark only the first occurrences of multiple search results?

Claudia Frank

Hello Viktória,

First, let me make clear that this script does not use regular expressions at all.
It just takes the keywords as plain strings and tries to find them in the text.

Concerning the 2nd script, I think I've got it now:
It should search for each keyword, in a loop, until it returns a sentence which
has not been returned yet.

Which is this:

# loop over the keywords
for word in keywords:
    # find the first position of the keyword
    position = editor.findText(FINDOPTION.WHOLEWORD, 0, end_position, word)
    # as long as the keyword is found somewhere
    while position is not None:
        # get the line which contains the match
        _new_line = editor.getLine(editor.lineFromPosition(position[0]))
        # check that the line hasn't been added yet
        if _new_line not in new_file_content:
            # append it to the new file content and go on with the next keyword
            new_file_content += _new_line
            break
        # otherwise look for the next occurrence after the current match
        position = editor.findText(FINDOPTION.WHOLEWORD, position[1]+1, end_position, word)

Sometimes I can't see the wood for the trees. :-)

Cheers
Claudia

Viktoria Ontapado

@Claudia-Frank

Bingo!
My default encoding is UTF-8 without BOM, so I’m not sure what caused this alteration, but your guess was right: the keywords file was in UTF-8-BOM.

I converted it to UTF-8 without BOM and that solved the issue; the empty line is no longer needed for the process to work flawlessly, thank you!

My explanation worked as well, because your modified 2nd script does exactly what I was looking for; I’m obliged. (I’m happy about that initial misunderstanding anyway, because this way I have 5 scripts for different tasks. :-)

Finally, to clear up something:

We changed some lines due to this UTF-stuff like:
console.write('word:{}\nlength:{}\n'.format(word.encode('utf-8'),len(word.encode('utf-8')))) and
_keywords = [line.strip() for line in f if len(line.strip()) > 0]

in the scriptbase.
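
For reference, the BOM problem discussed above can also be avoided at the point where the keyword file is read. The following is only a minimal sketch, not the code from the actual script: the function name `load_keywords` and the `path` parameter are hypothetical. It relies on Python's `utf-8-sig` codec, which decodes UTF-8 and drops a leading BOM if one is present. A BOM is not whitespace, so `strip()` alone leaves it glued to the first keyword.

```python
import io


def load_keywords(path):
    """Return the stripped, non-empty lines of a keyword file (hypothetical helper)."""
    # 'utf-8-sig' decodes UTF-8 and removes a leading BOM if there is one,
    # so files saved as "UTF-8" and "UTF-8-BOM" give the same keywords.
    with io.open(path, encoding='utf-8-sig') as f:
        # same filter as in the thread: strip whitespace, skip empty lines
        return [line.strip() for line in f if len(line.strip()) > 0]
```

With this approach the keyword file would not need to be converted first, although converting it, as done here, works just as well.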

Now that we figured out the BOM-issue, can I return to the initial script and its variants or for safety’s sake should I rather keep the version with these modified lines? What do you suggest?

Claudia Frank

@Viktoria-Ontapado

Use the current (modified) version, because

the _keywords = [line.strip() …

line is needed; don’t change it.

The two lines starting with console.write can be deleted if you wish, but I would keep them
and comment them out (right-click on the line and use the context menu to comment), because
if one day something doesn’t work you can just uncomment them again and you have
some simple debugging, which could help to find out what the cause is.

Glad to see that it finally works, and I hope you can benefit from it.

Cheers
Claudia

Viktoria Ontapado

@Claudia-Frank

All right, I’ll follow your suggestions.

Again, I’d like to say a big thank you from the bottom of my heart; you provided invaluable help and I’m beyond grateful.

Have a nice week,
Viktória

guy038

Hi, @viktoria-ontapado,

Sorry, but, the last two days, I was far away from my beloved laptop :-D ( Actually, it’s quite an antiquated machine !! )

I’m pleased, Viktoria, that @claudia-frank succeeded in creating your four customized Python scripts. She did great work, indeed :-))

From your problem, it’s quite easy to understand that dealing with scripts is much more powerful than running a couple of regexes !

However, just for fun, I tried to imagine how to solve your case #2 with regexes ! So, in a new tab, copy the 8 regexes, corresponding to the eight keywords, followed by your 20-sentence example :

(?i-s)(?!.+#)(\bstecken\b)(?s)(.*)
(?i-s)(?!.+#)(\bbesuchen\b)(?s)(.*)
(?i-s)(?!.+#)(\bdie Antwort\b)(?s)(.*)
(?i-s)(?!.+#)(\bfertig\b)(?s)(.*)
(?i-s)(?!.+#)(\bzuletzt\b)(?s)(.*)
(?i-s)(?!.+#)(\bdie Polizei\b)(?s)(.*)
(?i-s)(?!.+#)(\bdas Glück\b)(?s)(.*)
(?i-s)(?!.+#)(\bauch\b)(?s)(.*)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sie stecken in Schwierigkeiten.
Komm mich besuchen.
Stecken Sie Ihre Waffe ins Halfter!
Das Glück war das die Polizei die Antwort kannte.
Die Antwort gefällt mir.
Ich bin jetzt fertig.
Er lachte zuletzt.
Das Glück hat ihn verlassen, die Polizei verfolgt ihn.
Wann hast du sie zuletzt gesehen?
Ich liebe dich.
Ich bin auch siebzehn.
Ich bin auch achtzehn.
Rufen Sie die Polizei!
Ich bin auch zwanzig.
Wann kann ich dich besuchen?
Das Glück war ihm hold.
Mach das fertig.
Das Glück ist nicht so launenhaft.
Steht mir dieses Kleid?
Ich esse.

Select the first regex ( (?i-s)(?!.+#)(\bstecken\b)(?s)(.*) )
Open the Replace dialog ( Ctrl + H ) and select the Regular expression search mode
Inside the replacement zone, type in the replacement string \1#\2
Click on the Replace All button

=> Only one replacement is performed : A # symbol is added, right after the word stecken

Now, select the second regex (?i-s)(?!.+#)(\bbesuchen\b)(?s)(.*)
To UPDATE the Replace dialog, which is unfocused, just use the Ctrl + H shortcut again ! ( Nice trick, indeed ! )
Click, again, on the Replace All button

=> This time, a # symbol is added at the end of the word besuchen, in the second sentence

Go on, selecting the third regex (?i-s)(?!.+#)(\bdie Antwort\b)(?s)(.*)

And so on …

Once the 8 regexes have been executed, you should get the text below :

Sie stecken# in Schwierigkeiten.
Komm mich besuchen#.
Stecken Sie Ihre Waffe ins Halfter!
Das Glück war das die Polizei die Antwort# kannte.
Die Antwort gefällt mir.
Ich bin jetzt fertig#.
Er lachte zuletzt#.
Das Glück hat ihn verlassen, die Polizei# verfolgt ihn.
Wann hast du sie zuletzt gesehen?
Ich liebe dich.
Ich bin auch# siebzehn.
Ich bin auch achtzehn.
Rufen Sie die Polizei!
Ich bin auch zwanzig.
Wann kann ich dich besuchen?
Das Glück# war ihm hold.
Mach das fertig.
Das Glück ist nicht so launenhaft.
Steht mir dieses Kleid?
Ich esse.

=> The keyword matched, on each line, is easily visible, thanks to the # symbol, added to that keyword !

Finally, to delete all lines which do NOT contain a # symbol, as well as the symbol itself, perform the S/R :

SEARCH ^[^#\r\n]+\R|#

REPLACE Empty

Sie stecken in Schwierigkeiten.
Komm mich besuchen.
Das Glück war das die Polizei die Antwort kannte.
Ich bin jetzt fertig.
Er lachte zuletzt.
Das Glück hat ihn verlassen, die Polizei verfolgt ihn.
Ich bin auch siebzehn.
Das Glück war ihm hold.

Et voilà !

Notes :

The general regex is (?i-s)(?!.+#)(\bKeyWord\b)(?s)(.*)
As usual, the modifiers (?i-s) force a case-insensitive search and tell the regex engine that the dot matches a single standard character only, not End of line characters
The part (?!.+#) is a negative look-ahead, which means that an overall match is possible only if NO # symbol can be found, further on, on the current line
If so, the regex (\bKeyWord\b) looks for the exact word “KeyWord”, stored as group 1, due to the parentheses
Then, the modifier (?s) implies that, from now on, the dot matches any single character, even End of line characters
Finally, the part (.*) stores, as group 2, all the text after the current keyword, till the very end of the file. As the match runs to the end of the file, Replace All can perform only one replacement
In replacement, the current keyword \1 is rewritten, followed by a # character, followed, itself, by the remainder of the text \2
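
The notes above can be tried outside Notepad++ with a rough Python 3 sketch. It is an adaptation, not the same regexes: Python's `re` module has no `\R` and does not accept a `(?s)` switch in the middle of a pattern, so the "match up to the end of the file" trick is swapped for `count=1`, and `\R` is spelled out as `\r\n|\n|\r`. The function names `mark_first_unused` and `keep_marked_lines` are hypothetical.

```python
import re


def mark_first_unused(text, keywords):
    """Append '#' to the first occurrence of each keyword that has
    no '#' further on in the same line."""
    for keyword in keywords:
        # (?!.+#) : no '#' later on the current line (dot stops at line end)
        # \b...\b : whole-word match; IGNORECASE plays the role of (?i)
        pattern = re.compile(r"(?!.+#)(\b" + re.escape(keyword) + r"\b)",
                             re.IGNORECASE)
        # count=1 stands in for the (?s)(.*) tail, which limits
        # Replace All to a single replacement per keyword
        text = pattern.sub(r"\1#", text, count=1)
    return text


def keep_marked_lines(text):
    """Delete every line without '#', then delete the '#' marks themselves."""
    return re.sub(r"(?m)^[^#\r\n]+(?:\r\n|\n|\r)|#", "", text)


if __name__ == "__main__":
    sample = (
        "Sie stecken in Schwierigkeiten.\n"
        "Stecken Sie Ihre Waffe ins Halfter!\n"
        "Ich bin auch siebzehn.\n"
        "Ich bin auch achtzehn.\n"
    )
    marked = mark_first_unused(sample, ["stecken", "auch"])
    print(keep_marked_lines(marked))
```

As with the original S/R, a last line without a trailing line break is not removed by the clean-up step, because the pattern requires a line ending after the unmarked text.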

To end, @viktoria, I hope that you’re quite aware that the order of search of the different keywords may change the sentences found by the scripts !

Indeed, if, for instance, the string das Glück is searched, BEFORE the string die Polizei, the sentence Rufen Sie die Polizei! will be found and NOT the sentence Das Glück war ihm hold. !! Yeah, not easy to get all pieces of information, in one go :-((
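
This order effect can be reproduced with a small stand-alone Python 3 sketch. It only imitates the idea of the script (first line per keyword which has not been taken yet) on a plain list of lines, using a case-insensitive whole-word match; it does not use the Notepad++ API, and the function name `first_unused_lines` is hypothetical.

```python
import re


def first_unused_lines(lines, keywords):
    """For each keyword, in order, pick the first line containing it
    that has not already been picked for an earlier keyword."""
    picked = []
    for keyword in keywords:
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for line in lines:
            if pattern.search(line) and line not in picked:
                picked.append(line)
                break
    return picked


lines = [
    "Das Glück war das die Polizei die Antwort kannte.",
    "Das Glück hat ihn verlassen, die Polizei verfolgt ihn.",
    "Rufen Sie die Polizei!",
    "Das Glück war ihm hold.",
]

# die Polizei BEFORE das Glück : the last pick is "Das Glück war ihm hold."
print(first_unused_lines(lines, ["die Antwort", "die Polizei", "das Glück"]))

# das Glück BEFORE die Polizei : the last pick is "Rufen Sie die Polizei!"
print(first_unused_lines(lines, ["die Antwort", "das Glück", "die Polizei"]))
```

In both runs the first two sentences are picked; only the third pick depends on which of the two keywords is searched first.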

Best Regards,

guy038

Viktoria Ontapado

@guy038

Very impressive, guy038, thank you very much. Though I have my beautiful scripts now thanks to Claudia, I worked through your regex-based solution. Along with your notes, so much can be learnt, really.

Take care,
Viktória