Replacing Duped Words across a block of text, respecting {}

Marc Lalonde

A good example I found a bit further along in the file, showing names that are sometimes in quotes " ":
male_names = { |"Barom Reachea"| "Noreay Ramathipatei" "Soriyotei" "Thommo Reachea" "Srey Sukonthor" "Ang Chan" "Reachea Ramathipatei" |"Barom Reachea"| "Chey Chettha" "Ney Khan" "Preah Ram" "Keo Hua" "Outey Reachea" "Dharmaraja" "Padumaraja" "Ramathipadi" "Satha" "Ream Reachea" |"Narayanaraja"| "Ponhea Yat" |"Narayanaraja"| "Sri Raja" "Rajadhiraja" "Dharmarajadhiraja" "Damkhat Sukonthor" "Reamea Chungprey" "Keo Ban On" "Ponhea Yor" "Ponhea An" "Ponhea Nhom" "Srei Soriyopear" "Udayaraja" "Ponhea To" "Ang Non" "Ponhea Chan" "Ang Sur" "Ang Chea" "Ang Nan" "Ang Sor" "Ang Yong" "Ang Em" "Ang Tham" "Ang Sngoun" "Ang Tong" "Ang Ton" "Ang Eng" "Ang Duong" }
Everything in " " is supposed to be a unique string. In this case it's spread across 6 lines, and there are 2 dupes scattered about this block (marked with | on either side | for readability).

Also, I wouldn’t mind if I had to select each block one by one. I do have an external program that at least points out where each dupe is, but having to look for each case by hand is making it hard to stay focused.

guy038

Hello, @marc-lalonde and All,

OK, Marc, I've already found the right regex to match the gap between two duplicate names of the same block !

Now, I need additional information :-)

Given, for instance, the text, below, with the Barom Reachea duplicate name :

male_names = { "Ang Chan" "Barom Reachea" "Soriyotei" "Barom Reachea" "Sri Raja" }

Do you expect, AFTER replacement, the text A :

male_names = { "Ang Chan" "Barom Reachea" "Soriyotei" "Sri Raja" }

OR the text B :

male_names = { "Ang Chan" "Soriyotei" "Barom Reachea" "Sri Raja" }

In addition, could you tell me which text cases may happen ? ( From your previous post, I understood that cases 1 and 2 do occur ! )

Case 1 : male_names = { "Ang Chan" "Barom Reachea" "Soriyotei" "Barom Reachea" "Sri Raja" }

Case 2 : male_names = { Ang Chan Barom Reachea Soriyotei Barom Reachea Sri Raja }

Case 3 : male_names = { "Ang Chan" "Barom Reachea" Soriyotei Barom Reachea "Sri Raja" }

Case 4 : male_names = { "Ang Chan" Barom Reachea Soriyotei "Barom Reachea" "Sri Raja" }

See you later,

Best Regards,

guy038

Marc Lalonde

Just saw this, ty for responding. In this case, B would work nicer, and it is case 1, more or less: sometimes single words are in quotes, most times they are not, but a name can potentially appear once quoted and once unquoted (which count as duplicates)

so

male_names = { "Ang Chan" "Barom Reachea" "Soriyotei" "Barom Reachea" "Sri Raja" Soriyotei }

would need both the “Barom Reachea” extra and one of the two Soriyotei (preference leaning to quoted one) removed

guy038

Hi, @marc-lalonde and All,

would need both the “Barom Reachea” extra and one of the two Soriyotei (preference leaning to quoted one) removed

I’m really sorry but I did not fully understand this sentence ! Do you mean that you want to delete the two occurrences of the "Barom Reachea" name, in your example ?

So, do you mind using the simple syntax, below, which clearly shows what you expect ?

BEFORE : male_names = { "Ang Chan" "Barom Reachea" "Soriyotei" "Barom Reachea" "Sri Raja" Soriyotei }

AFTER : ???

On the other hand, what to do if there are more than 2 duplicates ? So, again, with the same syntax :

BEFORE : male_names = { "xxxxxx" "Barom Reachea" "Soriyotei" Barom Reachea "yyyyyy" "Barom Reachea" "zzzzzz" Soriyotei }

AFTER : ???

Finally, are the names, within double quotes or no, always separated with a single space character only ? Or is this kind of list possible ( with multiple spaces / tabulations, even 0 ), like below ?

male_names = { "xxxxxx"          "Barom Reachea" "Soriyotei" Barom Reachea	"yyyyyy" "Barom Reachea""zzzzzz" Soriyotei }

Just note that, according to the general behaviour of the N++ regex engine, the easiest way would be to delete all duplicates but the last one, whether or not the name is surrounded with double quotes !

So, given the previous example :

male_names = { "xxxxxx" "Barom Reachea" "Soriyotei" Barom Reachea "yyyyyy" "Barom Reachea" "zzzzzz" Soriyotei }

we would obtain, AFTER replacement :

male_names = { "xxxxxx" "yyyyyy" "Barom Reachea" "zzzzzz" Soriyotei }

Cheers,

guy038

Marc Lalonde

ok let me try again, say we have this string

BEFORE : male_names = { "Ang Chan" "Barom Reachea" "Soriyotei" "Barom Reachea" "Sri Raja" Soriyotei }

AFTER : male_names = { "Ang Chan" "Barom Reachea" "Sri Raja" Soriyotei }

Marking for readability

{ |"Ang Chan"| |"Barom Reachea"| |"Soriyotei"| |"Barom Reachea"| |"Sri Raja"| |Soriyotei| }
{ |"Ang Chan"| |"Barom Reachea"| |"Sri Raja"| |Soriyotei| }

Basically, it's read inside the | | marks: if the content inside two pairs of marks is the same, then it's a duplicate. There is only 1 space between each item, and in any case of duplication only one copy is kept
Eg 123 "123" 123 654 123 123 987 12
only one 123 would be kept
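
For anyone who would rather script this rule than apply it by hand, here is a minimal Python sketch. It is not from the thread: the function name dedupe_keep_last is made up, it assumes unquoted names never contain spaces, and it keeps the last copy of each name (in line with option B above), treating a quoted and an unquoted copy as the same name.

```python
import re

# A token is either a double-quoted run or a run of non-blank characters.
# Assumption: unquoted names are single words (no embedded spaces).
TOKEN = re.compile(r'"[^"]*"|\S+')

def dedupe_keep_last(items: str) -> str:
    """Keep only the last occurrence of each name; quotes are ignored when comparing."""
    seen = set()
    kept = []
    for tok in reversed(TOKEN.findall(items)):
        key = tok.strip('"')          # "123" and 123 count as the same name
        if key not in seen:
            seen.add(key)
            kept.append(tok)
    return " ".join(reversed(kept))

print(dedupe_keep_last('123 "123" 123 654 123 123 987 12'))
```

Only one 123 survives in the printed line, and 12 is left alone because it is a different name.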

guy038

@marc-lalonde and All,

Perfect I’ve got the right regex to get what you want ! Just follow the few steps, below :

First, of course, do a backup of your file !
Open your file in N++
Open the Replace dialog ( Ctrl + H )
Type in the regex (?-is)\x20(?|"(\w[\w ]+\w)"|(\w[\w ]+\w))(?=\x20(.+?\x20)?("?)\1\3\x20) , in the Find what: zone
Leave the Replace with: zone EMPTY
Tick the Wrap around and the Regular expression options, ONLY
Click on the Replace All button or, repeatedly, on the Replace button

Et voilà !

Remarks :

I forgot to ask you about the case of the names : I mean, for instance, are the names Devi, DEVI, devi, DeVi,… all identical ? If so, just change the modifiers part (?-is), at beginning of the regex, into (?i-s)
I assume that each block begins with an opening brace, followed with a space character and ends with a space character followed with a closing brace
This regex correctly deletes the first occurrence of two consecutive identical names, as below :

male_names = { "xxxxxx" "Barom Reachea" Barom Reachea "zzzzzz" }

Note that it’s not very easy to visualize the entire gap between two duplicates, because of the overlapping phenomena, which may occur, like in :

male_names = { "xxxxxx" "Barom Reachea" "Soriyotei" Barom Reachea "yyyyyy" "Barom Reachea" "zzzzzz" Soriyotei }

So just use the Find dialog, with the regex above, which will match and select the first occurrence of duplicate names, before performing the Replace operation

Next time, Marc, I’ll give you some information on the regex itself, if you want to !

Best Regards,

guy038

P.S. :

Maybe this complicated regex can be shortened ! Anyway, this one is working fine :-)

guy038

Hello, @marc-lalonde and All,

I’m answering myself ! For equivalent matches, I could simplify the regex to this shorter version :

SEARCH (?-is)\x20("?)(\w[\w ]*\w)\1(?=\x20(?:.+?\x20)?("?)\2\3\x20)

BTW, this new regex handles names with a minimum of two word characters ( Only one-letter names are not supported !! )
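
The shorter regex can also be tried outside N++. The sketch below is an assumption on my side, not part of the N++ procedure: it uses Python's re module, drops the leading (?-is) because Python is case-sensitive and keeps the dot on one line by default, and the name remove_dupes_single_line is hypothetical.

```python
import re

# The shortened single-line pattern from this post, without the (?-is) prefix:
# Python's re is already case-sensitive and "." does not cross line breaks.
DUPE = re.compile(r'\x20("?)(\w[\w ]*\w)\1(?=\x20(?:.+?\x20)?("?)\2\3\x20)')

def remove_dupes_single_line(text: str) -> str:
    """Delete each name that shows up again, quoted or not, later on the same line."""
    return DUPE.sub("", text)

before = ('male_names = { "xxxxxx" "Barom Reachea" "Soriyotei" Barom Reachea '
          '"yyyyyy" "Barom Reachea" "zzzzzz" Soriyotei }')
print(remove_dupes_single_line(before))
```

On the earlier example with three Barom Reachea and two Soriyotei, only the last copy of each name is kept.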

Cheers,

guy038

Marc Lalonde

While it did catch 201 instances, it doesn’t catch the remaining ones if the duplicates are on different lines… any way of adjusting it to do so?

Example

Line 1: 123 321 654
Line 2: 123 951 753 "123"
Line 3: 456 852 "753 123"
Line 4: 123 "321" 852

123, 321 are both dupes in this example

guy038

Hi @marc-Lalonde

Use this regex, below :

SEARCH (?s-i)\x20("?)(\w[\w ]*\w)\1(?=\x20(?:.+?\x20)?("?)\2\3\x20)

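Compared to my previous one, I’ve just replaced the part (?-is) with the syntax (?s-i), so that the dot . also matches line-break characters and the gap between two duplicates can span several lines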

So assuming your last example ( which needed some re-formatting, to be sure that names, with double quotes or not, are surrounded with space chars ! ), and with, BTW, 3 duplicates : 123, 321 and 852 !

{ 123 321 654 }
{ 123 951 753 "123" }
{ 456 852 "753 123" }
{ 123 "321" 852 }

AFTER replacement, we get the new text below :

{ 654 }
{ 951 753 }
{ 456 "753 123" }
{ 123 "321" 852 }
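
The same experiment can be reproduced with Python's re module, as a rough sketch only: the (?s) modifier is expressed there with the re.DOTALL flag, and the function name remove_dupes_multiline is hypothetical.

```python
import re

# Same pattern as above; re.DOTALL plays the role of (?s), so ".+?" may cross lines.
DUPE = re.compile(r'\x20("?)(\w[\w ]*\w)\1(?=\x20(?:.+?\x20)?("?)\2\3\x20)',
                  re.DOTALL)

def remove_dupes_multiline(text: str) -> str:
    """Delete each name that shows up again anywhere further down the text."""
    return DUPE.sub("", text)

before = "\n".join([
    '{ 123 321 654 }',
    '{ 123 951 753 "123" }',
    '{ 456 852 "753 123" }',
    '{ 123 "321" 852 }',
])
print(remove_dupes_multiline(before))
```

Note that nothing in this pattern stops at a closing brace, so a name is compared against every later line, whichever block it belongs to.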

I hope, this last version is the good one ;-))

Best Regards

guy038

Marc Lalonde

That's very nearly got it, it just needs to detect the closing bracket

EG
{
Line 1: 123 654
line 2: 321 456
}
{
Line 1: 123 654
Line 2: 582 456 123
}

Would only hit on the 123 pair in the second set

Ie
{
Line 1: 123 654
line 2: 321 456
}
{
Line 1: 654
Line 2: 582 456 123
}

Marc Lalonde

I do have to say thank you though. If it's not possible to get it to look inside each set of { } as a unique group in one mass sweep, then at the very least it's made the job that much more manageable. Some 300 entries were caught by the scripts above, which is 300 I don’t have to look for. So thanks again there, guy038

guy038

Hi, Marc,

I don’t give up :-)) Just making numerous tests. It should be OK, very soon…

guy038

Marc Lalonde

Would it be better if you could see exactly what I'm trying to work with to make the script? I would be willing to link via TeamViewer (https://www.teamviewer.com/en/) so you have a better idea of exactly what is needed. I've still got some 1100 duplicates to hunt down. FYI, I'm still stuck behind the 2 rep wall of only one post every 20 mins >.<

guy038

Hello, @marc-Lalonde and All,

I think that my new version should be very close to your needs ! Just try it out :-)

So, assuming the following hypotheses :

Names consist of one word of at least two word characters, or of several words separated with, at least, one space character, and are possibly surrounded by double-quotes
Names are preceded with, at least, one space char or are located at the beginning of a line
Names are followed with , at least, one space char or are located at the end of a line
Each block of names is embedded in a {.........} block, in a single line or split on several lines
If a name is preceded or followed with a brace, one space character, at least, must separate them

Then, the correct version for removing duplicates names of each block, only, should be :

SEARCH (?s-i)(?:^|\x20+)("?)(\w[\w ]*\w)\1(?=(?:\x20+|\R)(?:[^{}]*(?:\x20+|\R))?("?)\2\3(?:\x20|\R))

Remark :

Compared to my previous version, this regex is more complex. So, for a better understanding, here is the equivalent version, with the free-spacing regex mode, which allows to insert non-significant space characters, in the regex !

SEARCH (?xs-i) (?: ^|\x20+) ("?) (\w[\w ]*\w) \1 (?= (?: \x20+|\r?\n) (?: [^{}]* (?: \x20+|\r?\n) )? ("?) \2 \3 (?: \x20|\r?\n) )

So given, for instance, the initial text below :

{
123 123 654
 321 456
 999 852 666 "852"
 123
}

{
 123 654 222 333 999
 852 "999" 456 123
 000 "123 654" "999"
}

{
 "123 654" 555 "111"
 852 111 "123" "000"
 999 "333" "123 654" 000 333
}

{
111 789
789
789 222
333 789 444
}
{
555 "789"
"789"
"789" 666
777 "789" 888
}

{            3456      "3456"      3456       }

{ "6789" 6789 "6789" }

{
 "456" "12 34 56" 456 "123" "456" 123 "12 34 56" 789
}

We get the following text :

{
 654
 321 456
 999 666 "852"
 123
}

{
 222 333
 852 456 123
 000 "123 654" "999"
}

{
 555
 852 111 "123"
 999 "123 654" 000 333
}

{
111

 222
333 789 444
}
{
555

 666
777 "789" 888
}

{      3456       }

{ "6789" }

{
 "456" 123 "12 34 56" 789
}

Waooooooo ! This regex totally drained me ;-))

Cheers,

guy038

Marc Lalonde

That got almost all of my goal; one more small effort should finish this. Here is a screenshot of the full file structure. The only thing I see it missing right now is names at the very start of a line (see the right side of the screenshot), which are 3-4 tabs in. I didn't think that would be an issue, but it seems to be. Other than that, it got all 19 in this section alone.

https://imgur.com/a/MO0dV

One hell of a job so far, this will get me able to finish this tonight likely (even with just this script part) myself and i very much appreciate the help.

Marc Lalonde

At the very least, as I just ran the extended script above on all my files: 1100 errors down to all of 115 in one click. I owe you a drink :D

The only things it's missing right now are names at the start of lines, 2-3 tabs in (see the screenshot above), and names with punctuation mid-word, e.g. Abu’l-Ghazi

But if it's just 115 errors, I can handle that without much more work :D

Again, I can't thank you enough, and I'll be sharing this script with a fellow person who has to deal with a very similar issue.

guy038

@marc-Lalonde and All,

Ah, I see ! So, I just changed the \x20+ syntax with the \h+ one, to include tabulations and No-Break space characters as possible separators

Secondly, in order to consider the apostrophe ' and the hyphen - as possible word character, I changed the syntax (\w[\w ]*\w) with the (\w[\w '-]*\w) one !

So, the final version of the regex is, from now on :

SEARCH (?s-i)(?:^|\h+)("?)(\w[\w '-]*\w)\1(?=(?:\h+|\R)(?:[^{}]*(?:\h+|\R))?("?)\2\3(?:\h|\R))

and, with the free-spacing mode, which allows to identify the different parts of this regex, it gives :

SEARCH (?xs-i) (?: ^|\h+) ("?) (\w[\w '-]*\w) \1 (?= (?: \h+|\r?\n) (?: [^{}]* (?: \h+|\r?\n) )? ("?) \2 \3 (?: \h|\r?\n) )
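
The free-spacing idea carries over to other engines. Below is a sketch with Python's re.VERBOSE flag, under clearly stated assumptions: Python has neither \h nor \R, so the placeholder <H> is replaced with the class [\x20\t\xa0] (space, tabulation, No-Break space) as a stand-in for \h, \r?\n stands for the line break, and the names H, COMPACT and SPACED are made up for the illustration.

```python
import re

H = r"[\x20\t\xa0]"  # assumed stand-in for \h, which Python's re does not have

# One-line form of the final regex, with <H> as a placeholder for \h.
COMPACT = re.compile(
    r"""(?:^|<H>+)("?)(\w[\w '-]*\w)\1(?=(?:<H>+|\r?\n)(?:[^{}]*(?:<H>+|\r?\n))?("?)\2\3(?:<H>|\r?\n))"""
    .replace("<H>", H),
    re.MULTILINE,
)

# Same regex in free-spacing mode: blanks outside classes are ignored.
SPACED = re.compile(
    r"""
    (?: ^ | <H>+ )                        # line start or leading blanks
    ("?)                                  # group 1: optional opening quote
    ( \w [\w\x20'-]* \w )                 # group 2: the name to delete
    \1                                    # closing quote, if one was opened
    (?=                                   # the twin must follow, not consumed
        (?: <H>+ | \r?\n )
        (?: [^{}]* (?: <H>+ | \r?\n ) )?  # gap that never crosses a brace
        ("?) \2 \3                        # group 3: quote around the twin
        (?: <H> | \r?\n )
    )
    """.replace("<H>", H),
    re.MULTILINE | re.VERBOSE,
)

sample = "{\n\t\tAbu'l-Ghazi Timur\n\t\tOmar \"Abu'l-Ghazi\"\n}\n"
print(repr(COMPACT.sub("", sample)))
print(repr(SPACED.sub("", sample)))
```

Both forms delete the first, tab-indented Abu'l-Ghazi together with its leading tabulations. The sample uses the straight apostrophe ' ; a typographic apostrophe would not be matched by this class.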

Just tell me if other characters, than the apostrophe, the hyphen and the space characters, may exist in your list of names :-))

Cheers

guy038

P.S. :

Note that, with the free-spacing regex (?x), the \R syntax is forbidden ! So, I changed it by the usual \r?\n syntax !

Marc Lalonde

Luckily, late last night I realized I had somehow blanked a file relating to this, so I had a fresh copy to hit with the most recent revision. It tagged 1600 dupes. Running my validator over it catches four different kinds of occurrence that it missed, totaling just 20 errors.

Instances it failed.
__
“Ko cheng”

location, start of line after tabs, its twin was second from end of same line
__
This one i imagine will be a bit tricky :/ (if even possible)

ZhanYong,

Location anywhere, its twin has the Y lowercase instead.
__
'Abd

Location anywhere, culprit probably punctuation at start
__
cont.

Location anywhere, Probably reverse reason as above
__
Cheers. And I've passed the one from last night on to a few people; they send their thanks to you for this. It saves so much time

guy038

Hello, @marc-lalonde,

OK ! So, I changed the part of the regex , (\w[\w '-]*\w), responsible for matching the name, that is to be deleted. The new regex is ['.,]?(\w[\w '.-]*\w)['.,]?, which means that a name :

Begins with a word character, possibly preceded by an apostrophe ( ' ), a dot ( . ) or a comma (,) symbols
Contains, afterwards, a sequence, possibly empty, of word characters or an apostrophe ( ' ) , a dot ( . ), a hyphen ( - ) or a space symbol
And ends with a word character, possibly followed by an apostrophe ( ' ), a dot ( . ) or a comma ( , ) symbols

Only the inner part, beginning and ending with a word character, is considered as the group 2, which must occur, further on, a second time. Note also, that names, with some leading or trailing symbols, may be surrounded, again, by double quotes, thanks to the syntax : ("?)['.,]?(\w[\w '.-]*\w)['.,]?\1

On the other hand, it’s important to point out that the duplicate name matched, with the regex ("?)['.,]?\2['.,]?\3 :

Can have leading or trailing symbols, different from the first occurrence, to be deleted
Can be surrounded, or not, with double quotes, independently, too, from the first occurrence, to be deleted

To end with, the names, with a single double-quote ( as "xxxx or yyyy" ) are considered as invalid entities. Indeed, let’s suppose the initial text, below :

{ 000 "123 555 456" 999 "123 456" 789 }

If names must be surrounded with double-quotes, or not, we get the same text, as there is no duplicate :

{ 000 "123 555 456" 999 "123 456" 789 }

If names as "123 or 456" were allowed, we would get the wrong text, below :

{ 000 555 999 "123 456" 789 }

So Marc, the new regex, below, should, correctly, miss very few names ;-)) And, thus, get rid of the great majority of the duplicates !

SEARCH (?si)(?:^|\h+)("?)['.,]?(\w[\w '.-]*\w)['.,]?\1(?=(?:\h+|\R)(?:[^{}]*(?:\h+|\R))?("?)['.,]?\2['.,]?\3(?:\h|\R))

And, with the free-spacing mode, which allows to identify the different parts of this regex, it gives :

SEARCH (?xsi) (?: ^|\h+) ("?) ['.,]? ( \w[\w '.-]*\w ) ['.,]? \1 (?= (?: \h+|\r?\n) (?: [^{}]* (?: \h+|\r?\n) )? ("?) ['.,]? \2 ['.,]? \3 (?: \h|\r?\n) )

Oh, I forgot to say that the search, is, from now on, insensitive to the case, due to the modifiers syntax (?is), at beginning of the regex. So, assuming the text :

{ ZhanYong Zhanyong }

The first word would, indeed, be a duplicate of the second one and, thus, deleted !
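
A quick way to check this behaviour outside N++ is the Python sketch below. It rests on assumptions: re.IGNORECASE stands for the i modifier, the class [\x20\t\xa0] stands in for \h, \r?\n for \R, and the names H and DUPE_FINAL are hypothetical.

```python
import re

H = r"[\x20\t\xa0]"  # assumed stand-in for \h (space, tab, No-Break space)

# Final pattern: optional leading / trailing ' . , around the name,
# compared without regard to case thanks to re.IGNORECASE.
DUPE_FINAL = re.compile(
    r"""(?:^|<H>+)("?)['.,]?(\w[\w '.-]*\w)['.,]?\1(?=(?:<H>+|\r?\n)(?:[^{}]*(?:<H>+|\r?\n))?("?)['.,]?\2['.,]?\3(?:<H>|\r?\n))"""
    .replace("<H>", H),
    re.IGNORECASE | re.MULTILINE,
)

for line in ("{ ZhanYong Zhanyong }", "{ cont. Smith cont }", "{ 'Abd Karim Abd }"):
    print(DUPE_FINAL.sub("", line))
```

In each line, the first copy is removed even though it differs from its twin by case or by a leading or trailing symbol.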

Best Regards,

guy038

Marc Lalonde

Since I finished my files, I just took the main original version and first ran it through the validator: 1876 errors. After running the script over it and then the validator again, it got every single one. The only minor issue, which doesn’t really have to be fixed, is that it strips the last two closing } brackets at the very end of the file; that takes all of 5 seconds to re-add, so I consider this a completed script. I very much appreciate the help over the last 24 hours; it probably saved me double that, if not more.