Finding and replacing all data between two points?
-
I hope you and your family had a good time! It’s been a long time, but my dad still misses going on field trips with me and my class when I was younger. Precious memories! :DD I’m glad he was with me when I fell in the fish pond at the local zoo!
Thank you for all of this! This is so close, aaaaaaahhhh! I just realized - this would’ve been a lot easier if I’d just linked one of the transcript files:
https://drive.google.com/open?id=141c3d3ZBOb_GG3Psb9eQaaiWUl-g-ATF
I followed what you said to do in your most recent post. Doing the first regex (to cut out the extra data I don’t need) works, but when I do the second regex (to turn the multi-line dialogue into mono-line dialogue), the problem from my previous post still happens:
https://drive.google.com/open?id=11ChQl7D9YZ80COaxyadjjBl-tIp5aTPJ
I tried doing the mono-line regex first. Then that one did its job and looked like your expected text… but when I did the extra data removal regex after, then that one didn’t work!
-
Hi, @joey-flaig and All,
This time, no trouble! I’ve found the right regex S/R, which is, BTW, much simpler than anything tried before!
As usual, having the exact data to work on is always the best solution!
After downloading your complete XML file, I noticed two things:
- Some dialogs do not begin with any tag <CLT ##> nor <CLT> => We cannot rely on these tags and should prefer to consider <ORIGINAL N°###> as an anchor :-))
- When a dialog is multi-line, it is always split into two lines!
More anecdotal: your file is a UNIX file, with the UCS-2 LE BOM encoding
So, as a summary:
- To remove the extra data, not needed, use the A S/R:
  - SEARCH (?s-i)^\h*</ORIGINAL\x20N°(\d+)>.+?</COMMENT\x20N°\1>\R
  - REPLACE Leave EMPTY
- To change the two-line dialogs into one-line dialogs, use the B S/R:
  - SEARCH (?-si)<ORIGINAL N°\d+>\R.+\K\R(?!-{60}|</ORIGINAL N°\d+>)
  - REPLACE \x20
IMPORTANT:
- With your present data, you may execute the B search/replacement first and, secondly, the A S/R ;-))
- I’ve assumed that your separation line always contains exactly 60 dashes. If not, change the value 60, between braces, in the search regex accordingly
- Remember to click on the Replace All button, exclusively, when running the B S/R!
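For anyone who wants to try the A S/R outside Notepad++, here is a minimal Python sketch of the same idea. It is only an approximation: Python's `re` module has no `\h` or `\R`, so they are written as `[ \t]` and an explicit line-break alternation, and `re.MULTILINE` is set so that `^` matches at every line start, which the A regex relies on. The function name `remove_extra_data` and the sample entry are hypothetical, made up for illustration.

```python
import re

# Python's re has no \R; this alternation is an assumed stand-in (CRLF, LF or CR).
EOL = r"(?:\r\n|\n|\r)"

# Approximation of the A search regex from the thread:
#   (?s-i)^\h*</ORIGINAL\x20N°(\d+)>.+?</COMMENT\x20N°\1>\R
# [ \t] only approximates \h (space and tab, not every horizontal blank).
A_REGEX = re.compile(
    r"^[ \t]*</ORIGINAL N°(\d+)>.+?</COMMENT N°\1>" + EOL,
    re.MULTILINE | re.DOTALL,
)


def remove_extra_data(text: str) -> str:
    """Delete the span from </ORIGINAL N°n> through </COMMENT N°n> and its line break."""
    return A_REGEX.sub("", text)


# Hypothetical entry, for illustration only.
sample = (
    "<SPEAKER N°001>KIYOTAKA ISHIMARU</SPEAKER N°001>\n"
    "<ORIGINAL N°001>\n"
    "Would you like to study with me, Makoto? Just the two of us?\n"
    "</ORIGINAL N°001>\n"
    "<JAPANESE N°001>\n"
    "苗木くん!\n"
    "</JAPANESE N°001>\n"
    "<TRANSLATED N°001>\n"
    "</TRANSLATED N°001>\n"
    "<COMMENT N°001>\n"
    "</COMMENT N°001>\n"
    + "-" * 60 + "\n"
)

cleaned = remove_extra_data(sample)
```

The backreference `\1` ties the closing `</COMMENT N°n>` to the same entry number as `</ORIGINAL N°n>`, so a match cannot run on into the next entry.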
NOTES on the new B regex:
- First, the part <ORIGINAL N°\d+>\R.+ looks for the complete line <ORIGINAL N°###> with its EOL character(s), followed by all the standard characters of the next line (1st line of dialog)
- Then, in the syntax \K\R, the \K discards everything matched so far, so the match now contains only the EOL characters of the 1st line of dialog
- But this match occurs ONLY IF the EOL chars are NOT followed by either a line of 60 dashes OR the string </ORIGINAL N°###>
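The B S/R can be sketched in Python in a similar way. This is an approximation under stated assumptions, not a drop-in equivalent: Python's `re` has no `\K`, so the text before the line break is captured in group 1 and written back; `\R` is spelled out as an alternation; and `.+` becomes `[^\r\n]+` because Python's `.` would otherwise swallow a `\r`. The function name `join_two_line_dialog` and its `dash_count` parameter are hypothetical names introduced here.

```python
import re

# Python's re has no \R; this alternation is an assumed stand-in (CRLF, LF or CR).
EOL = r"(?:\r\n|\n|\r)"


def join_two_line_dialog(text: str, dash_count: int = 60) -> str:
    """Join the 1st and 2nd dialog lines that follow an <ORIGINAL N°n> line.

    Approximates: (?-si)<ORIGINAL N°\\d+>\\R.+\\K\\R(?!-{60}|</ORIGINAL N°\\d+>)
    """
    pattern = re.compile(
        # Group 1 stands in for everything before \K: the tag line and the 1st dialog line.
        r"(<ORIGINAL N°\d+>" + EOL + r"[^\r\n]+)"
        # The line break to replace...
        + EOL
        # ...only if NOT followed by the dash separator or the closing tag.
        + r"(?!-{" + str(dash_count) + r"}|</ORIGINAL N°\d+>)"
    )
    # Put group 1 back and replace the line break with a single space.
    return pattern.sub(r"\g<1> ", text)


# Hypothetical two-line and one-line entries, for illustration only.
two_lines = (
    "<ORIGINAL N°001>\n"
    "Would you like to study with me, Makoto?\n"
    "Just the two of us?\n"
    "</ORIGINAL N°001>\n"
)
one_line = "<ORIGINAL N°002>\nHello.\n</ORIGINAL N°002>\n"

joined = join_two_line_dialog(two_lines)
untouched = join_two_line_dialog(one_line)
```

If the separator line in a given file has a different number of dashes, pass that number as `dash_count`, which mirrors changing the value between braces in the search regex.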
Best regards,
guy038
-
-
@guy038
I wish I’d thought to send you the files themselves at first, haha! You’re great!
The B S/R regex to turn two-line dialogues into one-line dialogues works like a charm!
However, the A S/R regex to remove the extra data doesn’t seem to work at all. I tried it in the order of B + A, then A + B, and even A by itself, but it’s not doing anything. I made sure to always use Replace All for both regexes. Is there a setting I shouldn’t have checked?
https://imgur.com/nIPe0w3
Also, if it matters - I’m using the 32 bit version of NPP [I had to in order to use the Combiner plugin]
-
Hello @joey-flaig and All,
Aaaah! As soon as I saw the picture of your text, in the background, I understood what happened ;-))
Of course, considering the first entry N°001 and supposing you’ve executed the B regex S/R first (46 occurrences replaced), I was expecting:
<SPEAKER N°001>KIYOTAKA ISHIMARU</SPEAKER N°001>
<ORIGINAL N°001>
Would you like to study with me, Makoto? Just the two of us?
</ORIGINAL N°001>
<JAPANESE N°001>
苗木くん! どうだ、これから一緒に自習しないか!?
</JAPANESE N°001>
<TRANSLATED N°001>
</TRANSLATED N°001>
<COMMENT N°001>
</COMMENT N°001>
But, seemingly, you’ve already changed the N°001 entry, as below:
<SPEAKER N°001>KIYOTAKA ISHIMARU</SPEAKER N°001>
<ORIGINAL N°001>
Would you like to study with me, Makoto? Just the two of us?</ORIGINAL N°001><JAPANESE N°001>
苗木くん!どうだ、これから一緒に自習しないか!?</JAPANESE N°001>
<TRANSLATED N°001>
</TRANSLATED N°001>
<COMMENT N°001>
</COMMENT N°001>
It is easy to see why this new text layout breaks the logic of the A search regex. Indeed, I initially assumed that the part </ORIGINAL N°001> was always at the beginning of a line!
So, the final S/R A becomes:
SEARCH (?s-i)(^)?\h*</ORIGINAL\x20N°(\d+)>.+?</COMMENT\x20N°\2>\R
REPLACE ?1:\r\n
Notes:
- This new A regex looks for the string </ORIGINAL N°###>, possibly preceded by blank characters, at any location of the current line, due to the optional syntax (^)?
- In replacement, 2 possibilities:
  - If the string </ORIGINAL N°###> is at the beginning of the current line, group 1 exists, so the conditional replacement ?1 writes the text before the : char, so… nothing
  - If the string </ORIGINAL N°###> is located elsewhere, the ^ assertion does not match. So group 1 is not defined and the conditional replacement ?1 writes the text after the : char, so a line break \r\n
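The conditional replacement can be imitated in Python as well. A minimal sketch, with the usual caveats: Python's `re.sub` has no `?1:` replacement syntax, so a replacement function checks whether group 1 took part in the match; `\h` and `\R` are again spelled out by hand. The function name `remove_extra_data_any_layout` and the `eol` parameter are hypothetical. The default `eol` follows the `\r\n` of the replacement shown in the thread, while the sample below passes `"\n"` because it uses Unix line breaks.

```python
import re

# Python's re has no \R; this alternation is an assumed stand-in (CRLF, LF or CR).
EOL = r"(?:\r\n|\n|\r)"

# Approximation of the final A search regex:
#   (?s-i)(^)?\h*</ORIGINAL\x20N°(\d+)>.+?</COMMENT\x20N°\2>\R
A_REGEX_V2 = re.compile(
    r"(^)?[ \t]*</ORIGINAL N°(\d+)>.+?</COMMENT N°\2>" + EOL,
    re.MULTILINE | re.DOTALL,
)


def remove_extra_data_any_layout(text: str, eol: str = "\r\n") -> str:
    """Emulate the replacement ?1:\\r\\n with a callback.

    Group 1 set   -> the closing tag started a line: replace with nothing.
    Group 1 unset -> the tag followed the dialog text: keep one line break.
    """
    def replacement(match: re.Match) -> str:
        return "" if match.group(1) is not None else eol

    return A_REGEX_V2.sub(replacement, text)


# Hypothetical entries in both layouts, for illustration only.
tag_after_text = (
    "<ORIGINAL N°001>\n"
    "Just the two of us?</ORIGINAL N°001><JAPANESE N°001>\n"
    "</JAPANESE N°001>\n"
    "<COMMENT N°001>\n"
    "</COMMENT N°001>\n"
    "NEXT\n"
)
tag_on_own_line = (
    "<ORIGINAL N°002>\n"
    "Hello.\n"
    "</ORIGINAL N°002>\n"
    "<COMMENT N°002>\n"
    "</COMMENT N°002>\n"
    "NEXT\n"
)

result_inline = remove_extra_data_any_layout(tag_after_text, eol="\n")
result_own_line = remove_extra_data_any_layout(tag_on_own_line, eol="\n")
```

In both layouts the dialog line ends up followed by exactly one line break, which is the point of making the replacement conditional.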
That’s all ;-)) This time, you should see the message Replace All: 82 occurrences were replaced
Remark: Of course, I also verified that your new text layout does not break the logic of S/R B!
Cheers,
guy038
-
-
@guy038
It works! Thank you so much for all the help, and for accommodating my requests and frequent regex-breaking! It means a lot to me, and you’ve saved me soooo much time and hand pain!! Is there any way I can make it up to you? I could draw you something :D
-
Hi Joey,
Thanks for your kind words ! You said :
Is there any way I can make it up to you?
Thanks, but I don’t need anything !! I’m just pleased that the last regexes are the good ones for your specific file !
This is the main point, indeed! Regexes are very, very, very sensitive to text layout. So, once I’ve built a regex for an OP, based on the examples provided, the OP should not add, delete or modify anything in the original text in the meantime, as the regex will probably not work anymore ;-))
As far as possible, anyone asking for regex solutions should consider all the text layouts that occur in the original file to modify ;-))
It’s generally not so obvious and, in my personal work, I simply create successive versions of the regex to get a final version which handles all reasonable cases!
I say reasonable (and not possible) because, sometimes, we can’t think of all the possibilities and, anyway, this could lead to a huge and useless regex ;-))
Best Regards
guy038