Removing Text Before and After dialogue.

Borderless Media

I want to remove everything before and after each line of Chinese dialogue so that only the Chinese text remains. I’d also like one blank line between them so they are readable.

This XML file does not appear to have any style information associated with it. The document tree is shown below.





<DCSubtitle Version="1.1">
<SubtitleID>da75c3b7-4f-401-d3a</SubtitleID>
<MovieTitle>Bricks</MovieTitle>
<ReelNumber>1</ReelNumber>
<Language>cmn-hans</Language>
<LoadFont Id="Font1" URI="Bricks_STT_1_cmn_R1.ttf"/>
<Font Id="Font1" Color="FFFFFFFF" Effect="border" EffectColor="FF000000" Size="42">
<Subtitle SpotNumber="1" TimeIn="00:02:18:146" TimeOut="00:02:20:115" FadeUpTime="0" FadeDownTime="0">
<Text HAlign="center" HPosition="0.0000" VAlign="bottom" VPosition="10.0000">这里只有一只</Text>
</Subtitle>
<Subtitle SpotNumber="2" TimeIn="00:02:20:208" TimeOut="00:02:22:083" FadeUpTime="0" FadeDownTime="0">
<Text HAlign="center" HPosition="0.0000" VAlign="bottom" VPosition="10.0000">它们造出来的时候是一对</Text>
</Subtitle>
<Font Italic="yes">卡·克登场</Font>
</Text>

I’m very new to Notepad++. Any help will be appreciated a lot!

PeterJones

@Borderless-Media said in Removing Text Before and After dialogue.:

I want to remove everything before and after each line of Chinese dialogue so that only the Chinese text remains. I’d also like one blank line between them so they are readable.

If you really want to delete all the tags, and just leave things that aren’t part of the tags, it’s not that hard to do with regex. So assuming you have a backup of your data, what I would suggest is:

Delete from each < to its corresponding > (assuming no <...> pair is nested inside another, which is normally the case in valid XML)
FIND = (?s)<.*?>
REPLACE = \r\n
SEARCH MODE = Regular Expression
REPLACE ALL
- this finds each smallest <...> pair and replaces it with a newline. This will likely leave multiple newlines between some pieces of text
Combine multiple newlines into one:
FIND = (\r\n)+
REPLACE = \r\n (if you just want a single line break) or \r\n\r\n (if you want double-spaced lines)
SEARCH MODE = Regular Expression
REPLACE ALL
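
For anyone who would rather script these two replacements than run them in the Replace dialog, here is a minimal Python sketch of the same idea. It is only an analogue: Python's re module is not the engine Notepad++ uses, the function name strip_tags is made up for illustration, and the sample is a shortened version of the posted XML with Windows (\r\n) line endings assumed.

```python
import re

def strip_tags(text: str, blank_line: bool = True) -> str:
    """Hypothetical helper mirroring the two Replace All steps."""
    # Step 1: replace each smallest <...> span with a line break.
    no_tags = re.sub(r"(?s)<.*?>", "\r\n", text)
    # Step 2: collapse every run of line breaks into a single separator
    # (two line breaks if a blank line between entries is wanted).
    separator = "\r\n\r\n" if blank_line else "\r\n"
    return re.sub(r"(\r\n)+", separator, no_tags)

# Shortened sample, assuming \r\n line endings as in a Windows file.
sample = (
    '<Subtitle SpotNumber="1">\r\n'
    '<Text HAlign="center">这里只有一只</Text>\r\n'
    '</Subtitle>\r\n'
    '<Subtitle SpotNumber="2">\r\n'
    '<Text HAlign="center">它们造出来的时候是一对</Text>\r\n'
    '</Subtitle>\r\n'
)

result = strip_tags(sample)
print(result)
```

The output still starts and ends with a separator, because the first and last things in the sample are tags; trimming the result is left out to keep the sketch close to the two steps as written.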

What this does do: gets rid of tags (i.e., the stuff between <...> pairs) but leaves all content.

What this does not do: verify whether the stuff that’s left is Chinese text. If you had Russian or Arabic or Hebrew or English or … elsewhere, it would still be there after this.
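
With the posted sample, the content of elements such as MovieTitle and Language ("Bricks", "cmn-hans") would survive the tag removal. One possible follow-up, sketched here in Python rather than in Notepad++, is to keep only the lines that contain at least one Chinese character. The range U+4E00 to U+9FFF (CJK Unified Ideographs) is an assumption about what counts as Chinese text, and keep_chinese_lines is a hypothetical name.

```python
import re

# Assumption: "Chinese" means the CJK Unified Ideographs block, U+4E00-U+9FFF.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")

def keep_chinese_lines(text: str) -> list[str]:
    """Return only the lines that contain at least one CJK ideograph."""
    return [line for line in text.splitlines() if CJK_CHAR.search(line)]

# What might be left of the sample once the tags are gone.
leftover = "Bricks\r\n1\r\ncmn-hans\r\n这里只有一只\r\n它们造出来的时候是一对\r\n"

kept = keep_chinese_lines(leftover)
print("\r\n\r\n".join(kept))
```

A line that mixes Chinese and other text is kept whole by this filter, so it is a coarse check rather than a guarantee that only Chinese characters remain.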


Terry R

@Borderless-Media said in Removing Text Before and After dialogue.:

I want to remove everything before and after each line of Chinese dialogue so that only the Chinese text remains. I’d also like one blank line between them so they are readable.

I think I may have the answer. I cannot claim all the credit: I looked at an old post by @guy038 to find the hex range of Chinese characters first. I then made an assumption based on the example you provided, namely that every group of Chinese characters starts with a Chinese character and ends just before the next < character. I did so because I noted that the last group of Chinese characters also contains what looks like a space (see the raised dot, ·).

Anyway, for what it’s worth, it did produce the desired result (including a blank line after each group of Chinese characters).

Using the Replace function and search mode set to “regular expression” we have
Find What: (([\x{4E00}-\x{9FFF}].+?)(?=<))|.\R?
Replace with: ?{1}${1}\r\n\r\n

For an explanation we have:
(([\x{4E00}-\x{9FFF}].+?)(?=<)) - find a Chinese character at the current position; if there is one, keep matching characters and stop just before the next <.
|.\R? - this is alternation, so if the first part didn’t find a Chinese character, we grab this (one) character and any EOL (end of line) that follows it.
?{1}${1}\r\n\r\n - in the replacement, if the first part of the find regex did match (group 1), we return that group of Chinese characters followed by the end-of-line twice: one carriage return & line feed to end the line, and a second one to create the blank line. The alternative has no parentheses around it (so no group is defined) because we want to consume (delete) those characters, not return them.
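
The same keep-or-delete logic can be sketched in Python for readers who want to try it outside Notepad++. This is an analogue only: Python's re module has no \x{...} escapes, no \R and no ?{1} conditional replacement, so the sketch uses \u escapes, an explicit line-break alternation and a callback function instead. The names PATTERN, keep_or_drop and extract_dialogue are made up for illustration.

```python
import re

# First alternative: a Chinese character, then as little as possible up to
# (but not including) the next "<". It is captured as group 1.
# Second alternative: any single character plus an optional line break.
PATTERN = re.compile(r"([\u4e00-\u9fff].+?)(?=<)|.(?:\r\n|\n|\r)?")

def keep_or_drop(match: re.Match) -> str:
    # Group 1 is set only when the first alternative matched.
    if match.group(1) is not None:
        return match.group(1) + "\r\n\r\n"  # keep the text, add a blank line
    return ""                               # everything else is deleted

def extract_dialogue(text: str) -> str:
    return PATTERN.sub(keep_or_drop, text)

# Shortened sample, assuming \r\n line endings and no empty lines.
sample = (
    '<Subtitle SpotNumber="1">\r\n'
    '<Text HAlign="center">这里只有一只</Text>\r\n'
    '</Subtitle>\r\n'
    '<Font Italic="yes">卡·克登场</Font>\r\n'
)

result = extract_dialogue(sample)
print(result)
```

Because the second alternative removes the file one character at a time, a Replace All has to perform a very large number of tiny replacements on a big file.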

Terry

guy038

Hello, @borderless-media, @peterjones, @terry-R and All,

An alternative to @terry-R’s solution could be :

Open your file or select the right tab
Move to the very beginning of the file ( Ctrl + Home )
Open the Replace dialog ( Ctrl + H )
Un-tick all the box options
SEARCH [^\x{4E00}-\x{9FFF}]+|(?-s)(.+)(?=</)
REPLACE ?1$0\r\n\r\n ( or ?1$0\n\n if you deal with Unix files )
Select the Regular expression search mode
Click on the Replace All button

Here you are !
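
For comparison, here is a hedged Python sketch of the same search and replacement. It is not what Notepad++ runs internally: the ?1$0 conditional replacement is emulated with a callback, \x{...} is written as \u escapes, and the names PATTERN, keep_or_drop and extract_dialogue are hypothetical. The sample includes one dialogue split across two lines to show how such a case comes out.

```python
import re

# First alternative: a run of non-Chinese characters (line breaks included).
# Second alternative: the rest of the line up to its last "</", as group 1.
PATTERN = re.compile(r"[^\u4e00-\u9fff]+|(.+)(?=</)")

def keep_or_drop(match: re.Match) -> str:
    # Emulates "?1$0\r\n\r\n": keep the whole match plus a blank line when
    # group 1 took part in the match, otherwise replace with nothing.
    if match.group(1) is not None:
        return match.group(0) + "\r\n\r\n"
    return ""

def extract_dialogue(text: str) -> str:
    return PATTERN.sub(keep_or_drop, text)

# Shortened sample with \r\n line endings; the second dialogue is split
# over two lines on purpose.
sample = (
    '<Text HAlign="center">这里只有一只</Text>\r\n'
    '</Subtitle>\r\n'
    '<Text HAlign="center">它们造出来\r\n的时候是一对</Text>\r\n'
)

result = extract_dialogue(sample)
print(result)
```

In the split dialogue, the Chinese characters before the line break match neither alternative, so they are left in place, while the line break itself is removed as a non-Chinese run. That is why the two halves end up joined on one line.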

I must admit that I initially did not think about the Chinese characters range. Special thanks for that clever idea, Terry ;-))

Best Regards,

guy038

P.S. :

This regex S/R also works if a range of Chinese chars is split over several lines. After the replacement, this range is displayed on a single line again !

Borderless Media

I want to thank you guys for this. Yesterday I tried Peter’s code first and it worked, but there were still some things left behind. Eventually I found a way to remove them, so great work either way.

Terry, for some reason my Notepad++ freezes when I use your method with Replace All.

Guy038, yours worked perfectly. It got rid of everything and also spaced out the lines. I intended to put them into a Word doc and realized that Word automatically spaces out lines, so it seems I put you guys through a bit more work than needed, but those codes got the job done. I don’t believe I’ll need further help on this.

Thanks again and warm regards