Copy, search and replace between 2 HTML files

guy038

Hi, @hientwi and All,

Ah… of course, it cannot work, because there is a random number of lines between each KOSMOS line ! So, here is another method, which should work fine, although it contains numerous steps ;-))

To begin with, from your pictures, I noticed that your file A contains 223,145 lines and I assume that your file B contains 895 lines only

OK, let’s go !

Open your two files A and B in Notepad++

Let’s suppose the following file A, containing only 5 lines KOSMOS, among the 223,145 lines of file A, then the input text :

Line 1
Line 2
Line 3
KOSMOS
Line 5
KOSMOS
Line 7
KOSMOS
Line 9
.....
.....
.....
.....
Line 223,139
KOSMOS
Line 223,141
Line 223,142
KOSMOS
Line 223,144
Line 223,145

Open the Column Editor
- Select Number to Insert
- Type in 1 in the following three zones
- Tick the Leading zeros option
- Verify the Dec format
- Click on the OK button

You should get :

000001Line 1
000002Line 2
000003Line 3
000004KOSMOS
000005Line 5
000006KOSMOS
000007Line 7
000008KOSMOS
000009Line 9
xxxxxx.....
xxxxxx.....
xxxxxx.....
xxxxxx.....
223139Line 223,139
223140KOSMOS
223141Line 223,141
223142Line 223,142
223143KOSMOS
223144Line 223,144
223145Line 223,145
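
For readers who want to see what this numbering step produces without opening the Column Editor, here is a minimal Python sketch. The function name `number_lines` and the sample list are made up for the illustration; the 6-digit width is taken from the 223,145-line example above.

```python
def number_lines(lines, width=6):
    """Prefix each line with its 1-based number, zero-padded to `width` digits."""
    return [f"{i:0{width}d}{line}" for i, line in enumerate(lines, start=1)]


# Hypothetical sample, mirroring the first lines of file A
sample = ["Line 1", "Line 2", "Line 3", "KOSMOS", "Line 5"]
numbered = number_lines(sample)
print(numbered[3])
```

The fixed width matters later on: the sort step only keeps lines in their original order if every prefix has the same number of digits.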

Now open the Mark dialog ( Search > Mark... option )
- SEARCH (?-i)KOSMOS
- Option Bookmark line ticked
- Option Purge for each search ticked, preferably
- Option Wrap around ticked
- Mode Regular expression selected
- Click on the Mark All

=> The 895 lines KOSMOS should be bookmarked

Then, run the option Search > Bookmark > Copy bookmarked Lines
Now, select your File B tab, containing also 5 lines, which will replace each KOSMOS line of file A

-- The Line 1 contents ( File B ) --
-- The Line 2 contents ( File B ) --
-- The Line 3 contents ( File B ) --
-- The Line 4 contents ( File B ) --
-- The Line 5 contents ( File B ) --

After the 895 lines of file B, add a separation line with, at least, 3 consecutive equal signs, so the string === with a line-break
Then paste the contents of the clipboard, with Ctrl + V ( so the 895 lines KOSMOS of file A )

Thus, file B should now contain 895 lines before the === line and 895 lines after it ( 5, in our example )

-- The Line 1 contents ( File B ) --
-- The Line 2 contents ( File B ) --
-- The Line 3 contents ( File B ) --
-- The Line 4 contents ( File B ) --
-- The Line 5 contents ( File B ) --
===
000004KOSMOS
000006KOSMOS
000008KOSMOS
223140KOSMOS
223143KOSMOS

Perform the following regex S/R, in the Replace dialog ( Ctrl + H )
- SEARCH (?-si).+(?=\R(?s:.+?\R){5}(.+))|(?s)===.+ ( Of course, use the quantifier {895}, instead of {5}, with your present file B )
- REPLACE ?1\1$0
- Option Wrap around ticked and Regular expression selected
- Click on the Replace All button

After 895 replacements ( 5, in our example ), we get, at once, the following text :

000004KOSMOS-- The Line 1 contents ( File B ) --
000006KOSMOS-- The Line 2 contents ( File B ) --
000008KOSMOS-- The Line 3 contents ( File B ) --
223140KOSMOS-- The Line 4 contents ( File B ) --
223143KOSMOS-- The Line 5 contents ( File B ) --
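
The effect of this search/replace is to glue the Nth numbered KOSMOS line in front of the Nth line of file B and to drop the separator and everything after it. A rough Python equivalent is sketched below; it does not reuse the regex itself, and the function name `pair_tags_with_replacements` is invented for this example.

```python
def pair_tags_with_replacements(buffer_text):
    """Join the Nth line found after the '===' separator with the Nth line before it."""
    lines = buffer_text.splitlines()
    # Index of the first separator line (three or more equal signs)
    sep_index = next(i for i, line in enumerate(lines) if line.startswith("==="))
    replacements = lines[:sep_index]
    tags = lines[sep_index + 1:]
    if len(replacements) != len(tags):
        raise ValueError("the two halves must contain the same number of lines")
    return [tag + replacement for tag, replacement in zip(tags, replacements)]


# Hypothetical two-line version of the file B buffer
buffer_text = (
    "-- The Line 1 contents ( File B ) --\n"
    "-- The Line 2 contents ( File B ) --\n"
    "===\n"
    "000004KOSMOS\n"
    "000006KOSMOS\n"
)
paired = pair_tags_with_replacements(buffer_text)
print("\n".join(paired))
```

The length check plays the same role as the {895} quantifier: if the two halves do not have the same number of lines, the pairing is meaningless.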

Then select all the contents of file B, with Ctrl + A
Copy it into the clipboard, with Ctrl + C
Select the file A tab
Paste the clipboard contents, after the last line of file A, with Ctrl + V

=> So, the file A contents are as below :

000001Line 1
000002Line 2
000003Line 3
000004KOSMOS
000005Line 5
000006KOSMOS
000007Line 7
000008KOSMOS
000009Line 9
xxxxxx.....
xxxxxx.....
xxxxxx.....
xxxxxx.....
223139Line 223,139
223140KOSMOS
223141Line 223,141
223142Line 223,142
223143KOSMOS
223144Line 223,144
223145Line 223,145
000004KOSMOS-- The Line 1 contents ( File B ) --
000006KOSMOS-- The Line 2 contents ( File B ) --
000008KOSMOS-- The Line 3 contents ( File B ) --
223140KOSMOS-- The Line 4 contents ( File B ) --
223143KOSMOS-- The Line 5 contents ( File B ) --

Now, sort the lines of file A, with the option Edit > Line Operations > Sort Lines Lexicographically Ascending

We get the following output :

000001Line 1
000002Line 2
000003Line 3
000004KOSMOS
000004KOSMOS-- The Line 1 contents ( File B ) --
000005Line 5
000006KOSMOS
000006KOSMOS-- The Line 2 contents ( File B ) --
000007Line 7
000008KOSMOS
000008KOSMOS-- The Line 3 contents ( File B ) --
000009Line 9
xxxxxx.....
xxxxxx.....
xxxxxx.....
xxxxxx.....
223139Line 223,139
223140KOSMOS
223140KOSMOS-- The Line 4 contents ( File B ) --
223141Line 223,141
223142Line 223,142
223143KOSMOS
223143KOSMOS-- The Line 5 contents ( File B ) --
223144Line 223,144
223145Line 223,145
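
Why the sort puts each replacement right after its KOSMOS line can be checked on a small sample. The sketch below uses Python's `sorted`, which compares strings character by character; this is assumed to behave like the lexicographic sort of Notepad++ for plain ASCII text. The sample lists are invented.

```python
# Numbered lines of a tiny, hypothetical file A
file_a = ["000003Line 3", "000004KOSMOS", "000005Line 5", "000006KOSMOS"]
# Tagged replacement lines, as produced in file B
tagged = [
    "000004KOSMOS-- The Line 1 contents ( File B ) --",
    "000006KOSMOS-- The Line 2 contents ( File B ) --",
]

# A string sorts just after any shorter string that is its prefix,
# so each tagged line lands directly below the KOSMOS line with the same number.
merged = sorted(file_a + tagged)
print("\n".join(merged))

# Without leading zeros the order would be wrong: "10" sorts before "9"
unpadded = sorted(["9Line 9", "10Line 10"])
```

This is also the reason for the Leading zeros option in the Column Editor step.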

Finally, run this last regex S/R :

SEARCH (?-is)^\d{6}|\h*KOSMOS\h*\R?
REPLACE Leave EMPTY

Here we are ! We have the expected output, below :

Line 1
Line 2
Line 3
-- The Line 1 contents ( File B ) --
Line 5
-- The Line 2 contents ( File B ) --
Line 7
-- The Line 3 contents ( File B ) --
Line 9
.....
.....
.....
.....
Line 223,139
-- The Line 4 contents ( File B ) --
Line 223,141
Line 223,142
-- The Line 5 contents ( File B ) --
Line 223,144
Line 223,145
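
The last search/replace can also be tried on a small sample outside Notepad++. Here is a sketch with Python's `re` module, under two assumptions: `[ \t]` stands in for `\h` and `\r?\n` for `\R`, since Python's `re` does not support these two escapes. The name `strip_numbers_and_markers` is made up.

```python
import re

# First alternative: the 6-digit prefix at the start of a line.
# Second alternative: the word KOSMOS, with optional blanks and an optional line-break.
CLEANUP = re.compile(r"^\d{6}|[ \t]*KOSMOS[ \t]*(?:\r?\n)?", re.MULTILINE)


def strip_numbers_and_markers(text):
    """Remove the numbering and every KOSMOS marker, as in the final S/R."""
    return CLEANUP.sub("", text)


# Hypothetical extract of the sorted file A
sorted_text = (
    "000003Line 3\n"
    "000004KOSMOS\n"
    "000004KOSMOS-- The Line 1 contents ( File B ) --\n"
    "000005Line 5\n"
)
cleaned = strip_numbers_and_markers(sorted_text)
print(cleaned)
```

A bare KOSMOS line disappears completely, line-break included, whereas on a tagged line only the number and the word KOSMOS are removed, which leaves the file B contents in place.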

If OK, I’ll explain the regexes syntax, next time !

See you later,

Best Regards,

guy038

HienTwi

Hi @guy038 and all,

Definitely, it works perfectly with @guy038's smart solution. Many many many thanks for your solution, which saves me a lot of time. It would be really nice if you could explain the regex syntax, when you have free time!

In addition, I want to split file A into 895 files based on “KOSMOS”. Could you please do me a further favor? For instance,

file 1: From the very beginning of file A to the first KOSMOS, but not including it.
file 2: From the 1st KOSMOS to the 2nd KOSMOS (not including the 2nd)
file 3, …, file 895 are similar to file 2. The last KOSMOS (895th) will be excluded.

Bests,
Kosmos

HienTwi

@astrosofista many thanks for your comments. The problem is solved with @guy038 solution.

astrosofista

@HienTwi

Good to know. Thank you for getting back to me.

Best Regards.

guy038

Hello, @hientwi, @astrosofista and All,

I’m quite confused, because I don’t see, exactly, the connexion between your previous goal and your new one.

Indeed, once your file A has been modified with our previous process, it no longer contains any KOSMOS line, as they have all been replaced with a specific line from file B. So, it would be more difficult to determine each section which would have to be saved in the 895 files !

On the other hand, if you decide to split the initial contents of file A into 895 files first, then you’ll have to replace the first KOSMOS line of each file with the appropriate line of file B, which seems to be more difficult than with my previous method !

Please, could you enlighten us ?

Best Regards,

guy038

HienTwi

Hi @guy038 and all,

Sorry that I made you and others confused. I have another purpose which is totally different from my previous question. It means that I have two copies of file A. The one I wanted to split into multiple files based on “KOSMOS”. The other is used for my previous question. They are totally different questions.

Best regards,
Kosmos

guy038

Hello, @hientwi, @astrosofista and All,

Sorry to be late ! So OK : these are two tasks absolutely different !

Well, as you would like to create files, regexes are not the right tool for such a task. Personally, I would use the Gawk application. So, if you do not have this program, yet :

Create a new folder
Download the gawk-5.0.1-w32-bin.zip archive from https://sourceforge.net/projects/ezwinports/files/
Double-click on the gawk-5.0.1-w32-bin.zip archive
Double-click on the bin folder
Extract only the 5 files gawk.exe, libgmp-10.dll, libmpfr-4.dll, libncurses5.dll and libreadline6.dll in the new folder
Copy your file A into that folder and rename it File_A.txt
With N++, just add a line KOSMOS, at the very beginning of File_A.txt
Open a DOS cmd window
Type in and run the following command :
- gawk "BEGIN {n=0} $0!=\"KOSMOS\" {print > \"File_\"n\".txt\"} $0==\"KOSMOS\" {n++}" File_A.txt
Wait a few moments … …

Et voilà ! You should see, in this new folder, 895 files from File_1.txt to File_895.txt ;-))
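
For those who would rather not install gawk, the same logic can be sketched in Python. This is only an illustration of what the command above does, not a replacement tested on the real file; the names `split_sections` and `write_sections` are invented, and the File_N.txt naming simply mirrors the gawk command.

```python
from pathlib import Path


def split_sections(lines, marker="KOSMOS"):
    """Group lines by section: the counter goes up at each marker line,
    and marker lines themselves are not copied (same logic as the gawk command)."""
    sections = {}
    n = 0
    for line in lines:
        if line == marker:
            n += 1
        else:
            sections.setdefault(n, []).append(line)
    return sections


def write_sections(sections, folder, stem="File_"):
    """Write each section to <folder>/<stem><n>.txt."""
    for n, body in sections.items():
        path = Path(folder) / f"{stem}{n}.txt"
        path.write_text("\n".join(body) + "\n", encoding="utf-8")


# Hypothetical sample with a KOSMOS line added at the very beginning
sample = ["KOSMOS", "Line 1", "Line 2", "KOSMOS", "Line 4"]
sections = split_sections(sample)
print(sections)

# Possible usage on the real file (paths are assumptions):
# lines = Path("File_A.txt").read_text(encoding="utf-8").splitlines()
# write_sections(split_sections(lines), ".")
```

As with the gawk command, any line located before the first KOSMOS line would end up in section 0, which is why a KOSMOS line is added at the very beginning of File_A.txt.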

Another possibility would be :

With N++, just add a line KOSMOS, at the very beginning of File_A.txt
Change, in your File_A.txt, each KOSMOS line into a pure empty line, with the regex :
- SEARCH (?-i)^KOSMOS(?=\R)
- REPLACE Leave EMPTY
Then, in your DOS window, you would run the following command :
- gawk "BEGIN {n=0} NF {print > \"File_\"n\".txt\"} !NF {n++}" File_A.txt

That’s all ! Powerful, isn’t it ?

Remark : I suppose that your file did not contain, initially, any true empty line !! ( may be searched with the regex ^\R )

For more information, you can download the latest PDF manual ( gawk v5.0 ) from https://www.gnu.org/software/gawk/manual/

Best Regards

guy038

P.S. :

In order to select each zone, beginning with a KOSMOS line, till the next KOSMOS line, excluded, of your File_A.txt, simply use the regex :

SEARCH (?-i)(KOSMOS)?(?s).+?(?=^KOSMOS\R|\z)
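
To see which zones this regex selects, here is a Python sketch of the same pattern. It assumes `\r?\n` as a stand-in for `\R` and uses `\Z` where the Notepad++ regex has `\z`; the group is made non-capturing because only the whole match is of interest here. The sample text and the name `find_zones` are invented.

```python
import re

# DOTALL lets the dot cross line-breaks; MULTILINE lets ^ match at each line start
ZONE = re.compile(r"(?:KOSMOS)?.+?(?=^KOSMOS\r?\n|\Z)", re.DOTALL | re.MULTILINE)


def find_zones(text):
    """Return each zone: an optional KOSMOS line plus everything up to the next one."""
    return [m.group(0) for m in ZONE.finditer(text)]


# Hypothetical miniature of File_A.txt
text = "Line 1\nKOSMOS\nLine 3\nLine 4\nKOSMOS\nLine 6\n"
zones = find_zones(text)
for zone in zones:
    print(repr(zone))
```

The lazy quantifier `.+?` is what stops each zone just before the next KOSMOS line instead of running to the end of the file.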

HienTwi

Dear @guy038 and all,

I am so sorry that I responded so late. It seems that everything can be solved with your help. Many thanks in advance and I will let you know later on.

Stay healthy and best regards,
Kosmos

Kosmos Huynh

Dear @guy038, dear all

Today, I have tried your first solution (File_B.txt which contains KOSMOS) and I got the error as in the following:

It is the same with your second solution (File_A.txt with blank lines) as well.

Could you please kindly do me a favor?

Many thanks in advance!
Bests,
Kosmos

Kosmos Huynh

Dear @guy038 ,

I found the solution by correcting the quotation marks, as follows:

gawk 'BEGIN {n=0} NF {print > "File_"n".txt"} !NF {n++}' File_A.txt

Best regards,
Kosmos.
