batch function - need to add filename at the end of each paragraph

Meta Chuh

welcome to the notepad++ community, @Marina-Susan

we would need more information to evaluate if it is possible to achieve what you need with notepad++.

if it is one (or many) text file(s) that you’d like to have modified by adding a filename to the corresponding lines, if the filename matches your criteria, we would need more information about your criteria.

please keep in mind that, even if we have all the needed information about your criteria for adding - filename (without extension) at specific lines of your text files before importing them to excel, we can not promise any success, as we can’t evaluate yet whether notepad++ is able to provide a solution using its built-in functions.

it might be of help if you could provide us with one of your text files (unless it contains sensitive data), together with some of your criteria “if … then” examples

Marina Susan

Hi, Sure let me give more context and specifics.

Some context: This is data that I am going to port into excel, but i want the filename after each paragraph so when i have all the data in excel (I have a few hundred files) and sort them, i know what file it came from. Originally I should have created tables in word for the data, but there’s so much data there now it would take me weeks (maybe longer) to convert everything into a table, adding the filenames manually. So I thought if I could add the filename at the end of each paragraph I could batch import it into excel, into one file, making it easier to track.

This is a very long list of website names and websites in each text file. So here’s some examples of what it looks like now:

website 1 - www.website1.com
website 2 - www.website2.com
website 3 - www.website3.com

and so on throughout a multitude of text files.

I would love it (just to reiterate) for the filename to be added to the end of each line with a dash and the filename, so i know what file it came from. ultimately i would love it to look like this:

website 1 - www.website1.com - filename1
website 2 - www.website2.com - filename1
website 3 - www.website3.com - filename1

Hopefully this answers your questions?

Meta Chuh

@Marina-Susan

here is a regular expression search and replace for your given example, assuming that the filename number should correspond to your website number.

note that it will probably only work on your example from above and has to be adapted to your real life data.

if you paste this content, based on your sample, into a new notepad++ tab for testing:

website 1 - www.website1.com
website 2 - www.website2.com
website 3 - www.website3.com
website 45 - www.website45.com

then go to the notepad++ menu: search > replace and copy/paste the following:

find what: ^(.*?)([1-9][0-9]*)(.*?)$
replace with: $1$2$3 - filename $2
select the search mode “regular expression” (important)

now if you hit “replace all”, your result will be:

website 1 - www.website1.com - filename 1
website 2 - www.website2.com - filename 2
website 3 - www.website3.com - filename 3
website 45 - www.website45.com - filename 45

a little explanation, if you have to adapt this regular expression (short name: regex) for your real data:

find what: ^(.*?)([1-9][0-9]*)(.*?)$:

^ tells regex to start searching at the beginning of every line (line by line)

(.*?) is a wildcard, meaning the next text can be anything, until the next condition is found. the content of this (.*?) is saved to string variable $1, needed to output it later at the “replace with” section

([1-9][0-9]*) this searches for any number after your line fragment website and saves it to string variable $2

(.*?) again a wildcard (containing everything after the number, so it’s - www.website123.com in this example), this part will be saved to variable $3

$ at the end of the search string, tells regex to stop at the end of a line instead of reading the whole document

replace with: $1$2$3 - filename $2:

this outputs the strings $1, $2 and $3 together and adds " - filename " and $2, which contains the number found between website and - www.website123.com
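For anyone who wants to try the same search and replace outside notepad++, here is a minimal sketch in Python. It assumes Python's `re` module behaves closely enough to the notepad++ regex engine for this simple pattern; the function name `append_site_number` is made up for illustration, and Python writes the replacement groups as `\1`, `\2`, `\3` instead of `$1`, `$2`, `$3`.

```python
import re

# Same pattern as in "find what"; re.MULTILINE makes ^ and $ work line by line.
PATTERN = re.compile(r"^(.*?)([1-9][0-9]*)(.*?)$", re.MULTILINE)


def append_site_number(text: str) -> str:
    """Append ' - filename <number>' to every line that contains a number."""
    # \1, \2 and \3 reproduce the original line; the final \2 repeats the number.
    return PATTERN.sub(r"\1\2\3 - filename \2", text)


if __name__ == "__main__":
    sample = "website 1 - www.website1.com\nwebsite 45 - www.website45.com"
    print(append_site_number(sample))
```

Lines without any digit 1-9 are left untouched, because the pattern cannot match them.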

i hope this helps you getting started, using regex in notepad++.

Terry R

@Marina-Susan and others

I have been mulling over this question and I think it is entirely possible to add the real filename (sans extension) to the end of every line in selected files using just the standard Notepad++ (NPP) app (no plugins required nor any programming language).

My version (used in this test) is the installed (not portable) 7.5.8 32-bit. There are a few steps involved, but none particularly complicated. I am running under Windows 10; if you aren’t, some steps below might need adjusting to suit your environment.

In words, the files to be processed need loading into NPP. Once loaded the filename for each tab (each file) is saved in a new last line of that tab, then the file is saved and closed. I have created a macro to achieve this (shown later). Running this for the number of tabs (files) updates, saves and closes all those files, end of 1st step.

In the second step we use a regex to process each of those files, taking the filename in the last line of each file, and appending that to the end of every line in that file without the extension. That should solve the problem.

So now to the specifics. I could explain the steps to create the macro, however the easiest method is to open NPP just for the adding of the macro (shown next) to a particular file. Once NPP is open, select the option to open a file, type in %appdata%\Notepad++\shortcuts.xml and press enter. See shortcut.xml image (alternative link: https://imgur.com/a/CTrXbLh ) for my file with the macro. Sorry, I typo’d the macro name (fielname append). To save typing in all those characters, take (copy, Ctrl-C) the following and insert (Ctrl-V) them into your shortcuts.xml file at the position shown. If you already have macros it might look slightly different, hopefully you can figure out where to do the insert. Macro is:

<Macro name="fielname append" Ctrl="no" Alt="no" Shift="no" Key="0">
            <Action type="0" message="2319" wParam="0" lParam="0" sParam="" />
            <Action type="0" message="2451" wParam="0" lParam="0" sParam="" />
            <Action type="1" message="2170" wParam="0" lParam="0" sParam="&#x000D;" />
            <Action type="1" message="2170" wParam="0" lParam="0" sParam="&#x000A;" />
            <Action type="2" message="0" wParam="42030" lParam="0" sParam="" />
            <Action type="0" message="2179" wParam="0" lParam="0" sParam="" />
            <Action type="2" message="0" wParam="41006" lParam="0" sParam="" />
            <Action type="0" message="2025" wParam="0" lParam="0" sParam="" />
            <Action type="0" message="2422" wParam="0" lParam="0" sParam="" />
            <Action type="0" message="2325" wParam="0" lParam="0" sParam="" />
            <Action type="2" message="0" wParam="41003" lParam="0" sParam="" />
        </Macro>

The macro does the following on a tab: it right-clicks on the tab’s tab (which is the filename), then selects the option “Filename to Clipboard”. Then the entire contents of that tab is selected, which positions the cursor at the end of the file; another “end” unselects the selection, leaving the cursor at the end; then 1 “enter” keystroke makes a new line, and we paste the clipboard contents (filename) onto the new last line. Next we save the tab, and then close it (Ctrl-S and Ctrl-W).

When all the files have been loaded (unknown how many can be loaded, but someone posted they had 2000 files loaded, albeit possibly small files), the number of tabs can be found by selecting “Window” in the top menu (hopefully)?! We run the macro named “fielname append” as many times as there are files loaded, by selecting “Macro” and then “Run a Macro Multiple Times”, choosing the correct macro and typing that number into the window.

If all works as expected no tabs will remain open in NPP and a check (of some randomly selected files) should show the files processed will have a new last line with that filename shown in that line.
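To make the effect of this first step concrete, here is a rough Python sketch of what the macro leaves behind in each file. It is not part of the NPP solution; the function name `append_filename_line`, the `*.txt` filter and the UTF-8 encoding are assumptions made for illustration only.

```python
from pathlib import Path


def append_filename_line(folder: str, pattern: str = "*.txt") -> int:
    """Append CRLF plus the file's own name as a new last line of each file.

    Returns the number of files touched. Like the macro, no line break is
    written after the filename.
    """
    count = 0
    for path in sorted(Path(folder).glob(pattern)):
        # newline="" stops Python from translating the explicit "\r\n".
        with open(path, "a", encoding="utf-8", newline="") as handle:
            handle.write("\r\n" + path.name)
        count += 1
    return count
```

Running it on a copy of the folder first would be a sensible precaution, since it modifies the files in place just as the macro does.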

Now we use a regex and the “Find in Files” option to change all these files. So we have:
Find What: ((.+)(\R)(?=(.+\R)+(.+)(\..*)))|(.+)(\R)(.+)(\..*)
Replace With: (?{2}\2 - \5\3)(?{7}\7 - \9)

Search mode MUST be “Regular expression”. For the other fields in “Find in Files”, select the folder you want to work with (the one containing the files you processed in the earlier step) and, if necessary, use the filter if not ALL files are to be altered.

The regex should start on the first line, then look ahead to the last line, grab the filename without the extension and insert it at the end of the first line behind a “<space>-<space>”. The filename HAS to have at least 1 <period> character (.), although more than one is okay, as only the last “.” and any subsequent characters are removed. The regex then repeats for all subsequent lines, but it works slightly differently on the last “real data” line: it moves the filename up to that line, again removing the extension.
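The intended result of this second step can be sketched line by line in Python, without any look-ahead. This is only an illustration of the outcome described above, assuming CRLF line endings and a file whose last line is the filename; the function name `apply_trailing_filename` is hypothetical.

```python
def apply_trailing_filename(text: str) -> str:
    """Move the filename on the last line onto the end of every data line.

    Assumes CRLF line endings and that the last line holds the filename,
    which must contain at least one '.'; only the last '.' and what
    follows it are dropped.
    """
    *data_lines, filename = text.split("\r\n")
    stem, dot, _extension = filename.rpartition(".")
    if not dot:
        raise ValueError("last line has no '.', so it is not a filename")
    # Like the regex, only non-empty lines receive the ' - <name>' suffix.
    tagged = [f"{line} - {stem}" if line else line for line in data_lines]
    return "\r\n".join(tagged)


if __name__ == "__main__":
    before = "website 1 - www.website1.com\r\nwebsite 2 - www.website2.com\r\nnew 1.txt"
    print(apply_trailing_filename(before))
```

Because it splits the text once instead of looking ahead from every line, a sketch like this does not depend on how far the last line is from the first.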

Well, that looks quite complicated doesn’t it, but in reality ALL steps are simple, I’ve just spent a lot of typing trying to FULLY explain the steps.

I suggest FULLY reading this solution and if anything is not understood please do come back to here and post another question. There is one possible issue you might encounter and it relates to the “lookahead” in the regex. There have been numerous postings of issues if the file is large and number of lines looking ahead anywhere from about 10000 and upwards. It’s not possible to know in advance when the problem will arise, we ALL just hope we never encounter it.

Good luck
Terry

guy038

Hello, @Terry-R and All,

Regarding the macro, I would shorten it as below :

        <Macro name="Filename Append" Ctrl="no" Alt="no" Shift="no" Key="0">
            <Action type="0" message="2318" wParam="0"     lParam="0" sParam="" />          <!-- SCI_DOCUMENTEND         -->
            <Action type="1" message="2170" wParam="0"     lParam="0" sParam="&#x000D;" />  <!-- Adding CR               -->
            <Action type="1" message="2170" wParam="0"     lParam="0" sParam="&#x000A;" />  <!-- Adding LF               -->
            <Action type="2" message="0"    wParam="42030" lParam="0" sParam="" />          <!-- IDM_EDIT_FILENAMETOCLIP -->
            <Action type="0" message="2179" wParam="0"     lParam="0" sParam="" />          <!-- SCI_PASTE               -->
            <Action type="2" message="0"    wParam="41006" lParam="0" sParam="" />          <!-- IDM_FILE_SAVE           -->
            <Action type="2" message="0"    wParam="41003" lParam="0" sParam="" />          <!-- IDM_FILE_CLOSE          -->
        </Macro>

Indeed, I don’t see the benefit of the three messages 2025 ( SCI_GOTOPOS ), 2422 ( SCI_SETSELECTIONMODE ) and 2325 ( SCI_CANCEL ) !

Now, Terry, I tried your regex S/R and there’s a small bug, during replacement, about line ending chars ! We get, for instance :

CR
- FileNameLF

Personally, I didn’t fully understand what the OP wants to achieve ! She said :

Hi, I have a bunch of text files I need to add the filename at the end of each paragraph to.

and

… i want the filename after each paragraph …

To my mind, the term paragraph should refer to an area of text surrounded by at least two line-breaks, or located at the very start/end of the document

Don’t forget that, after running Terry’s macro on multiple files of a specific directory, every file ends with a line-break followed by the filename, including its extension, and no further line-break

So, let’s suppose the text, below, in a file, named TEST.TXT :

This is a
simple text
for better
understanding



This is a
simple text
for better
understanding

This is a
simple text
for better
understanding
TEST.TXT

with NO line-break, after the last line TEST.TXT

Would the OP like the output A, below :

This is a
simple text
for better
understanding - TEST



This is a
simple text
for better
understanding - TEST

This is a
simple text
for better
understanding - TEST

In this case, a possible regex S/R could be :

SEARCH (?-s)(\R\R+)(?=(?s).*\R(.+)\..*)|(\R)(?=(.+)\..*)|.+\z

REPLACE (?1\x20-\x20\2\1)(?3\x20-\x20\4\3)

Maybe the OP would prefer the following B output :

This is a - TEST
simple text - TEST
for better - TEST
understanding - TEST
 - TEST
 - TEST
 - TEST
This is a - TEST
simple text - TEST
for better - TEST
understanding - TEST
 - TEST
This is a - TEST
simple text - TEST
for better - TEST
understanding - TEST

In this second case, a possible regex S/R could be :

SEARCH (?-s)(\R)(?=(?:(?s).*\R)?(.+)\..*)|.+\z

REPLACE ?1\x20-\x20\2\1

Finally, the OP might expect this C output :

This is a - TEST
simple text - TEST
for better - TEST
understanding - TEST



This is a - TEST
simple text - TEST
for better - TEST
understanding - TEST

This is a - TEST
simple text - TEST
for better - TEST
understanding - TEST

In this final case, a possible regex S/R would be :

SEARCH (?-s)(\R+)(?=(?:(?s).*\R)?(.+)\..*)|.+\z

REPLACE ?1\x20-\x20\2\1
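The difference between the three interpretations A, B and C may be easier to see in a small Python sketch that works on plain lines instead of a regex. It is only an illustration: the function name `tag_lines` is made up, the text is assumed to use LF line endings, and the filename is passed in separately instead of being read from the last line.

```python
from pathlib import PurePath


def tag_lines(text: str, filename: str, mode: str) -> str:
    """Append ' - <filename without extension>' according to mode A, B or C.

    A: only the last line of each paragraph (block of non-blank lines)
    B: every line, blank lines included
    C: every non-blank line
    """
    stem = PurePath(filename).stem
    lines = text.split("\n")
    result = []
    for index, line in enumerate(lines):
        is_blank = line == ""
        # A paragraph ends when the next line is blank or the text ends.
        ends_paragraph = index + 1 == len(lines) or lines[index + 1] == ""
        if mode == "A":
            tag = not is_blank and ends_paragraph
        elif mode == "B":
            tag = True
        elif mode == "C":
            tag = not is_blank
        else:
            raise ValueError("mode must be 'A', 'B' or 'C'")
        result.append(f"{line} - {stem}" if tag else line)
    return "\n".join(result)


if __name__ == "__main__":
    sample = "This is a\nsimple text\n\nfor better\nunderstanding"
    for mode in "ABC":
        print(f"--- case {mode} ---")
        print(tag_lines(sample, "TEST.TXT", mode))
```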

Best Regards,

guy038

Terry R

@guy038 and all

Thanks Guy for checking my solution. It’s always helpful to have another double check the results.
For the first part, my keystrokes, were to select the entire content of the file using (I think) ‘Ctrl-Shift-End’, then another ‘End’ to deselect and position cursor at end, then the carriage return/line feed ready for the filename to be inserted. I didn’t actually check what each keystroke coded into at the macro stage, just that it worked. As a matter of interest what does 2318 (SCI_DOCUMENTEND) equate to as a keystroke? I couldn’t find anything that would ‘immediately’ position at end of doc so used the key combo I listed above.

As for the OP’s statement about “…add the filename at the end of each paragraph…”, I agree that can be up for interpretation, however clearly the OP’s example showed each line as having a different website and their desired result also showed the filename on each line. So I took it to mean each line. I recall we’ve had several OP’s mention paragraphs and in each case they really meant each line.

I think if people come from a writing background (used to writing in MS Word or equivalent) they will only ever use an ‘enter’ key at the end of a paragraph, this is what I term a ‘hard return’. All other lines have ‘soft returns’ on them, which allows the software (MS Word) to adjust line endings based on font size and width or writing area.

I’m not sure about the bug in my regex. As it shows in the post, I ran it on a small text file:
File content as saved by macro

website 1-1 - www.website1.comCRLF
website 2-1 - www.website2.comCRLF
website 3-1 - www.website3.comCRLF
new 1.txt

file content as updated by regex

website 1-1 - www.website1.com - new 1CRLF
website 2-1 - www.website2.com - new 1CRLF
website 3-1 - www.website3.com - new 1

In my example above I’ve manually inserted the CR and LF where they appear for me.
Are you able to identify how we have different results?

Terry

guy038

Hi, @terry-r and All,

1st point :

If you open the Shortcut Mapper, and click on the Scintilla Commands tab, you’ll see that SCI_DOCUMENTEND function, at line 71, is simply mapped to the… Ctrl + End shortcut ;-))

2nd point :

Thanks for your explanations regarding the term “paragraph”. So, according to your interpretation, it should refer to my cases B) or C) of my previous post. Now, what about pure blank lines ? I noticed that your regex does not match these blank lines. Hence, the 3rd point below :

3rd point :

Let’s suppose the text :

This is a
simple text
for better
understanding



This is a
simple text
for. better
understanding

This is a
simple text
for better
understanding
TEST.TXT

Just note that I added a dot, in the sentence for. better, in the middle section

Now, processing your regex S/R, below, against this text above :

SEARCH ((.+)(\R)(?=(.+\R)+(.+)(\..*)))|(.+)(\R)(.+)(\..*)

REPLACE (?{2}\2 - \5\3)(?{7}\7 - \9)

we get the following text :

This is a
simple text
for better
understanding



This is a - for
simple text - for
understanding

This is a - TEST
simple text - TEST
for better - TEST
understanding - TEST

There are two problems :

As you’ve just taken into account non-blank lines, in your regex, it seems to behave ( irony ! ) just as my old interpretation of paragraphs !

Due to the dot inserted in the sentence for. better of the middle section, your regex has supposed that the two lines "This is a" and "simple text" were followed by the filename for with the extension . better :-((

This explains why I noticed a weird behavior of your regex ! In my first tests, I used the change.log file, which, unfortunately, contains some dots, for instance, in the numbered list of all the bug-fixes ! This leads to the last point, below.

4th point :

Due to the macro process, we are sure that the filename, added at the very end of each file, is never followed by a line-break. So, we would rather anchor the location of this filename to the \z assertion, which stands for the very end location, in each file, in order to avoid possible problems with lines containing dots !

Moreover, the use of the (?-s) modifier assure us that the part (.+)\..*\z refers to the filename, without any ambiguity !

So, the 3 search regexes, of my previous post, must be modified as below :

Case A (?-s)(\R\R+)(?=(?s).*\R(.+)\..*\z)|(\R)(?=(.+)\..*\z)|.+\z

Case B (?-s)(\R)(?=(?:(?s).*\R)?(.+)\..*\z)|.+\z

Case C (?-s)(\R+)(?=(?:(?s).*\R)?(.+)\..*\z)|.+\z
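The benefit of anchoring the filename to the very end of the file can be reproduced with a short Python sketch. Note the assumptions: Python's `re` module is used instead of the N++ regex engine, Python spells the "very end of text" assertion `\Z` where N++ uses `\z`, and the patterns are reduced to the filename part only.

```python
import re

text = "This is a\nsimple text\nfor. better\nunderstanding\nTEST.TXT"

# Without an end-of-text anchor, the first line containing a dot wins.
unanchored = re.compile(r"^(.+)\..*$", re.MULTILINE)

# With \Z (Python's counterpart of \z), only the very last line can match.
anchored = re.compile(r"^(.+)\..*\Z", re.MULTILINE)

wrong_name = unanchored.search(text).group(1)
right_name = anchored.search(text).group(1)

print(repr(wrong_name))  # taken from the line "for. better"
print(repr(right_name))  # taken from the last line "TEST.TXT"
```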

Cheers,

guy038

Terry R

@guy038

thanks for showing me where the scintilla commands are available in NPP. I did go to the shortcut mapper, but then (in my defense) I looked down, not up to the tabs across the top. So I was ONLY looking at the ‘main menu’ shortcuts. Well, I did say (to another poster) that I’m still learning ;-))

I get the issue with my regex if the data becomes more ‘paragraph’ orientated and includes blank lines, and also with not asserting the lookahead to retrieve the filename off the very last line using \z.

What I don’t get was you said:
CR
- FilenameLF

Where did that come from? What I showed in reply was that I couldn’t see how my regex managed to split the CR and LF and insert the filename between?

The input I expect (although the OP has never come back with a better description) is that EVERY line will be of the format:
website 1 - www.website1.com
website 2 - www.website2.com
website 3 - www.website3.com
My solution hasn’t attempted to cater for any non website lines, nor additional blank lines.

I think at this point we await a reply from @Marina-Susan on whether any of our solutions has helped, or whether she wants to supply more examples so we could (better) help her.

Cheers
Terry

guy038

@terry-R,

I understood why I was able to split the CR-LF set, with your regex !

I remember that, after testing your regex, I tried several versions of my own and, this time, I noticed that the . matches newline option was checked

So, I unchecked that option and decided to add the (?-s) modifier, at beginning of my regexes.

This explains the text obtained, after replacement, with your regex :

CR
- FileNameLF

with, as sample text, the change.log file, where I added, at the very end, the name change.log ( see below, where I explicitly wrote the CRLF line-endings ) :

Notepad++ v7.6.3 new enhancements and bug-fixes:CRLF

1.  Add Markdown language (Markdown++: https://github.com/Edditoria/markdown-plus-plus), in UDL, included only in installer.CRLF
.....
.....
.....
.....
Updater (Installer only):CRLF
CRLF
* WinGup (for Notepad++) v5.1CRLF
CRLF
change.log

The part of your regex, involved in this result is your first alternative :

SEARCH ((.+)(\R)(?=(.+\R)+(.+)(\..*)))

REPLACE (?{2}\2 - \5\3)

To explain this “pseudo” bug, just simplify this S/R, as below :

SEARCH (.+)(\R)(?=.+\R(change)\.log)

REPLACE \1 - \3\2

So, considering that :

Your regex does not contain any leading in-line modifier
The . matches newline option has been checked ( by mistake), so the dot can match, for example, the CR character
Keeping in mind that \R may match, either CRLF, or CR or LF

This implies :

Group 1 matches from "Notepad++ v7.6.3 " to “WinGup (for Notepad++) v5.1CR”
Group 2 matches the LF character, which follows the CR of group 1
Part .+, in the look-ahead, matches the single CR char of the next blank line
Part \R, in the look-ahead, matches the single LF char of the next blank line
Group 3 matches the literal change string
Finally, \.log matches the .log literal string

Therefore, after a first replacement action, we, logically, obtain :

Notepad++ v7.6.3......................WinGup (for Notepad++) v5.1CR

- changeLF
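The same effect can be reproduced outside N++ with a small Python sketch of the simplified S/R. Several things here are assumptions for the sake of the demo: Python's `re` has no `\R`, so `(?:\r\n|\r|\n)` stands in for it; Python's plain `.` still matches CR, so the unchecked option is emulated with `[^\r\n]` and the checked option with `[\s\S]`; the function name `simplified_sr` and the three-line sample text are made up.

```python
import re

NEWLINE = r"(?:\r\n|\r|\n)"  # stand-in for \R, which Python's re lacks


def simplified_sr(text: str, dot_matches_newline: bool) -> str:
    """Run the simplified S/R with either meaning of the dot."""
    # [\s\S] emulates a checked ". matches newline" option,
    # [^\r\n] emulates the unchecked option (or a leading (?-s)).
    dot = r"[\s\S]" if dot_matches_newline else r"[^\r\n]"
    pattern = rf"({dot}+)({NEWLINE})(?={dot}+{NEWLINE}(change)\.log)"
    return re.sub(pattern, r"\1 - \3\2", text)


sample = "first\r\nlast\r\nchange.log"

# Option unchecked: the CRLF pair stays together behind the inserted name.
print(repr(simplified_sr(sample, dot_matches_newline=False)))

# Option checked: group 1 swallows the CR, group 2 keeps only the LF.
print(repr(simplified_sr(sample, dot_matches_newline=True)))
```

In the second call the inserted text lands between the CR and the LF, which is the split shown above.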

Cheers,

guy038

Meta Chuh

@Terry-R @guy038 and all

@Terry-R: I think at this point we await a reply from @Marina-Susan on whether any of our solutions has helped, or whether she wants to supply more examples so we could (better) help her.

i second that, and i guess this thread became another “after work” playground of ours 😉
albeit some of those playgrounds for regulars, have been very productive lately, due to a higher, non simplified, level of information exchange 👍

i hope @Marina-Susan does not mind (@Marina-Susan you can kick in with any questions anytime)