Find and replace line not starting with pattern and copy text from previous line

nitin jain

I have CSV text file which have Chat logs starting with “[” and after that it contains system date & time, there are few lines which not starting with square brackets as chat user may have sent multiples line in that message. Now i want to find all those lines and copy the previous line date & time with sender name in starting of that.

how do I find lines not starting with “[” and than replace text from previous line user with “[ ]” time stamp.

Here is input and desired output file .

INPUT

[25/11/19, 16:26:33] Roger: Not received mail
[25/11/19, 16:27:04] Niks: Refresh
[25/11/19, 16:28:12] Roger: Plz send again
[25/11/19, 16:28:55] Niks: ok sent
[25/11/19, 16:29:14] Roger: Received ok thanks
[25/11/19, 16:29:38] Niks: 👍🏻
‎[26/11/19, 13:20:31] Roger: ‎<attached: 00000110-PHOTO-2019-11-26-13-20-31.jpg>
[26/11/19, 13:20:57] Roger: For 60000 units
balance trf 59160
[26/11/19, 13:30:55] Niks: Units Batch code
SEGX-TWHB5Z 2500
3CRD-QAMXD9 2500
E4ZY-7HNK35 2500
SGMV-FR4P5Y 2500
[26/11/19, 13:32:55] Roger: Ok thanks

============================================
Output![alt text](image url)

[25/11/19, 16:26:33] Roger: Not received mail
[25/11/19, 16:27:04] Niks: Refresh
[25/11/19, 16:28:12] Roger: Plz send again
[25/11/19, 16:28:55] Niks: ok sent
[25/11/19, 16:29:14] Roger: Received ok thanks
[25/11/19, 16:29:38] Niks: 👍🏻
‎[26/11/19, 13:20:31] Roger: ‎<attached: 00000110-PHOTO-2019-11-26-13-20-31.jpg>
[26/11/19, 13:20:57] Roger: For 60000 units
[26/11/19, 13:20:57] Roger: balance trf 59160
[26/11/19, 13:30:55] Niks: Units Batch code
[26/11/19, 13:30:55] Niks: SEGX-TWHB5Z 2500
[26/11/19, 13:30:55] Niks: 3CRD-QAMXD9 2500
[26/11/19, 13:30:55] Niks: E4ZY-7HNK35 2500
[26/11/19, 13:30:55] Niks: SGMV-FR4P5Y 2500
[26/11/19, 13:32:55] Roger: Ok thanks

Finding lines in Yellow color & replace with Orange text which is taken from previous lines.

Thanks in advance

PeterJones

@nitin-jain ,

I came up with something that mostly works, except for on the line after the 👍🏻. I haven’t figured out why that is messing up my regex:

FIND = (?-s)^(\x5B.*?\x5D.*?:\h*).*$\R\K(?!\x5B)
REPLACE = $1
SEARCH MODE = regular expression
REPLACE ALL multiple times, until it’s really all done

The line after the thumbs-up has difficulty because, as @Alan-Kilborn found a couple years back, such chat logs will occasionally have the U+200E LEFT-TO-RIGHT MARK peppered throughout… and that photo line contains two instances. So before running the regex I showed, also replace \x{200E} with empty … or I’d suggest [\x{200B}-\x{200F}\x{202A}-\x{202F}] with empty (which will get rid of other zero-width characters).

So run the zero-width replacement first, then REPLACE ALL multiple times on the first one I shared until all the lines have that. (If your longest block is four lines that don’t start with [, you will have to REPLACE ALL four times)

guy038

Hello, @nitin-jain, @peterjones and All,

Oh… Peter beats me at it ! Here is my solution, quite similar !

If you are sure that all the dates are in increasing order, simply use the following regex S/R :

Open the Replace dialog ( Ctrl + H )
SEARCH (?-s)^(\[.+\]).+\R\K(?=\w)
REPLACE \1$0\x20
Tick the Wrap around option
Select the Regular repression search mode
Click several times on the Replace All button, till you see the message 0 occurrences were replaced in entire file

So, from the INPUT text :

[25/11/19, 16:26:33] Roger: Not received mail
[25/11/19, 16:27:04] Niks: Refresh
[25/11/19, 16:28:12] Roger: Plz send again
[25/11/19, 16:28:55] Niks: ok sent
[25/11/19, 16:29:14] Roger: Received ok thanks
[25/11/19, 16:29:38] Niks: 👍🏻
‎[26/11/19, 13:20:31] Roger: ‎<attached: 00000110-PHOTO-2019-11-26-13-20-31.jpg>
[26/11/19, 13:20:57] Roger: For 60000 units
balance trf 59160
[26/11/19, 13:30:55] Niks: Units Batch code
SEGX-TWHB5Z 2500
3CRD-QAMXD9 2500
E4ZY-7HNK35 2500
SGMV-FR4P5Y 2500
[26/11/19, 13:32:55] Roger: Ok thanks

you’ll get the expected OUTPUT result :

[25/11/19, 16:26:33] Roger: Not received mail
[25/11/19, 16:27:04] Niks: Refresh
[25/11/19, 16:28:12] Roger: Plz send again
[25/11/19, 16:28:55] Niks: ok sent
[25/11/19, 16:29:14] Roger: Received ok thanks
[25/11/19, 16:29:38] Niks: 👍🏻
‎[26/11/19, 13:20:31] Roger: ‎<attached: 00000110-PHOTO-2019-11-26-13-20-31.jpg>
[26/11/19, 13:20:57] Roger: For 60000 units
[26/11/19, 13:20:57] balance trf 59160
[26/11/19, 13:30:55] Niks: Units Batch code
[26/11/19, 13:30:55] SEGX-TWHB5Z 2500
[26/11/19, 13:30:55] 3CRD-QAMXD9 2500
[26/11/19, 13:30:55] E4ZY-7HNK35 2500
[26/11/19, 13:30:55] SGMV-FR4P5Y 2500
[26/11/19, 13:32:55] Roger: Ok thanks

Best regards,

guy038

guy038

Hi, all,

I suppose I find out a bug, in our Boost regex engine !

Let’s take this simple example tet :

[]xyz
‎[]xyz
[]xyz
ABC
[]xyz
DEF

Now, the four regexes below, should look for the literal string [] beginning a line, followed with any non-null string and its line-ending chars, ONLY IF not followed with a leading [ symbol !

(?-s)^\\[\\].+\R(?!\\[)

(?-s)^\\[\\].+\r\n(?!\\[)

(?-s)^\x5b\x5d.+\R(?!\x5b)

(?-s)^\x5b\x5d.+\r\n(?!\x5b)

Thus, it should only match the txo lines, below :

The []xyz line before the string ABC
The []xyz line before the string DEF

But, unfortunately, it also matches the first line []xyz ???

Am I wrong in any way, in this matter ?

BR

guy038

Alan Kilborn

@guy038 said in Find and replace line not starting with pattern and copy text from previous line:

But, unfortunately, it also matches the first line []xyz ???
Am I wrong in any way, in this matter ?

When I copy -n-paste your black box data, there is an LRM in it, which seems to cause your erroneous match!

PeterJones

@alan-kilborn said in Find and replace line not starting with pattern and copy text from previous line:

there is an LRM in it, which seems to cause your erroneous match!

Indeed. I was originally going to ask Guy why my regex wasn’t working with the supplied data (the same question Guy asked us), when I happened to left arrow from the [ and stayed on that same line! That told me there was a hidden character, which is why I ran the reveal-hidden-characters script from the old conversation, and I saw the infamous LRM – which is why I added the paragraph to tell @nitin-jain to do the zero-width search/replace before doing the main search/replace.

guy038

Hi, @nitin-jain, @peterjones, @alan-kilborn and All,

Ah ah ! Alan, I, first, didn’t understand why you had the LRM sigle in the second line of my text. My second thought was that you created a Python script to make all these fancy Unicode format characters clearly visible ! But, luckily, marking any \x{200e} character did the trick and showed me a thin red mark when this special char is present !

So, @nitin-jain, as @peterjones said, use this simple regex S/R, below, to get rid of these format characters !

SEARCH [\x{200B}-\x{200F}\x{202A}-\x{202F}]

REPLACE Leave EMPTY

However, verify that this operation does not break down your text in any way ! I personally saw this case, while pasting Unicode characters from a long list, produced by this excellent and valuable site, regarding Unicode :

https://r12a.github.io/uniview/

Now, I’m pleased to note that there is no bug of our Boost regex engine, in this matter, as that special LRM char is quite a character different from a [ symbol !

BR

guy038

Alan Kilborn

@guy038 said in Find and replace line not starting with pattern and copy text from previous line:

Alan, … you created a Python script to make all these fancy Unicode format characters clearly visible

Well, yes, I did. :-)

Ahamed Nawas Ali

@PeterJones I have a similar scenario where i have 10K lines i need to fix, is there any shorter way? Also, is there any way we can unmark the line number for those identified lines which does not start with a pattern.

Example: My line number starts with datetime (2021-09-14T21:10:55+00:00)

And can i make all these lines which does not start with “2021-” without line numbers provided by notepad++?

PeterJones

@Ahamed-Nawas-Ali said in Find and replace line not starting with pattern and copy text from previous line:

is there any shorter way?

The way described above is reasonably short. I am not sure what “improvement” you think is necessary (or even possible).

is there any way we can unmark the line number for those identified lines which does not start with a pattern.

Sorry, I don’t understand how that’s different than the original question.

You’ll have to give a better example – use the </> button on the toolbar when you are writing the post to create pairs of ```, between which you can paste your actual data:, something like

**data I have**:
```
[1234] abcxyz
next line
[5678] pdq
aonther
```

**desired data after transformation**
```
[1234] abcxyz
[1234] next line
[5678] pdq
[5678] aonther
```

…
This would be rendered as the following, so we know exactly what your “before” and “after” data needs to be.
—
data I have:

[1234] abcxyz
next line
[5678] pdq
aonther

desired data after transformation

[1234] abcxyz
[1234] next line
[5678] pdq
[5678] aonther

----

Useful References

Ahamed Nawas Ali

@PeterJones, Thanks for your reply. I am sorry, I am new to this platform.

Example scenario i am dealing with is with Date_Time Sender Recepients Message delimited with ‘Tab’

2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	Learning
Selection
B. Home
Webinar 
IDB
20214980
202216
2021-09-15T11:19:14+00:00	Ahamed Ali	Nawas	Thanks!

And i should make it like below

2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	Learning
2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	Selection
2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	B. Home
2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	Webinar 
2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	IDB
2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	20214980
2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	202216
2021-09-15T11:19:14+00:00	Ahamed Ali	Nawas	Thanks!

Alan Kilborn

@Ahamed-Nawas-Ali said :

Example scenario i am dealing with is with Date_Time Sender Recepients Message delimited with ‘Tab’

For this one, I considered the following to be enough to distinguish a timestamp line leader: 2021-09-15T

Thus I tried (based upon @guy038’s solution earlier in this thread):

Find: (?-s)^(\d{4}-\d\d-\d\dT.+\t).+\R\K(?!\d{4}-\d\d-\d\dT)
Replace: ${1}
Options: Wrap around, Regular expression
Action: Replace All (multiple times, until no more changes occur)

And after several Replace All presses, I obtained the following:

2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	Learning
2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	Selection
2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	B. Home
2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	Webinar 
2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	IDB
2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	20214980
2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	202216
2021-09-15T11:19:14+00:00	Ahamed Ali	Nawas	Thanks!
2021-09-15T11:19:14+00:00	Ahamed Ali	Nawas

Note that the last line of this output is “extra” and should be manually removed.

Ahamed Nawas Ali

@Alan-Kilborn I tried with below in order to keep clicking on the buttons to replace everything and it worked in removing the line numbers however the strings are concatenated.

Find box: \n([^2021-])
Replace box: $1

Result:

2021-09-14T21:10:55+00:00	Nawas	Ram Kumar,Ahamed Ali	LearningSelectionB. HomeWebinar IDB20214980202216
2021-09-15T11:19:14+00:00	Ahamed Ali	Nawas	Thanks!

And the results are messed up a bit. Anyway, thank you so much for the time stamp line leader and for now, i will have to use it anyway to avoid further delay in my project! Thanks @guy038 & @PeterJones for your guidance! Greatly appreciate your guidance to this community! God bless you all!

Alan Kilborn

@Ahamed-Nawas-Ali said:

\n([^2021-])

That’s totally wrong for what you’re wanting… in several ways…
But since you seem to be in a hurry…and you can’t reasonably do anything with regex in a hurry…I won’t explain and I’ll just wish you good luck.

Ahamed Nawas Ali

@Alan-Kilborn Sorry Alan! I know i am wrong with that “\n([^2021-])” as it will spoil my delimiter as well and there could be some other issues as well. Its true that one can’t learn Regex in a hurry! I am using yours snippet and thank you for that!

guy038

Hello, @ahamed-nawas-ali, @peterjones, @alan-kilborn and All,

@ahamed-nawas-ali, I’ll use a similar search regex to the @alan-kilborn’s one !

For example , given this INPUT text , below :

2021-09-14T21:10:55+00:00	ATX	Field3	Guy	Field5	Learning
Selection
B. Home
Webinar 
IDB
20214980
2021420214202216
2021-09-15T11:19:14+00:00	BYQ	Field3	Alan	Field5	Test
B. Home
Webinar 
IDB
20214980
2021-09-16T15:07:46+00:00	ATX	Field3	Peter	Field5	Try
Selection
B. Home
Webinar 
IDB
20214980
2021420214202216
Blablah
OK
END of story

Open the Replace dialog ( Ctrl+H )
Uncheck all box options
Search (?-s)^(\d{4}-.+\t).+\R\K(?!\d{4}-|\R|\z)
Replace $1
If necessary, check the Wrap around option
Select the Regular expression search mode
Click, exclusively, on the Replace All button, several times, till the message Replace All: 0 occurrences were replaced... is displayed !

At the end, you should get this expected OUTPUT text :

2021-09-14T21:10:55+00:00	ATX	Field3	Guy	Field5	Learning
2021-09-14T21:10:55+00:00	ATX	Field3	Guy	Field5	Selection
2021-09-14T21:10:55+00:00	ATX	Field3	Guy	Field5	B. Home
2021-09-14T21:10:55+00:00	ATX	Field3	Guy	Field5	Webinar 
2021-09-14T21:10:55+00:00	ATX	Field3	Guy	Field5	IDB
2021-09-14T21:10:55+00:00	ATX	Field3	Guy	Field5	20214980
2021-09-14T21:10:55+00:00	ATX	Field3	Guy	Field5	2021420214202216
2021-09-15T11:19:14+00:00	BYQ	Field3	Alan	Field5	Test
2021-09-15T11:19:14+00:00	BYQ	Field3	Alan	Field5	B. Home
2021-09-15T11:19:14+00:00	BYQ	Field3	Alan	Field5	Webinar 
2021-09-15T11:19:14+00:00	BYQ	Field3	Alan	Field5	IDB
2021-09-15T11:19:14+00:00	BYQ	Field3	Alan	Field5	20214980
2021-09-16T15:07:46+00:00	ATX	Field3	Peter	Field5	Try
2021-09-16T15:07:46+00:00	ATX	Field3	Peter	Field5	Selection
2021-09-16T15:07:46+00:00	ATX	Field3	Peter	Field5	B. Home
2021-09-16T15:07:46+00:00	ATX	Field3	Peter	Field5	Webinar 
2021-09-16T15:07:46+00:00	ATX	Field3	Peter	Field5	IDB
2021-09-16T15:07:46+00:00	ATX	Field3	Peter	Field5	20214980
2021-09-16T15:07:46+00:00	ATX	Field3	Peter	Field5	2021420214202216
2021-09-16T15:07:46+00:00	ATX	Field3	Peter	Field5	Blablah
2021-09-16T15:07:46+00:00	ATX	Field3	Peter	Field5	OK
2021-09-16T15:07:46+00:00	ATX	Field3	Peter	Field5	END of story

Voila :-))

Notes :

As you can see, the number of columns, before the last one, is not a problem !
From beginning of line ( ^ ), the regex looks for a line beginning with 4 digits, followed with a dash character (\d{4}- ) and anything else till the last tabulation ( .+\t ) of current line
This search, so far, is memorized and stored as group 1
After the last field of the line and the line-break ( .+\R ), all the matched string is discarded ( \K )
Thus, the regex engine is now searching for a zero-length string, at beginning of the next line, but ONLY IF this next line does not begin with :
- 4 digits and a dash char
- An other line-break
- The very end of current file
When this assertion is true, it just inserts the group 1 contents at the very beginning of current line

Best Regards

guy038

P.S. :

If the condition to detect the header lines seems not restrictive enough, you may use this alternate search regex :

Search (?-is)^(20\d\d-\d\d-\d\dT.+\t).+\R\K(?!20\d\d-\d\d-\d\dT|\R|\z)

Find and replace line not starting with pattern and copy text from previous line

============================================ Output![alt text](image url)

Useful References

============================================
Output![alt text](image url)