Notepad++ String Search

dario33

I don’t think so, because the content of each text string in the text is different every time. There are several text strings in the text that I have to get (about 4). Marking by hand works, and that is what I do right now, but it is annoying if there is an easier way.

guy038

Hello Dario33,

So, seemingly, you would like to get all text, between the two boundaries "ais":" and "},{"drg2" and delete any character, which lies outside these boundaries, as for instance :

Text to delete"ais":"Text 1 to keep"},{"drg2"Other text to delete"ais":"Text 2 to keep"},{"drg2"Again, a text to delete"ais":"Text 3 to keep"},{"drg2"Final text to delete

And afterwards, I suppose that you would like to get ONLY the list of all the parts of text to keep, on several lines, as below :

Text 1 to keep
Text 2 to keep
Text 3 to keep
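
For readers who prefer to script this, here is a minimal Python sketch of the expected result above. It is not part of the Notepad++ workflow discussed in this thread: it assumes Python's built-in re module (a different engine from the Boost one Notepad++ uses), and the function name extract_between is hypothetical. The sample string is the example line quoted above.

```python
import re

# Example line quoted in the post, with the boundaries "ais":" and "},{"drg2"
sample = (
    'Text to delete"ais":"Text 1 to keep"},{"drg2"Other text to delete'
    '"ais":"Text 2 to keep"},{"drg2"Again, a text to delete'
    '"ais":"Text 3 to keep"},{"drg2"Final text to delete'
)

def extract_between(text: str) -> list[str]:
    # Capture, lazily, whatever lies between the opening and the ending boundary.
    # The braces are escaped so that they are read as literal characters.
    return re.findall(r'"ais":"(.*?)"\},\{"drg2"', text)

kept = extract_between(sample)
print("\n".join(kept))
```

The capture group keeps only the inner text, so no second clean-up pass is needed in this variant.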

If so, I think that a search/replacement, with regular expressions, should do the job ! But I’d better verify my assertion with a real text. So, if you don’t mind and, of course, if your file, or part of it, is not confidential, you could send me a quick e-mail, with some example text, as an attached file, so that I’ll be able to test some regexes against your encrypted text :-)

My e-mail address is

See you later,

Best Regards,

guy038

dario33

That’s it, guy038, many thanks for the explanations! I can’t describe it as well as you do! The text isn’t encrypted and is no special secret either, but it is too large to post here, so I uploaded the text file here:

http://www.filedropper.com/sampletext

TIA!

guy038

Hi Dario33,

First, I downloaded your file sampletext.txt without any problem : I get a one-line text of exactly 32376 bytes, without any End of Line character.
Secondly, I tried to identify your two boundaries "ais":" and "},{"drg2" : I did find 12 occurrences of the boundary "ais":". Unfortunately, there is NOT a single occurrence of the boundary "},{"drg2" in your file. Nevertheless, I succeeded in identifying exactly 12 occurrences of the boundary "drg2":". So I assume that "drg2":" is the right ending boundary !!

So, just follow the few steps, below :

Open your sampletext.txt, in Notepad++
Go back to the very beginning of your file ( IMPORTANT )
Open the Replace dialog ( CTRL + H )
Set the Regular expression search mode
Type in the two zones :

Find what : "drg2":".*?"ais":"

Replace with : \r\n
Click on the Replace All button

This FIRST S/R deletes any text between an ending boundary and the closest next opening boundary, as well as the boundaries themselves, and replaces it with the two Windows End of Line characters \r\n.

Note : If you currently use Unix files, just use the End of Line character \n !

Now, change the Find what and Replace with zones into :

Find what : ^.*?"ais":"|"drg2":".*?$

Replace with : Leave EMPTY
Click, again, on the Replace All button ( The cursor location shouldn’t have changed, after the first S/R )

This SECOND S/R deletes any text between the very beginning of this long line and the remaining opening boundary "ais":" OR between the remaining ending boundary "drg2":" and the very end of this long line.

Et voilà ! You should obtain 12 lines that represent, ONLY, the text that was present, in your original text, between the two boundaries :-) Quite logical, as there were 12 opening boundaries and 12 ending boundaries !
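
The two S/R steps above can also be sketched in Python, for anyone who wants to check the logic outside Notepad++. This is only an illustration under stated assumptions: it uses Python's re module rather than the Boost engine of Notepad++, the sample string is a hypothetical one-line text (not the real sampletext.txt), the function name two_step_extract is made up, and \n is used as the End of Line character.

```python
import re

# Hypothetical one-line sample, using the "ais":" and "drg2":" boundaries
sample = (
    'junk"ais":"Text 1 to keep"drg2":"more junk'
    '"ais":"Text 2 to keep"drg2":"again junk'
    '"ais":"Text 3 to keep"drg2":"final junk'
)

def two_step_extract(text: str, eol: str = "\n") -> str:
    # FIRST S/R: from an ending boundary up to the next opening boundary -> EOL
    step1 = re.sub(r'"drg2":".*?"ais":"', lambda m: eol, text)
    # SECOND S/R: remove the leading part (up to the first opening boundary)
    # and the trailing part (from the last ending boundary to the end of line)
    return re.sub(r'^.*?"ais":"|"drg2":".*?$', "", step1, flags=re.MULTILINE)

result = two_step_extract(sample)
print(result)
```

With three opening and three ending boundaries in the sample, the result holds three lines, in the same way that the 12 + 12 boundaries of the real file give 12 lines.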

Notes :

It’s not easy to gather these two regexes in a bigger one, because of the special behaviour of the ^ assertion, that means Beginning of Line. Indeed, as long as you just perform a search, there’s no trouble : the beginning of each line does NOT change. However, when successive replacements are performed, the beginning of each line is updated, each time and is NOT the same as it was, before all these S/R. Thus, it’s safer to split the job into two S/R :-))
Dario33, I did a test, copying all your text twice, getting a two-line text. After the two consecutive S/R, I got, as expected, 24 lines :-)

Cheers,

guy038

dario33

Hi guy038, THANK YOU SO MUCH! You’re a life saver :) This code is really awesome!!

I still have one question, please: is it possible to put a fixed string, like ‘Sentence’ or so, before each extracted line? TIA!

guy038

Hi Dario33,

No problem ! Just insert, in the replacement part of the first S/R, after the End of Line characters \r\n, the string "Sentence : ", like below :

Find what      :     "drg2":".*?"ais":"

Replace with   :     \r\nSentence :                 ( with a SPACE after the COLON )

Cheers,

guy038

dario33

Hi guy038, it works perfectly, thank you so much again!! As I understand it, one can extract the wanted text by giving its delimiters.

But what about the case where I only know that the first zone is "ais" and that the second zone is one of several, like "drg2", "aim" or "bbk" for example ?
Can I separate them like the following, to get all of them: "drg2":".*?"ais":" plus "aim" plus "bbk" (that is, one command which, besides drg2, also separates on aim and bbk) ? Sorry if this question is too silly ;)

guy038

Hello Dario33,

Right now, it’s about 12:30 p.m. in France and, as soon as I woke up at about 10:30, after a good night ( it’s the week-end anyway ! ), I understood that the regex, given in my previous post, was NOT exact :-((

So, the FIRST S/R is, as I said :

Find what    :       "drg2":".*?"ais":"

Replace with :       \r\nSentence :                 ( with a SPACE after the COLON )

but, the SECOND S/R should be :

Find what    :       (^.*?"ais":")|"drg2":".*?$

Replace with :       (?1Sentence \: )               ( with a SPACE, before and after the TWO characters \: )

Notes :

The boundaries "ais":" and "drg2":" are only literal strings, to be matched
The dot stands for any character, different from the End of Line characters \r and \n, as well as the Form Feed character \f
The star means any repetition, even 0, of the previous character ( the dot )
To understand the role of the question mark, after the star, I think that a short example will be better than a long speech !

Suppose the given subject string : “This is an example for the meaning of the question mark symbol”. Then :

The regex a.*r would match the string “an example for the meaning of the question mar”
The regex a.*?r would match the string “an example for”

So, in our example :

The form .* means the longest range of characters, even empty, between the letters a and r
The form .*? means the shortest range of characters, even empty, between the letters a and r
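
The greedy versus lazy example above can be reproduced with a few lines of Python. This is just a sketch using Python's re module (an assumption, as Notepad++ relies on the Boost engine), but the two quantifiers behave the same way on this subject string.

```python
import re

subject = "This is an example for the meaning of the question mark symbol"

# Greedy: .* grabs as much as possible, then backtracks to the LAST letter r
greedy = re.search(r"a.*r", subject).group()

# Lazy: .*? grabs as little as possible, stopping at the FIRST letter r
lazy = re.search(r"a.*?r", subject).group()

print(greedy)
print(lazy)
```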

In the second S/R, the ^ assertion means beginning of line and the $ assertion means End of line
The | symbol represents an alternation between two regexes. As it has the lowest priority level, you’ll need, sometimes, to enclose the two parts of an alternation, inside round brackets. For instance :
- The regex abc(123|789)xyz would match the strings abc123xyz OR abc789xyz
- The regex abc123|789xyz would match the strings abc123 OR 789xyz
The round brackets generate a group that contains the regex ^.*?"ais":" ( i.e. all the text between the beginning of a line and the first "ais":" boundary )
Finally, in the replacement part, the syntax (?1Sentence \: ) represents a conditional replacement. Its general form is (?nText if TRUE:Text if FALSE), with n as a digit, that means :
- If the group n EXISTS ( TRUE ), the text Text if TRUE is re-written
- If the group n does NOT exist ( FALSE ), the text Text if FALSE is re-written
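
The note on the priority of the | symbol, a few lines above, can be checked with the short Python sketch below. It assumes Python's re module; fullmatch is used so that each test string must be matched entirely by the regex.

```python
import re

grouped = re.compile(r"abc(123|789)xyz")   # alternation limited to the group
ungrouped = re.compile(r"abc123|789xyz")   # alternation splits the WHOLE regex

# With round brackets : abc, then 123 OR 789, then xyz
grouped_hits = [s for s in ("abc123xyz", "abc789xyz", "abc123", "789xyz")
                if grouped.fullmatch(s)]

# Without round brackets : abc123 OR 789xyz
ungrouped_hits = [s for s in ("abc123xyz", "abc789xyz", "abc123", "789xyz")
                  if ungrouped.fullmatch(s)]

print(grouped_hits)
print(ungrouped_hits)
```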

So, in our example, the string Sentence : is added if the regex ^.*?"ais":" is matched and nothing is written when the regex "drg2":".*?$ is matched

Note that the colon is escaped with the \ symbol, because it has a special meaning in the replacement part
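
The conditional replacement (?1Sentence \: ) is a feature of the Notepad++ regex engine. Python's re module has no such replacement syntax, so the sketch below only emulates the idea with a replacement function: it tests whether group 1 took part in the match. The input string is a hypothetical text, as it would look after the FIRST S/R, with \n as End of Line; the names SECOND_SR, conditional_replacement and apply_second_sr are made up for the illustration.

```python
import re

# SECOND S/R regex from the post; group 1 is the leading part ^.*?"ais":"
SECOND_SR = re.compile(r'(^.*?"ais":")|"drg2":".*?$', re.MULTILINE)

def conditional_replacement(match):
    # Group 1 EXISTS -> write "Sentence : " ; otherwise -> write nothing
    return "Sentence : " if match.group(1) is not None else ""

def apply_second_sr(text: str) -> str:
    return SECOND_SR.sub(conditional_replacement, text)

# Hypothetical text, as left by the FIRST S/R
after_first_sr = (
    'junk"ais":"Text 1 to keep\n'
    'Sentence : Text 2 to keep\n'
    'Sentence : Text 3 to keep"drg2":"final junk'
)

result = apply_second_sr(after_first_sr)
print(result)
```

Only the first line gets the prefix from this second pass, as the other lines already received it from the FIRST S/R.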

As for your question, about several boundaries :

If I use the syntax Bn for a beginning boundary n, En for an ending boundary n, B0 for a single opening boundary and E0 for a single ending boundary, which kind of organization do you have ?

......B1.......E1..................B2.........E2.........B1.........E1...........B3.......E3..........

or perhaps :

................B0.........E1...............B0............E2.........B0.................E3............

or, maybe :

........B1.............E0........B2......................E0..................B3...........E0..........

Of course, if the following case, below, would happen :

........B1...........B2.................E1...........E2................

There’s an ambiguity and the regex would, probably, consider the range B1 - E1, and NOT the range B2 - E2 !!
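
To illustrate this overlapping case, here is a small Python sketch. It is only an assumption about how such a regex could look: the pattern (B1|B2).*?(E1|E2) is hypothetical, the literal strings B1, B2, E1 and E2 stand for real boundaries, and Python's re module is used instead of Notepad++.

```python
import re

text = "........B1...........B2.................E1...........E2................"

# Hypothetical regex : any opening boundary, the shortest text, any ending boundary
match = re.search(r"(B1|B2).*?(E1|E2)", text)

opening, ending = match.group(1), match.group(2)
print(opening, ending)

# The search starts at the leftmost opening boundary, so B2 is swallowed
all_ranges = [m.group() for m in re.finditer(r"(B1|B2).*?(E1|E2)", text)]
print(len(all_ranges))
```

With this pattern, the range B1 - E1 is the only one found : the B2 boundary lies inside it and the remaining E2 has no opening boundary left.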

So just tell me all the strings used as opening and ending boundaries and their different locations, in the simple way shown above, and I’ll try to build the right regexes :-)

Cheers,

guy038

dario33

Thank you so much for the fast clarification - I will try it out later. You really rock with this helpful code!!

dario33

Hi guy038, this all works fine for me - thanks so much again! As for the several-boundaries issue, I have to think it over, as my explanation of it was wrong! I think this question is solved!