Is it possible...?

guy038

Hello, @dipsi7772, @peterjones, @terry-r and All,

First, Peter said :

@guy038 is going to have to jump in to do it in one fell swoop. Barring that (he may be busy, or away, or not interested in this one)

Indeed, I was away for a long week-end !

Secondly, @dipsi7772 said :

This is a original line of the file:

/><br />password<br />target</div>

So I assume that the different targets are the characters between the nearest > and < symbols, after the word password, with this exact case, followed by a few characters. In this case, here is a destructive method, which needs to be run only on a copy of your bunch of .html files

To sum up :

Copy the directory, containing all your .html files, to another location
Start Notepad++
Open the Find in Files dialog ( Ctrl + Shift + F )
SEARCH (?s-i).*?password.+?>(.+?)(?=<)|.+
REPLACE \1\r\n ( or \1\n if your files use Unix EOL syntax )
FILTERS *.html
DIRECTORY The absolute location of the COPY of all your .html files ( Do NOT use your original files )
Tick the Wrap around button
Select the Regular expression search mode
Click on the Replace in Files button
Click on the Yes button of the small dialog Are you sure?

=> Each copy of an original .html file should have shrunk drastically and simply contain a list of all the passwords, one per line, found in the original file ;-))
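
For anyone who would rather script these steps than click through the dialog, here is a minimal Python sketch of the same idea. It is not what Notepad++ does internally. The names reduce_text and reduce_copies are made up for this illustration, the files are assumed to be UTF-8, sub-folders are scanned too, and the (?s-i) modifiers are translated to re.DOTALL plus Python's default case sensitivity.

```python
import re
import shutil
from pathlib import Path

# Python translation of (?s-i).*?password.+?>(.+?)(?=<)|.+
# re.DOTALL plays the role of (?s); Python is case-sensitive by default.
PATTERN = re.compile(r".*?password.+?>(.+?)(?=<)|.+", re.DOTALL)


def reduce_text(text, eol="\r\n"):
    """Keep only group 1 of each match, followed by a line-break."""
    # group(1) is None when the second alternative (.+) matched.
    return PATTERN.sub(lambda m: (m.group(1) or "") + eol, text)


def reduce_copies(original_dir, copy_dir, eol="\r\n"):
    """Copy original_dir to copy_dir, then reduce every .html file of the COPY."""
    shutil.copytree(original_dir, copy_dir)  # raises if copy_dir already exists
    for path in Path(copy_dir).rglob("*.html"):
        # newline="" keeps the line endings exactly as stored (UTF-8 assumed)
        with open(path, encoding="utf-8", newline="") as handle:
            text = handle.read()
        with open(path, "w", encoding="utf-8", newline="") as handle:
            handle.write(reduce_text(text, eol))
```

Because copytree refuses to overwrite an existing directory, the originals are never the files that get rewritten, which mirrors the "work on a COPY" advice.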

Notes :

As usual, the (?s-i) in-line modifiers mean that :
- The dot . character represents any single character, even EOL chars ( (?s) )
- The search is processed in a case-sensitive way ( (?-i) )
Then, the part .*?password.+?> looks, from the beginning of each file, for the shortest range of any characters, till the string password, with this exact case, followed by some characters till the nearest > symbol
Now, the (.+?)(?=<) part stores, as group 1, all the subsequent characters till the condition contained in the look-around structure is true, i.e. till the nearest < symbol, if found
In the replacement, the “value” of each password, \1, is simply rewritten, followed by new-line characters ( \r\n or \n )
At the end, if the word password, with this exact case, cannot be found, the second alternative .+, after the alternation symbol |, selects all the remaining characters of the file currently scanned and deletes them, because group 1 is not defined ( only the line-break of the replacement is written )

Remarks :

Notice that all the quantifiers of the first alternative of the search regex are lazy quantifiers, i.e. they grab as few chars as possible, while still satisfying the subsequent parts of the overall regex
When an .html file does not contain the word password at all, all its contents are just replaced with a single line-break !
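
The two remarks above can be checked with a small Python sketch. It assumes that Python's re module behaves like the Notepad++ engine for this pattern; list_targets and the sample strings are invented for the demonstration, and GREEDY is the same regex with its first quantifier made greedy, shown only for contrast.

```python
import re

LAZY = re.compile(r".*?password.+?>(.+?)(?=<)|.+", re.DOTALL)
GREEDY = re.compile(r".*password.+?>(.+?)(?=<)|.+", re.DOTALL)  # first .*? made greedy


def list_targets(pattern, text):
    """Return group 1 of every match where the first alternative matched."""
    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]


text = "password: >one< password: >two<"
print(list_targets(LAZY, text))    # ['one', 'two']
print(list_targets(GREEDY, text))  # ['two']  (the greedy .* skips the first password)

# A text without the word password: the second alternative .+ takes everything,
# group 1 is undefined, so only the line-break of the replacement remains.
print(repr(LAZY.sub(lambda m: (m.group(1) or "") + "\r\n", "no secrets here\n")))  # '\r\n'
```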

Best Regards,

guy038

dipsi7772

Dear @Terry-R dear @guy038, dear @PeterJones
thank you so much for your support. All of you seem to be very smart people, if I read through your thoughts and realize how difficult it is to get the target out of the big text.

As I don’t want to give you guys a headache, I have now checked several different files of the big bunch to find a “rule” or a repeated “sign” which is situated around the target.

I found that there is always a </div> directly next to the target. But unfortunately there are also other lines with </div>, but without the target.

So in the end, the only fixed marker is the word “pass” or the word “password” before the target. Unfortunately I have to inform you about a further issue:
The word “password” is not always directly before the target … sometimes there are also letters, numbers, signs or blank spaces between them.
Regarding your question: The target can contain every sign, letter, special sign … everything you can imagine.

So from my unprofessional point of view:

It’s only possible to extract the lines which contain the strings “pass” or “password” and “</div>”.
As I have almost 3000 HTML files to search, just extracting these lines would also help me a lot.

Should I try your proposals now or have you any remarks?
I googled but still not sure about the meaning of Unix EOL

Search field should look like this?

Not sure about the search options you mentioned.
Thanks for all guys!!!

andrecool-68

@dipsi7772 You have the simple Find tab open; you need one of the other tabs of the search window instead: Replace or Find in Files

guy038

Hi, @dipsi7772, @peterjones, @terry-r, @andrecool-68 and All,

@dipsi7772, many thanks for trying to find out some general rules in order to isolate your different target strings more easily !

You said :

1 I found that there is always a </div> directly next to the target. But unfortunately there are also other lines with </div>, but without the target.

2 So in the end, the only fixed marker is the word “pass” or the word “password” before the target. Unfortunately I have to inform you about a further issue:

3 The word “password” is not always directly before the target … sometimes there are also letters, numbers, signs or blank spaces between them.

4 Regarding your question: The target can contain every sign, letter, special sign … everything you can imagine.

If so, I think that the new regex S/R , below, should meet these 4 criteria !

SEARCH (?s).*?(?-si:pass(word)?.+?>(.+?)(?=</div>))|.+

REPLACE \2\r\n ( or \2\n if your files use Unix EOL syntax )

The look-around (?=</div>), together with the (?-si:pass(word)?...) syntax, should satisfy your first criterion
The part (?-si:pass(word)?...), which matches the word pass or password, with this exact case, satisfies your second criterion
Then, the part .+?>, which matches the shortest range of standard ( non-EOL ) chars, till a > symbol, satisfies your third criterion
Finally, the part (.+?), which stores, as group 2 ( group 1 being the optional (word) part ), any range of standard characters, due to the in-line modifiers (?-si:....), till the nearest string </div> excluded ( located right after all the target characters ), satisfies your fourth criterion

As @andrecool-68 said, use the Find in Files dialog ( Ctrl + Shift + F )

Cheers,

guy038

P.S. :

So, except for the SEARCH and REPLACE updated zones, just follow the instructions given in my first post !

dipsi7772

Im soooo happy =) this works … Thank you all!! This made my day :)
You are great !! Thanks so much=)

guy038

Hi, @dipsi7772, @peterjones, @terry-r, @andrecool-68 and All,

@dipsi7772, you said :

I googled but still not sure about the meaning of Unix EOL

I missed that sentence. Refer to the link, below, for general information about new-line definition :

https://en.wikipedia.org/wiki/Newline#Representation
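
In practice, the question is simply which replacement to use: \2\r\n for files with Windows line endings, or \2\n for files with Unix line endings. As a rough illustration, the Python sketch below counts the line-break sequences found in the raw bytes of a file. The function name detect_eol is hypothetical, and falling back to \n when no line break is found is an arbitrary choice.

```python
def detect_eol(data: bytes) -> str:
    """Guess the dominant line ending of raw file contents."""
    crlf = data.count(b"\r\n")       # Windows: CR followed by LF
    lf = data.count(b"\n") - crlf    # Unix: LF not preceded by CR
    cr = data.count(b"\r") - crlf    # old Mac: CR not followed by LF
    counts = {"\r\n": crlf, "\n": lf, "\r": cr}
    best = max(counts, key=counts.get)
    # Arbitrary fallback when the data contains no line break at all
    return best if counts[best] else "\n"


print(repr(detect_eol(b"one\r\ntwo\r\n")))  # '\r\n'
print(repr(detect_eol(b"one\ntwo\n")))      # '\n'
```

In Notepad++ itself, the status bar shows the same information for the current file.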

To be rigorous, I made a slight error in my last proposed search regex ! In order to extract the target from lines of this form :

/><br />password>target</div> OR /><br />pass>target</div> ( when NO character exists between the string pass(word) and >target< )

I should have used the following search regex, with a *, instead of a +, at the indicated place !

SEARCH (?s).*?(?-si:pass(word)?.*?>(.+?)(?=</div>))|.+
                                ▲
                                │
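
The difference between the two versions can be reproduced with a short Python sketch, assuming that Python's re module treats these in-line modifiers like the Notepad++ engine does. The helper name targets is made up for the demonstration.

```python
import re

# Previous regex ( .+? ) and corrected regex ( .*? ), copied from the thread
PLUS = re.compile(r"(?s).*?(?-si:pass(word)?.+?>(.+?)(?=</div>))|.+")
STAR = re.compile(r"(?s).*?(?-si:pass(word)?.*?>(.+?)(?=</div>))|.+")


def targets(pattern, text):
    """Return group 2 (the target) of every match where it is defined."""
    return [m.group(2) for m in pattern.finditer(text) if m.group(2) is not None]


# NO character between pass and > : only the corrected regex finds the target
print(targets(PLUS, "/><br />pass>target</div>"))  # []
print(targets(STAR, "/><br />pass>target</div>"))  # ['target']

# Some characters between password and > : both versions find the target
print(targets(PLUS, "/><br />password<br />target</div>"))  # ['target']
print(targets(STAR, "/><br />password<br />target</div>"))  # ['target']
```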

Best Regards,

guy038

Alan Kilborn

@guy038 said in Is it possible...?:

https://en.wikipedia.org/wiki/Newline#Representation

OT to the main thread here, but it is interesting from that article that the Mac line-endings (carriage-return only) that Notepad++ still supports are for an OS whose last release was before the first release of N++ (if I have my dates straight).

dipsi7772

Dear guy038,

thanks for the clarification. Your proposal works well. Anyway, it’s still a job to manually copy the expressions out of the sentences :) but it’s way better than without the code.
The strange thing is that, on some files, just <!doctype html> is the result.
I would say all files have the same format, so it’s mysterious.
Anyway, I wish all community members a nice Xmas, and thanks again =)

dipsi7772

Is it maybe possible to implement a rule that avoids duplicate results?

Another thing: even if I choose “Automatischer Zeilenumbruch” ( Word Wrap ), each line is written on ONE line, not continued on a second one, which would avoid the vertical scrolling.

Best Regards Friends

guy038

Hi, @dipsi7772 and All,

You said :

The strange thing is that, on some files, just <!doctype html> is the result.
I would say all files have the same format, so it’s mysterious.

It’s not strange and it’s not related to the file format at all ! The reason is that, for big files, it may happen that the regex does not work properly and deletes all characters but the first line of your HTML files. Indeed, as the regex is :

(?s).*?(?-si:pass(word)?.*?>(.+?)(?=</div>))|.+

The beginning (?s).*?(?-si:pass(word)? means that the regex engine selects all characters, even across several lines, from the current position of the caret till the first word pass or password. In some files, this range of characters can be very large, and this fact could explain the unexpected results !

If your HTML files are neither important nor confidential, simply e-mail me one of the files which produce errors. I’ll try to find another regex which works correctly, in all cases ;-))
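
In the meantime, and outside Notepad++, one possible way to avoid these very long ranges is to scan each line separately. This is only a Python sketch, not the alternative regex mentioned above. It relies on the observation that the (?-si:...) part of the regex cannot cross a line-break, so a pass(word) marker and its target always sit on the same line. The names LINE_PATTERN and targets_per_line are invented for the example.

```python
import re

# Only the line-local part of the regex, without the leading (?s).*? range
LINE_PATTERN = re.compile(r"pass(?:word)?.*?>(.+?)(?=</div>)")


def targets_per_line(text):
    """Collect every target, scanning one line at a time."""
    found = []
    for line in text.splitlines():
        for match in LINE_PATTERN.finditer(line):
            found.append(match.group(1))
    return found


# Many filler lines before the first password are harmless here,
# because no single match ever has to span them
big_text = "<p>filler</p>\n" * 50000 + "/><br />password<br />target</div>\n"
print(targets_per_line(big_text))  # ['target']
```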

Next, you said :

Is it maybe possible to implement a rule that avoids duplicate results?

My question is : In the copied HTML files, which contain the passwords ( 1 per line ), what is the maximum length of these files ?

Depending on this length, a regex solution may be possible… However, if you don’t mind changing the initial order of these passwords, just use, for each copied HTML file, the two menu options, below :

Edit > Line Operations > Sort Lines Lexicographically Ascending
Edit > Line Operations > Remove Consecutive Duplicate Lines
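
The sorting step is needed because the second option removes a duplicate only when it directly follows its twin. The Python sketch below illustrates that logic, and also a variant which keeps the initial order. Both function names are made up, and Python's sorted may not order every character exactly as the Notepad++ option does.

```python
def sort_and_dedupe(lines):
    """Sort, then drop each line equal to the one just before it."""
    result = []
    for line in sorted(lines):
        if not result or result[-1] != line:
            result.append(line)
    return result


def dedupe_keep_order(lines):
    """Drop later duplicates while keeping the first occurrence in place."""
    return list(dict.fromkeys(lines))  # dict keys keep their insertion order


passwords = ["zebra", "apple", "zebra", "apple"]
print(sort_and_dedupe(passwords))    # ['apple', 'zebra']
print(dedupe_keep_order(passwords))  # ['zebra', 'apple']
```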

Finally, you said :

Another thing: even if I choose “Automatischer Zeilenumbruch” ( Word Wrap ), each line is written on ONE line, not continued on a second one, which would avoid the vertical scrolling.

I’m sorry, but I cannot guess what you mean :-(( Provided that you use the Replace regex appropriate to the line-ending characters of your files, discussed previously :

\2\r\n for Windows files

OR

\2\n for Unix files

The View > Word Wrap option should work correctly !?

Best Regards,

guy038