Is it possible...?
-
Hello, @dipsi7772, @peterjones, @terry-r and All,
First, Peter said :
@guy038 is going to have to jump in to do it in one fell swoop. Barring that (he may be busy, or away, or not interested in this one)
Indeed, I was away for a long week-end !
Secondly, @dipsi7772 said :
This is an original line of the file:
/><br />password<br />target</div>
So I assume that the different targets are the characters between the nearest > and < symbols, after the word password, with this exact case, followed by a few characters. In this case, here is a destructive method, which must be run only on a copy of your bunch of .html files.
To sum up :
- Copy the directory containing all your .html files to another location
- Start Notepad++
- Open the Find in Files dialog ( Ctrl + Shift + F )
- SEARCH (?s-i).*?password.+?>(.+?)(?=<)|.+
- REPLACE \1\r\n ( or \1\n if your files use Unix EOL syntax )
- FILTERS *.html
- DIRECTORY The absolute location of the COPY of all your .html files ( Do NOT use your original files )
- Tick the Wrap around button
- Select the Regular expression search mode
- Click on the Replace in Files button
- Click on the Yes button of the small dialog Are you sure?
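The steps above can also be sketched outside Notepad++. This is only an approximation in Python, not the method used in this thread: Python's re module is a different engine from the Boost one used by Notepad++, and it does not accept the global (?s-i) form, so (?s) alone is used (Python matches case-sensitively by default). The function name extract_in_copy and the UTF-8 encoding are assumptions.

```python
import re
import shutil
from pathlib import Path

# Assumption: "(?s)" alone stands in for "(?s-i)", because Python's re
# is case-sensitive unless told otherwise.
SEARCH = re.compile(r"(?s).*?password.+?>(.+?)(?=<)|.+")
REPLACE = r"\1\n"  # use r"\1\r\n" for files with Windows line endings


def extract_in_copy(original_dir, copy_dir):
    """Copy original_dir to copy_dir, then rewrite every .html file of the COPY."""
    shutil.copytree(original_dir, copy_dir)  # raises if copy_dir already exists
    for path in Path(copy_dir).rglob("*.html"):
        # newline="" keeps the line endings exactly as they are on disk
        with open(path, encoding="utf-8", errors="replace", newline="") as f:
            text = f.read()
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(SEARCH.sub(REPLACE, text))
```

As in the Notepad++ procedure, the original directory is never modified: only the files under copy_dir are rewritten.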
=> Each copy of an original .html file should have been drastically reduced and simply contains a list of all the passwords, one per line, contained in the original file ;-))
Notes :
- As usual, the (?s-i) in-line modifiers mean that :
  - The dot . character represents any single character, even EOL chars ( (?s) )
  - The search is processed in a case-sensitive way ( (?-i) )
- Then, the part .*?password.+?> looks, from the beginning of each file, for the shortest range of any characters, till the string password, with this exact case, followed by some characters till the nearest > symbol
- Now, the (.+?)(?=<) part stores, as group 1, all the subsequent characters till the condition contained in the look-around structure is true, i.e. till the nearest < symbol, if found
- In replacement, the “value” of each password \1 is simply rewritten, followed by new-line characters ( \r\n or \n )
- At the end, if the word password, with this exact case, cannot be found, the second alternative .+, after the alternation symbol |, selects all the remaining characters of the currently scanned file and deletes them, because group 1 is not defined
Remarks :
- Notice that all quantifiers of the first alternative of the search regex are lazy quantifiers, i.e. each one grabs as few chars as possible, while still satisfying the subsequent parts of the overall regex
- When an .html file does not contain the word password at all, its whole contents are simply replaced with a single line-break !
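The behaviour described in the notes and remarks can be checked on small strings. The sketch below uses Python's re module as a stand-in for the Notepad++ engine, which is an assumption: the two engines are different, and (?s) replaces (?s-i) because Python is case-sensitive by default. The helper name extract is hypothetical.

```python
import re

# Assumption: Python's re approximates the Notepad++ (Boost) behaviour here.
SEARCH = re.compile(r"(?s).*?password.+?>(.+?)(?=<)|.+")


def extract(text, eol="\n"):
    # "\1" expands to an empty string when group 1 took no part in the match,
    # which is what happens when the second alternative ".+" is used.
    return SEARCH.sub(r"\1" + eol, text)


page = (
    "<html>\n"
    "<div>x<br />password<br />secret1</div>\n"
    "<p>other</p>\n"
    "<div>y<br />password<br />secret2</div>\n"
    "</html>\n"
)
print(repr(extract(page)))                            # 'secret1\nsecret2\n\n'
print(repr(extract("<html><p>nothing</p></html>\n")))  # '\n'
```

In this sketch the text left after the last password is also consumed by the .+ alternative, so the output ends with one extra line break.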
Best Regards,
guy038
-
-
Dear @Terry-R, dear @guy038, dear @PeterJones,
thank you so much for your support. All of you seem to be very smart people, when I read through your thoughts and realize how difficult it is to get the target out of the big text.
As I don't want to give you guys a headache, I have now checked several different files of the big bunch to find a “rule” or a repeated “sign” which is situated around the target.
I found that there is always a </div> directly next to the target. But unfortunately there are also other lines with </div>, but without the target.
So in the end, the only fixed mark is the word “pass” or the word “password” before the target. Unfortunately I have to report a further issue:
The word “password” is not always directly before the target … sometimes there are also letters, numbers, signs or spaces between them.
Regarding your question: The target can contain every sign, letter, special sign … everything you can imagine.
So from my unprofessional point of view:
it is only possible to extract the lines which contain the strings “pass”, “password” and “</div>”.
As I have almost 3000 html files to search, just extracting the lines would also help me a lot.
Should I try your proposals now or do you have any remarks?
I googled but am still not sure about the meaning of Unix EOL.
Should the search field look like this?
Not sure about the search options you mentioned.
Thanks for everything, guys!!!
-
@dipsi7772 You have the simple Find tab open; you need one of the other tabs of the search window: Replace, or Find in Files.
-
Hi, @dipsi7772, @peterjones, @terry-r, @andrecool-68 and All,
@dipsi7772, many thanks for trying to find out some general rules in order to isolate your different target strings more easily !
You said :
1 I found that there is always a </div> directly next to the target. But unfortunately there are also other lines with </div>, but without the target.
2 So in the end, the only fixed mark is the word “pass” or the word “password” before the target. Unfortunately I have to report a further issue:
3 The word “password” is not always directly before the target … sometimes there are also letters, numbers, signs or spaces between them.
4 Regarding your question: The target can contain every sign, letter, special sign … everything you can imagine.
If so, I think that the new regex S/R, below, should meet these 4 criteria !
SEARCH (?s).*?(?-si:pass(word)?.+?>(.+?)(?=</div>))|.+
REPLACE \2\r\n ( or \2\n if your files use Unix EOL syntax )
- The look-around (?=</div>), as well as the (?-si:pass(word)?...) syntax, should satisfy your first criterion
- The part (?-si:pass(word)?...), which matches the word pass or password, with this exact case, satisfies your second criterion
- Then, the part .+?>, which matches the shortest range of standard chars, till a > symbol, satisfies your third criterion
- Finally, the part (.+?), which stores, as group 2, any range of standard characters, due to the in-line modifiers (?-si:....), till the nearest string </div> excluded ( located right after all the target characters ), satisfies your fourth criterion
As @andrecool-68 said, use the Find in Files dialog ( Ctrl + Shift + F )
Cheers,
guy038
P.S. :
So, except for the SEARCH and REPLACE updated zones, just follow the instructions given in my first post !
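To see why the replacement uses \2 rather than \1 in this second regex, here is a small Python check. It is a sketch under the assumption that Python's re module behaves like the Notepad++ engine for this pattern (scoped modifiers such as (?-si:...) need Python 3.6 or later); the sample text is invented for illustration.

```python
import re

# Group 1 is the optional (word); group 2 is the target (.+?).
SEARCH = re.compile(r"(?s).*?(?-si:pass(word)?.+?>(.+?)(?=</div>))|.+")

sample = (
    "<div>a</div>\n"
    "/><br />password: x<br />s3cr3t</div>\n"
    "<div>b</div>\n"
    "/><br />pass<br />t0p<s</div>\n"
)

first = SEARCH.search(sample)
print(first.group(1), first.group(2))  # word s3cr3t

result = SEARCH.sub(r"\2\n", sample)
print(repr(result))  # 's3cr3t\nt0p<s\n\n'
```

The second target contains a < character, which the earlier (?=<) look-ahead would have cut short; the (?=</div>) look-ahead keeps it whole.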
-
-
I'm soooo happy =) This works … Thank you all!! This made my day :)
You are great !! Thanks so much =)
-
Hi, @dipsi7772, @peterjones, @terry-r, @andrecool-68 and All,
@dipsi7772, you said :
I googled but still not sure about the meaning of Unix EOL
I missed that sentence. Refer to the link, below, for general information about new-line definition :
https://en.wikipedia.org/wiki/Newline#Representation
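To find out which replacement to use ( \r\n or \n ), one can count the line endings of a file. A minimal Python sketch, where the function name detect_eol and the returned labels are my own choices:

```python
def detect_eol(data):
    """Return the dominant line-ending style of a bytes object, or None."""
    crlf = data.count(b"\r\n")
    counts = {
        "Windows (CR LF)": crlf,
        "Unix (LF)": data.count(b"\n") - crlf,       # LF not preceded by CR
        "Macintosh (CR)": data.count(b"\r") - crlf,  # CR not followed by LF
    }
    name = max(counts, key=counts.get)
    return name if counts[name] else None


print(detect_eol(b"line 1\r\nline 2\r\n"))  # Windows (CR LF)
print(detect_eol(b"line 1\nline 2\n"))      # Unix (LF)
```

For a real file, read it in binary mode first, for instance with open(path, "rb").read(), so that Python does not translate the line endings.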
To be rigorous, I’ve made a slight error in my last proposed search regex ! In order to extract the target from lines of this form :
/><br />password>target</div> OR /><br />pass>target</div>
( when NO character exists between the string pass(word) and >target< ), I should have used the following search regex, with a * instead of a +, in the .*?> part located right after pass(word)? :
SEARCH (?s).*?(?-si:pass(word)?.*?>(.+?)(?=</div>))|.+
Best Regards,
guy038
-
@guy038 said in Is it possible...?:
OT to the main thread here, but it is interesting from that article that the Mac line-endings (carriage-return only) that Notepad++ still supports are for an OS whose last release was before the first release of N++ (if I have my dates straight).
-
Dear guy038,
thanks for the clarification. Your proposal works well. Anyway, it's still a job to manually copy the expressions out of the sentences :) but it's way better than without the code.
The strange thing is that, on some files, just <!doctype html> is the result.
I would say all files are the same format, so it's mysterious.
Anyway, I wish all community members a nice Xmas, and thanks again =)
-
Is it maybe possible to implement a rule that avoids duplicate results?
Another thing: even if I choose “Automatischer Zeilenumbruch”, everything is written on ONE line, not continued on a second one, which would avoid the vertical scrolling.
Best Regards Friends
-
Hi, @dipsi7772 and All,
You said :
The strange thing is that, on some files, just <!doctype html> is the result.
I would say all files are the same format, so it's mysterious.
It’s not strange and it’s not related to the file format at all ! The reason is that, for big files, it may happen that the regex does not work properly and deletes all characters but the first line of your HTML files. Indeed, as the regex is :
(?s).*?(?-si:pass(word)?.*?>(.+?)(?=</div>))|.+
The beginning (?s).*?(?-si:pass(word)? means that the regex engine selects all characters, even if displayed on several lines, from the current position of the caret till the first word pass or password. In some files, this range of characters can be significant, and this fact could explain the unexpected results !
If your HTML files are neither important nor confidential, simply e-mail me one of the files which produce errors. I’ll try to find another regex which works correctly in all cases ;-))
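One possible way to avoid a match that spans a large part of the file is to test each line on its own. This is not the regex promised above, only a hedged Python sketch; it assumes that pass(word), the target and </div> always sit on the same line, as in the sample line quoted earlier in this thread. The names LINE_RE and targets are hypothetical.

```python
import re

# Per-line variant: no leading "(?s).*?" is needed, and (word) is made
# non-capturing so that the target becomes group 1.
LINE_RE = re.compile(r"pass(?:word)?.*?>(.+?)(?=</div>)")


def targets(text):
    """Return the target found on each line containing pass(word) ... </div>."""
    found = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            found.append(match.group(1))
    return found


big = (
    "<!doctype html>\n"
    + "<p>filler</p>\n" * 1000
    + "/><br />password<br />target</div>\n"
    + "<div>x</div>\n"
    + "/><br />pass>other</div>\n"
)
print(targets(big))  # ['target', 'other']
```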
Next, you said :
Is it maybe possible to implement a rule that avoids duplicate results?
My question is : in the copied HTML files that contain the passwords ( 1 per line ), what is the maximum length of these files ?
Depending on this length, a regex solution may be possible… However, if you don’t mind changing the initial order of these passwords, just use, for each copied HTML file, the two menu options below :
- Edit > Line Operations > Sort Lines Lexicographically Ascending
- Edit > Line Operations > Remove Consecutive Duplicate Lines
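The effect of these two menu options can be sketched in Python. The function names are hypothetical, and Python's default string ordering is only assumed to be close to the lexicographic sort of Notepad++. The second function shows an alternative that keeps the initial order.

```python
from itertools import groupby


def sort_and_dedupe(lines):
    """Sort the lines, then drop consecutive duplicates (the two menu steps)."""
    return [key for key, _ in groupby(sorted(lines))]


def dedupe_keep_order(lines):
    """Drop duplicates while keeping the first occurrence of each line."""
    return list(dict.fromkeys(lines))


passwords = ["b", "a", "b", "c", "a"]
print(sort_and_dedupe(passwords))    # ['a', 'b', 'c']
print(dedupe_keep_order(passwords))  # ['b', 'a', 'c']
```

Sorting first matters for the menu-based method, because removing consecutive duplicates only catches identical lines that are adjacent.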
Finally, you said :
Another thing: even if I choose “Automatischer Zeilenumbruch”, everything is written on ONE line, not continued on a second one, which would avoid the vertical scrolling.
I’m sorry, but I cannot guess what you’re speaking of :-(( Depending on your line-ending characters, discussed previously, and using the appropriate Replace regex :
\2\r\n for Windows files
OR
\2\n for Unix files
the View > Word Wrap option should work correctly !?
Best Regards,
guy038
-