find in files unicode text

Tudor Raneti

I search the text:
"
fisierul “sesizare comisia pentru libertăți civile, justiție și afaceri interne si probe.pdf” anexat la prezenta
"
with Total Commander. It finds it in multiple doc files. The text obviously contains diacritics and such

I search the same text with Notepad++, and it doesn’t find it. Total Commander also doesn’t find it, unless I tick its “Unicode” check box in the search dialog

Anyway, I installed Notepad++ 7.2, same story. BTW Notepad++ often goes “not responding” on these searches, whereas Total Commander doesn’t, and in fact lets me stop the search at any time

P.S. There’s no such thing as unicode text huh? Then I guess both me and google are hallucinating:
https://www.google.ro/search?q=unicode+text&ie=utf-8&oe=utf-8&client=firefox-b&gws_rd=cr&ei=4PJfWLeqMs7jwQKpzrzwDQ

Claudia Frank

@Tudor-Raneti

I will give it a try.

What is unicode?
First, Unicode is a standard which assigns a unique number to every character.
Character, not text.

What is encoding?
The sequence of bytes representing the number. One encoding might use one byte per number,
as long as that is possible, whereas others would use 2 or 3 or 4 bytes per number.

Why isn’t there such a thing as unicode text?
Because this implies that the text (up to here we know characters are encoded) is encoded
in unicode, but actually it is encoded in utf-8, utf-16 etc…

The same text is stored differently when a different codec is used, even though the unicode number
(code point) of each character stays the same.
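
A minimal sketch of that point, written in Python purely for illustration (the word is taken from the search text above; the choice of UTF-8 and UTF-16 little-endian is an assumption about which encodings are involved). It shows one set of code points producing two different byte sequences:

```python
# One piece of text, i.e. one sequence of code points ...
text = "libertăți"
code_points = [f"U+{ord(ch):04X}" for ch in text]

# ... stored under two different encodings.
utf8_bytes = text.encode("utf-8")        # 1 byte for ASCII, 2 bytes for ă and ț
utf16_bytes = text.encode("utf-16-le")   # 2 bytes for every character here, no BOM

print(code_points)
print(utf8_bytes.hex(" "))
print(utf16_bytes.hex(" "))

# The stored bytes differ, yet decoding each with its own codec
# gives back exactly the same text.
print(utf8_bytes == utf16_bytes)
print(utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == text)
```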

So when can that text be found?
When the search text is encoded in the same way as the text stored in the file.
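
To make this concrete, here is a small hedged sketch in Python. It assumes, only for the sake of the example, that the file stores the text as UTF-16 little-endian; the helper name `found` is made up for this illustration and is not part of any tool mentioned in this thread.

```python
haystack = "sesizare comisia pentru libertăți civile, justiție și afaceri interne"
needle = "justiție și afaceri"

# Assumption for this example: the bytes on disk are UTF-16 little-endian.
stored = haystack.encode("utf-16-le")

def found(data: bytes, pattern: str, encoding: str) -> bool:
    """Hypothetical helper: encode the pattern, then do a plain byte search."""
    return pattern.encode(encoding) in data

print(found(stored, needle, "utf-8"))      # False: pattern bytes do not match the stored bytes
print(found(stored, needle, "utf-16-le"))  # True: same encoding on both sides
```

The text is identical in both calls; only the encoding chosen for the pattern changes the outcome.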

I didn’t check npp source code but I assume the find function uses the same encoding as
defined in current document. Maybe I’m wrong?

Cheers
Claudia

Btw. searching for unicode and text in google doesn’t prove validity.
First, it will search for unicode and/or text but not necessarily both words together.
Enclosing both words in quotes would have decreased the results drastically but, as said,
that doesn’t prove validity either.

Tudor Raneti

Nonsense and no answers

Claudia Frank

@Tudor-Raneti

well, if I’m wrong I would appreciate it if you could clarify what the nonsense is.
If not for me, maybe others could benefit from it. Otherwise
it is nonsense to argue that something is nonsense without providing
the facts why.

An answer is within the nonsense text - maybe you didn’t read completely.

Cheers
Claudia

Scott Sumner

Cheers to Claudia for even trying to help the truly rude! As a helper to so many with problems here…I would listen to every word provided!

Tudor Raneti

May I please know in which way Claudia Frank and Scott Sumner are related to Notepad++?

Claudia Frank

@Tudor-Raneti

May I please know in which way Claudia Frank and Scott Sumner are related to Notepad++?

Being npp users with the passion to help others.

Cheers
Claudia

guy038

Hi, Tudor Raneti

Please, slow down ! Your attitude is a bit offensive to people who just want to help you ! Anyway, I’ll give it another try, too !

From your text, below :

fisierul "sesizare comisia pentru libertăți civile, justiție și afaceri interne si probe.pdf" anexat la prezenta

I simply deduced that you’re Romanian and that your text could be translated into English by the approximate text ( Google translation ) below :

file "notification Committee on Civil Liberties, Justice and Home Affairs and probe.pdf" attached to this

With the default N++ font ( Courier New ), two characters of your alphabet cannot be correctly displayed and are changed into the replacement character ( a small square character )

They are :

The LATIN SMALL LETTER S WITH COMMA BELOW ( Romanian ), with code-point = \x{0219}
The LATIN SMALL LETTER T WITH COMMA BELOW ( Romanian ), with code-point = \x{021b}

Refer to the link , below :

http://www.unicode.org/charts/PDF/U0180.pdf

Do not confuse these two characters with the two others below, which have a cedilla, instead of the comma, below the character, and which are correctly displayed with the Courier New font :

Letter ş , latin small letter s with cedilla, with code-point = \x{015F}
Letter ţ , latin small letter t with cedilla, with code-point = \x{0163}

Refer to the link :

http://www.unicode.org/charts/PDF/U0100.pdf
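
The difference between the two pairs can be checked with a short Python sketch (given here only as an illustration, using the standard `unicodedata` module). Although the glyphs look nearly identical, the code points differ, so a search typed with one variant does not match text stored with the other:

```python
import unicodedata

# (comma below, cedilla) pairs for s and t, written as escapes to avoid any ambiguity
pairs = [("\u0219", "\u015f"), ("\u021b", "\u0163")]

for comma_below, cedilla in pairs:
    print(f"U+{ord(comma_below):04X}  {unicodedata.name(comma_below)}")
    print(f"U+{ord(cedilla):04X}  {unicodedata.name(cedilla)}")
    print("same character:", comma_below == cedilla)

# The same phrase typed with each variant: the strings are not equal,
# so a literal search for one does not find the other.
with_comma = "justi\u021bie \u0219i afaceri"
with_cedilla = "justi\u0163ie \u015fi afaceri"
print(with_comma == with_cedilla)
print(with_comma in "comisia pentru " + with_cedilla)
```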

In a new tab of my local N++ 7.2 configuration, which has the default Unicode encoding UTF-8 ( without BOM ) :

I, first, copied your text, a couple of times
Then, I selected one copy of your text
I, immediately, opened the Find dialog ( Ctrl + F ) => The Find what field was already filled with your text !
Finally, I clicked, a couple of times, on the Find Next button

No problem, it DID detect, successively, all the occurrences of your text :-))

This test succeeded :

With or without the Match whole word only option
With or without the Match case option
With or without the . matches newline option
In normal, extended and regular expression search mode

So…, maybe the current encoding of your file is not a Unicode encoding ? Just look at the bottom right of the status bar !

Cheers,

guy038

gstavi

P.S. There’s no such thing as unicode text huh? Then I guess both me and google are hallucinating:
https://www.google.ro/search?q=unicode+text&ie=utf-8&oe=utf-8&client=firefox-b&gws_rd=cr&ei=4PJfWLeqMs7jwQKpzrzwDQ

Google can find any idiocy on the net.
If you want to start understanding Unicode read this.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Although the previous interaction does not show you as the learning type.

Andrew Dunn

Come now, let’s be civil.

To search, first go to your top menu and left click on Encoding. Then select an encoding. Now open your Find dialog and try it.

I suppose you’re asking for a small encoding list selection option in the Find dialog, correct? Or perhaps an option to search for multiple encodings at once? I suppose that’s faster than running multiple independent searches. Might eat up a lot of GUI space though.

gstavi

Out of curiosity I did browse the code to see how “find in files” works in Notepad++. As far as I understood:

NPP scans the directory tree and creates a list of all files that match the Filters.
For each file in the list NPP will:
– Load the file into NPP in “hidden” state. During that load NPP will guess the file’s encoding. NPP may very well guess wrong.
– Search hidden file within NPP for required pattern.
– Add matching locations to ‘finder’.
– Close hidden file.

I did not go into the actual comparison, but the pattern has to be matched against the file in some common encoding. Either the pattern is re-encoded into the guessed file encoding, or during the scan each symbol is translated into its Unicode code point (U+###) and compared.
The bottom line is that if the initial encoding guess was wrong, NPP will fail to find matches.
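
The flow described above can be sketched roughly as follows. This is a Python illustration only, not Notepad++’s actual code: the names `guess_encoding` and `find_in_files`, the BOM-then-UTF-8 guessing rule, and the cp1252 fallback are all assumptions made for this example.

```python
import codecs
import fnmatch
import os

def guess_encoding(data: bytes) -> str:
    """Hypothetical guess: look for a BOM, then try UTF-8, else fall back to cp1252."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the BOM to pick the byte order
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"

def find_in_files(root: str, pattern: str, file_filter: str = "*.txt"):
    """Walk the tree, decode each matching file with the guessed encoding, search it."""
    matches = []
    for folder, _dirs, names in os.walk(root):
        for name in fnmatch.filter(names, file_filter):
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                data = handle.read()
            # A wrong guess here yields garbage text, and the search silently fails.
            text = data.decode(guess_encoding(data), errors="replace")
            for line_no, line in enumerate(text.splitlines(), start=1):
                if pattern in line:
                    matches.append((path, line_no))
    return matches
```

With this particular guessing rule, a UTF-16 file without a BOM that contains only the characters of the search text happens to be valid UTF-8 byte-wise, so it is guessed as UTF-8 and the pattern is never found in it, even though the text is there.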

I use NPP mostly for UTF-8/ASCII files and sometimes for UTF-16 (Microsoft’s default). NPP auto-detects these well.
I will not be surprised if rarer (country-specific) encodings are often misdetected.

In any case this “find in files” scheme is very slow and I hardly use it. I prefer ‘greping’ externally.
It would be nice to give code developers a much more efficient UTF-8-only find in files that does not bother to load each file into Notepad++. Possibly in a plugin.
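
As a rough idea of what such a UTF-8-only search could look like outside the editor, here is a minimal Python sketch. The function name `grep_utf8` is hypothetical, and the approach (a plain byte search per line, with no decoding and no encoding detection) is an assumption of this example rather than anything Notepad++ or an existing plugin provides.

```python
import os

def grep_utf8(root: str, pattern: str):
    """Yield (path, line number) for each line containing the pattern as UTF-8 bytes."""
    needle = pattern.encode("utf-8")
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                with open(path, "rb") as handle:
                    # Iterating a binary file splits on b"\n"; nothing is decoded.
                    for line_no, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, line_no
            except OSError:
                continue  # unreadable file: skip it

# Usage: for path, line_no in grep_utf8(".", "justiție"): print(path, line_no)
```

Because UTF-8 is self-synchronizing, a byte-level match of a valid UTF-8 pattern inside valid UTF-8 data always falls on character boundaries. Files in any other encoding are simply not matched, and patterns spanning several lines are not supported by this sketch.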