find in files unicode text

guy038

Hello, Tudor Raneti,

As I currently use N++ v7.2, I downloaded the .7z archive of the ( old ! ) N++ version 6.3 and, after extracting it into a dummy folder, I ran a test based on one of my posts, where I spoke about Asiatic CJK characters. Refer to the link :

https://notepad-plus-plus.org/community/topic/12932/select-bookmarked-lines/15

And I could verify that the handling of Unicode characters, when copying/pasting them into the different fields of the Find dialog, was already correct with N++ 6.3, as you can see from the picture below, which I uploaded to the imgur site :

http://imgur.com/1vPMz2q

The Unicode characters that I copied and pasted into the Find what: field had code-points between \x{4000} and \x{9fa5}

So, you should provide additional information, so that we can help you in that matter ;-)

Best Regards,

guy038

Tudor Raneti

There isn’t any more information other than pointing out I’m pasting unicode strings that are made up of unicode characters

I’m not searching for chinese characters but plain english text, and Notepad++ 6.3 can’t see it, where Total commander can

Claudia Frank

@Tudor-Raneti

Sorry, but I think you did not get the point. Guy038 uploaded an image which shows
that it is working, so he expects that you provide additional information about
what you do. How should he know why it is not working on your side if you do not
provide enough information so that one can do the same steps and either confirm
that you might have found a bug or give you advice on what you are doing wrong?

Cheers
Claudia

Tudor Raneti

read the title. It says “unicode text” not “unicode characters”. A string is not a character

It’s either that the textbox doesn’t understand unicode text I paste into it and transforms it to other encoding that doesn’t match, or the find function itself has some issue

Reverse engineer Total commander’s search and see the difference because there find works

Claudia Frank

@Tudor-Raneti

sorry - but I think you still don’t get the point here.

First, there is no such thing as a unicode text.
It does not exist. I could try to explain but I’m already under the impression
it is not worth doing it.

Second, why do you think I should reverse engineer total commanders behavior about this?
You want help, so it is up to you to provide the information that is asked for, so that we are able to help you.

Cheers
Claudia

Tudor Raneti

I search the text:
"
fisierul “sesizare comisia pentru libertăți civile, justiție și afaceri interne si probe.pdf” anexat la prezenta
"
with Total Commander. It finds it in multiple doc files. The text obviously contains diacritics and such

I search the same text with Notepad++, it doesn’t find it. Total Commander also doesn’t find it, unless I tick its “Unicode” check box in the search dialog

Anyway, I installed Notepad++ 7.2, same story. BTW notepad++ often goes “not responding” on these searches where Total commander doesn’t, in fact allowing me to stop the search anytime

P.S. There’s no such thing as unicode text huh? Then I guess both me and google are hallucinating:
https://www.google.ro/search?q=unicode+text&ie=utf-8&oe=utf-8&client=firefox-b&gws_rd=cr&ei=4PJfWLeqMs7jwQKpzrzwDQ

Claudia Frank

@Tudor-Raneti

I will give it a try.

What is unicode?
First, Unicode is a standard which assigns a unique number to every character.
Character not text.

What is encoding?
The sequence of bytes representing the number. One encoding might use one byte per number,
as long as that is possible, whereas others would use 2 or 3 or 4 bytes per number.

Why isn’t there such thing as unicode text?
Because this implies that the text (up to here we know characters are encoded) is encoded
in unicode but actually it is encoded in utf-8, utf-16 etc…

The same text is stored differently when using a different codec, even though the unicode number
(code point) of each character is the same.

So when can such text be found?
When the search text is encoded in the same way as the text stored in the file.
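
A minimal Python sketch of the point above. The sample words come from the text quoted earlier in this thread; the choice of UTF-8 and UTF-16LE as the two encodings is only an assumption for illustration, not a statement about what is actually inside the poster's files.

```python
# The word "justiție", written with an escape so the code point is unambiguous:
# U+021B is LATIN SMALL LETTER T WITH COMMA BELOW.
needle = "justi\u021bie"

# Same characters, same code points, different bytes on disk.
for codec in ("utf-8", "utf-16-le"):
    print(codec, needle.encode(codec).hex())

# Pretend this is the raw content of a file stored as UTF-16LE.
stored = "libert\u0103\u021bi civile, justi\u021bie \u0219i afaceri interne".encode("utf-16-le")

# Searching with the needle encoded the same way as the file: match.
found_same = needle.encode("utf-16-le") in stored

# Searching with the needle encoded as UTF-8: no match, although the
# text is "the same" when shown on screen.
found_other = needle.encode("utf-8") in stored

print(found_same, found_other)
```

So a search tool only reports a hit when the pattern and the file content are compared in one common encoding.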

I didn’t check npp source code but I assume the find function uses the same encoding as
defined in current document. Maybe I’m wrong?

Cheers
Claudia

Btw. searching for unicode and text in google doesn’t prove validity.
First, it will search for unicode and/or text but not necessarily both words.
Enclosing both words in quotes would have decreased the results drastically but, as said,
that doesn’t prove validity either.

Tudor Raneti

Nonsense and no answers

Claudia Frank

@Tudor-Raneti

well, if I’m wrong I would appreciate it if you could clarify what the nonsense is.
If not for me, maybe others could benefit from it; otherwise
it is nonsense to argue that something is nonsense without providing
the facts why.

An answer is within the nonsense text - maybe you didn’t read completely.

Cheers
Claudia

Scott Sumner

Cheers to Claudia for even trying to help the truly rude! As a helper to so many with problems here…I would listen to every word provided!

Tudor Raneti

May I please know in which way Claudia Frank and Scott Sumner are related to Notepad++?

Claudia Frank

@Tudor-Raneti

May I please know in which way Claudia Frank and Scott Sumner are related to Notepad++?

Being npp users with the passion to help others.

Cheers
Claudia

guy038

Hi, Tudor Raneti

Please, slow down ! Your attitude is a bit offensive for people who just want to help you ! Anyway, I’ll give it another try, too !

From your text, below :

fisierul "sesizare comisia pentru libertăți civile, justiție și afaceri interne si probe.pdf" anexat la prezenta

I, simply, deduced that you’re Romanian and that your text could be translated into English by the approximate text ( Google translation ) below :

file "notification Committee on Civil Liberties, Justice and Home Affairs and probe.pdf" attached to this

With the default N++ font ( Courier New ), two characters of your alphabet cannot be correctly displayed and are changed into the replacement character ( a small square character )

They are :

The LATIN SMALL LETTER S WITH COMMA BELOW ( Romanian ), with code-point = \x{0219}
The LATIN SMALL LETTER T WITH COMMA BELOW ( Romanian ), with code-point = \x{021b}

Refer to the link , below :

http://www.unicode.org/charts/PDF/U0180.pdf

Do not confuse these two characters with the two others below, which have a cedilla, instead of the comma, below the character and which are correctly displayed with the Courier New font :

Letter ş , latin small letter s with cedilla, with code-point = \x{015F}
Letter ţ , latin small letter t with cedilla, with code-point = \x{0163}

Refer to the link :

http://www.unicode.org/charts/PDF/U0100.pdf
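
As a quick way to check which of the four characters a given text really contains, here is a small Python sketch using the standard `unicodedata` module. It only prints the official character names and shows that normalization does not merge the comma-below and cedilla forms; it says nothing about how any particular font renders them.

```python
import unicodedata

# The two comma-below letters and the two cedilla letters discussed above.
for ch in "\u0219\u021b\u015f\u0163":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The comma-below and cedilla forms are distinct characters: a plain
# comparison fails, and NFC normalization does not make them equal.
comma_s, cedilla_s = "\u0219", "\u015f"
same_after_nfc = (
    unicodedata.normalize("NFC", comma_s) == unicodedata.normalize("NFC", cedilla_s)
)
print(comma_s == cedilla_s, same_after_nfc)
```

This means a search for a word typed with the cedilla letters will not match the same word stored with the comma-below letters, whatever the file encoding is.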

In a new tab of my local N++ 7.2 configuration, which has the default Unicode encoding UTF-8 ( without BOM ) :

I, first, copied your text, a couple of times
Then, I selected one copy of your text
I, immediately, opened the Find dialog ( Ctrl + F ) => The Find what field is already filled, with your text !
Finally, I clicked, a couple of times, on the Find Next button

No problem, it DID detect, successively, all the occurrences of your text :-))

This test succeeded :

With or without the Match whole word only option
With or without the Match case option
With or without the . matches newline option
In normal, extended and regular expression search mode

So…, maybe the current encoding of your file is not a Unicode encoding ? Just look at the bottom right of the status bar !

Cheers,

guy038

gstavi

P.S. There’s no such thing as unicode text huh? Then I guess both me and google are hallucinating:
https://www.google.ro/search?q=unicode+text&ie=utf-8&oe=utf-8&client=firefox-b&gws_rd=cr&ei=4PJfWLeqMs7jwQKpzrzwDQ

Google can find any idiocy on the net.
If you want to start understanding Unicode read this.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Although the previous interaction does not show you as the learning type.

Andrew Dunn

Come now, let’s be civil.

To search, first go to your top menu and left click on Encoding. Then select an encoding. Now open your Find dialog and try it.

I suppose you’re asking for a small encoding list selection option in the Find dialog, correct? Or perhaps an option to search for multiple encodings at once? I suppose that’s faster than running multiple independent searches. Might eat up a lot of GUI space though.

gstavi

Out of curiosity I did browse the code to see how “find in files” works in Notepad++. As far as I understood:

NPP scans the directory tree and creates a list of all files that match the Filters.
For each file in the list NPP will:
– Load the file into NPP in “hidden” state. During that load NPP will guess the file’s unicode encoding. NPP may very well guess wrong.
– Search hidden file within NPP for required pattern.
– Add matching locations to ‘finder’.
– Close hidden file.

I did not go into the actual comparison but the pattern should be matched against the file using some common encoding. Either the pattern is re-encoded into the guessed file encoding, or during the scan each symbol is translated into its Unicode code point ( U+#### ) and compared.
The bottom line is that if the initial encoding guess was wrong NPP will fail to find matches.
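
To illustrate that bottom line, here is a rough Python sketch of a "guess the encoding, decode, then search" step. It is not Notepad++ code: `guess_encoding` and `find_in_bytes` are hypothetical helpers, and the detection rules (BOM check, then "valid UTF-8?", then a cp1252 fallback) are an assumed stand-in for whatever the editor really does.

```python
import codecs

def guess_encoding(raw: bytes) -> str:
    """Hypothetical detector: BOM first, then 'is it valid UTF-8?'."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"  # assumed fallback, purely for illustration

def find_in_bytes(raw: bytes, pattern: str, encoding=None) -> bool:
    """Decode with the given (or guessed) encoding, then compare as text."""
    text = raw.decode(encoding or guess_encoding(raw), errors="replace")
    return pattern in text

pattern = "justi\u021bie"
raw = pattern.encode("utf-16-le")  # UTF-16LE content without a BOM

# Every byte here is below 0x80, so the content is also *valid* UTF-8
# and the detector guesses wrong.
guessed = guess_encoding(raw)
hit_guessed = find_in_bytes(raw, pattern)              # uses the wrong guess
hit_forced = find_in_bytes(raw, pattern, "utf-16-le")  # uses the real encoding
print(guessed, hit_guessed, hit_forced)
```

With the wrong guess the decoded text is full of NUL characters between the letters, so the pattern is not found even though the file contains it.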

I use NPP mostly for UTF-8/ASCII files and sometimes for UTF-16 (Microsoft’s default). NPP auto-detects these well.
I will not be surprised if rarer (country specific) encodings are often misdetected.

In any case this “find in files” scheme is very slow and I hardly use it. I prefer ‘greping’ externally.
It would be nice to implement for code developers a much more efficient UTF-8 only find in files that does not bother to load each file into Notepad++. Possibly in a plugin.
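
A minimal sketch of what such a UTF-8 only search could look like, written in Python rather than as a real Notepad++ plugin. The function names `grep_utf8` and `file_contains` are made up for this example. It encodes the pattern once as UTF-8 and compares raw bytes chunk by chunk, so no file is decoded or loaded whole; files in any other encoding are simply not matched.

```python
import os
import tempfile

def file_contains(path, needle, chunk_size):
    """Return True if the raw bytes of `path` contain `needle`."""
    overlap = len(needle) - 1
    tail = b""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if needle in window:
                return True
            # Keep the last bytes so a match split across two chunks is seen.
            tail = window[-overlap:] if overlap > 0 else b""

def grep_utf8(root, pattern, chunk_size=1 << 20):
    """Yield paths under `root` whose bytes contain `pattern` encoded as UTF-8."""
    needle = pattern.encode("utf-8")
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            try:
                if file_contains(path, needle, chunk_size):
                    yield path
            except OSError:
                continue  # unreadable file: skip it

# Small demo: one UTF-8 file, one UTF-16LE file with the same text.
with tempfile.TemporaryDirectory() as root:
    for name, codec in (("a.txt", "utf-8"), ("b.txt", "utf-16-le")):
        with open(os.path.join(root, name), "wb") as fh:
            fh.write("xx justi\u021bie yy".encode(codec))
    # A tiny chunk size, only to exercise the chunk-boundary handling.
    matches = [os.path.basename(p) for p in grep_utf8(root, "justi\u021bie", chunk_size=4)]

print(matches)
```

Only the UTF-8 file is reported, which is the trade-off of this approach: it is fast and simple, but it cannot find text in files stored as UTF-16 or in a legacy code page.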