@guy038 said in Search++: A work in progress:
So, if I understand you clearly, we need to transform the selection(s) in Marked Text, first and then use the Find in Mark Text option
Yes. Alternatively, click the Tools button, open Settings and check “Convert selections to marked text before beginning a stepwise search” to have Search++ do it automatically. Otherwise, searches that don’t affect the selection (like Count, Find All or Replace All) can be repeated within the selection, but only the first stepwise Find (or the preliminary find in a stepwise Replace) will be constrained to the selection, since after that the original selection will be gone.
When clicking on the Regex button, do we use your Unicode search engine, as in Columns++ or is it a mix of the Columns++ version and ICU
It’s the Columns++ search engine, except for one thing. Previously I could not figure out how to incorporate ICU4C into the plugin, so for Columns++ I devised a Python program that reads several of the Unicode character data files and writes C++ code that compiles into a gigantic table containing the information I needed. I stumbled on the way to use ICU4C shortly before I began working on Search++, so instead of building and using those tables, I go straight to ICU4C for information (questions like, “What is the general category of this character?” or “Is this a lower case character?”).
This might turn out to have an efficiency impact (better or worse, I don’t know). It should fix some of the errors in Columns++, like [[:lower:]] missing characters that are lower case but not letters.
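To illustrate the kind of character that a letters-only notion of “lower case” misses, here is a minimal Python sketch. It is not Search++ or ICU4C code; it assumes that Python’s `str.islower()` follows the Unicode Lowercase property and uses the standard `unicodedata` module for the general category. The function name `lowercase_non_letters` is made up for this example.

```python
import unicodedata

def lowercase_non_letters(start=0, stop=0x110000):
    """Return characters in [start, stop) that report as lower case
    but whose general category is not Ll (lowercase letter)."""
    found = []
    for cp in range(start, stop):
        ch = chr(cp)
        if ch.islower() and unicodedata.category(ch) != "Ll":
            found.append(ch)
    return found

# Scan a small range that covers small Roman numerals (category Nl)
# and circled Latin small letters (category So).
examples = lowercase_non_letters(0x2170, 0x24EA)
for ch in examples[:5]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```

A test that only asks “is the general category Ll?” would skip every character this function returns, which is the sort of gap described above.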
Oddly, if we choose the ICU button, the Replace and Replace All buttons are not greyed and seem functional, contrary to what you said ?!
They’re not disabled, but all they do is return the message, “Command not implemented.”
Can you recommend a few websites, speaking about ICU and the Unicode Word Boundaries specificity ?
I don’t really have anything except the Unicode documentation. In my brief testing, the practical effect in English is that words like can't are recognized as a single word. Most regular expression engines define a word boundary (\b) in terms of what is a word character (\w). The regular expression engine in ICU lets you do that, but it also provides an option to use Unicode word boundaries to define \b.
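The difference can be sketched in Python. The first pattern uses the usual \b defined by \w; the second is only a rough approximation of the Unicode rule that an apostrophe between two letters does not break a word. It is an assumption made for illustration, not what ICU actually does internally, and it ignores the rest of the Unicode word boundary rules.

```python
import re

text = "I can't see why it won't work"

# \b derived from \w: the apostrophe is not a word character,
# so "can't" is split into "can" and "t".
classic = re.findall(r"\b\w+\b", text)

# Rough stand-in for Unicode word boundaries: allow a straight or
# curly apostrophe between runs of word characters.
approx = re.findall(r"\w+(?:['\u2019]\w+)*", text)

print(classic)
print(approx)
```

With the first pattern the contractions come back in pieces; with the second they are kept whole, which matches the practical effect described above for English text.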
Presently, when hitting the ICU button, are searches like \p{alphabetic} or \p{XID_Continue} possible against my Total_Chars file of 325,590 characters ?
Yes. You can even use things like \p{script=Greek}. Unfortunately, I haven’t been able to find any place where ICU documents its own regular expression syntax. The regular-expressions.info web site includes ICU among the regex dialects it shows.