find two letters between quotes in lang tag

guy038

Hi, @pouemes44 and All,

Not really difficult !

SEARCH / MARK (?-i)(?<=\x20lang=")\l{2}(?=">)

Notes :

This regex searches for 2 lowercase letters \l{2} but ONLY IF :
- It is preceded with a space char and the string lang=", with this exact case, due to the look-behind (?<=\x20lang=")
- It is followed with the string ">, due to the look-ahead structure (?=">)
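
For anyone who wants to try the same look-behind / look-ahead idea outside Notepad++, here is a minimal Python sketch. It is an approximation rather than the exact regex above: Python's re module has no \l class, so [a-z] stands in for it, and (?-i) is dropped because Python is case-sensitive by default. The sample lines are invented for illustration.

```python
import re

# [a-z] replaces \l (Python's re has no lowercase-letter shorthand).
# The look-behind and look-ahead check the context without consuming it,
# so only the two letters themselves are returned.
LANG_CODE = re.compile(r'(?<=\x20lang=")[a-z]{2}(?=">)')

# Hypothetical sample lines, made up for this illustration.
samples = [
    '<span lang="en">hello</span>',   # two lowercase letters
    '<span lang="EN">hello</span>',   # uppercase letters
    '<span lang="kok">hello</span>',  # three letters
    '<span xlang="de">hello</span>',  # no space char before lang
]

found = [m.group(0) for line in samples for m in LANG_CODE.finditer(line)]
print(found)  # ['en']
```

Only the first sample satisfies all three conditions at once, which is why the result holds a single code.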

Best Regards,

guy038

Alan Kilborn

@guy038 said in find two letters between quotes in lang tag:

SEARCH / MARK (?-i)(?<=\x20lang=")\l{2}(?=">)

It seems overly restrictive to me.
OP mentions nothing about uppercase versus lowercase.
OP mentions “tags” but how are we to know what this really means for them?
Based upon OP’s specification, I would try:

(?-i)lang="[[:alpha:]]{2}"

Of course, still vague is the type of double-quotes we are talking about.

Pouemes44

thanks to you

guy038, it seems to work perfectly, but I have lang="fr" in all my pages.
How do I exclude lang="fr" from the search?
Alan, yours matched tags with 3 letters.

Alan Kilborn

@Pouemes44 said in find two letters between quotes in lang tag:

Alan, yours matched tags with 3 letters.

Hmmm. Really not sure how you could get that result.

Even though you didn’t have luck with mine, here’s how I’d change mine to exclude matching fr :

(?-i)lang="(?!fr)[[:alpha:]]{2}"

Pouemes44

Thanks Alan, yes, it works like this.
Perhaps the first time I forgot the last quote; that is why it was not correct.
Many thanks.
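
The effect of the closing quote, and of the (?!fr) exclusion, can be checked with a small Python sketch. It rests on two assumptions: [[:alpha:]] is a POSIX class that Python's re module does not support, so the ASCII-only [A-Za-z] is used in its place, and the sample text is invented.

```python
import re

# [A-Za-z] is an ASCII-only stand-in for [[:alpha:]].
WITH_QUOTE = re.compile(r'lang="(?!fr)[A-Za-z]{2}"')
WITHOUT_QUOTE = re.compile(r'lang="(?!fr)[A-Za-z]{2}')  # last quote forgotten

# Hypothetical sample text, made up for this illustration.
text = '<p lang="fr">a</p> <p lang="en">b</p> <p lang="kok">c</p>'

print(WITH_QUOTE.findall(text))     # ['lang="en"']
print(WITHOUT_QUOTE.findall(text))  # ['lang="en', 'lang="ko']
```

Without the final quote, nothing forces the value to end after two letters, so the first two letters of a three-letter code are matched as well.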

astrosofista

@Pouemes44 said in find two letters between quotes in lang tag:

How do I exclude lang="fr" from the search?

Try this instead:

(?-i)(?<=\x20lang=")(?!fr)\l{2}(?=">)

Take care and have fun!

guy038

Hi, @pouemes44, @alan-kilborn, @astrosofista and All,

Alan, hmm…, interesting ! Since my last post, I have assumed that @pouemes44 was talking about the HTML lang attribute. So, I dug around a bit on the Net !

And, from these links :

We can deduce that :

- Language codes always have 2 or 3 lowercase letters ( Refer to ISO 639-2, from Wikipedia )
- Country codes always have 2 uppercase letters ( Refer to the ISO 3166-1 alpha-2 code list, from Wikipedia )
- A language code stands by itself OR may be followed with a dash - and a country code OR a script code ( Refer to the ISO Language Code Table )
- Generally, language tags are lowercase, alphabetic region subtags are uppercase, and script tags begin with an initial capital ( Refer to https://www.w3.org/International/articles/language-tags/#rfc )
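
These casing conventions can be sketched as a small Python helper that splits a tag into its subtags. It is a deliberately simplified shape and not a full language-tag parser: variants, extensions, numeric regions and private-use subtags are ignored, the function name split_tag is hypothetical, and the sketch enforces the conventional casing even though language tags are normally compared case-insensitively.

```python
import re

# Simplified shape: language, then optional script, then optional region.
TAG = re.compile(
    r'(?P<language>[a-z]{2,3})'         # 2 or 3 lowercase letters
    r'(?:-(?P<script>[A-Z][a-z]{3}))?'  # initial capital + 3 lowercase letters
    r'(?:-(?P<region>[A-Z]{2}))?'       # 2 uppercase letters
)

def split_tag(tag):
    """Return a dict of subtags, or None if the tag does not fit the shape."""
    m = TAG.fullmatch(tag)
    return m.groupdict() if m else None

print(split_tag("fr"))       # {'language': 'fr', 'script': None, 'region': None}
print(split_tag("zh-Hans"))  # {'language': 'zh', 'script': 'Hans', 'region': None}
print(split_tag("en-GB"))    # {'language': 'en', 'script': None, 'region': 'GB'}
print(split_tag("EN-gb"))    # None (conventional casing not respected)
```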

So from the main list, below ( Refer http://www.lingoes.net/en/translator/langcode.htm ), with 241 items :

af          Afrikaans
af-ZA       Afrikaans (South Africa)
ar          Arabic
ar-AE       Arabic (U.A.E.)
ar-BH       Arabic (Bahrain)
ar-DZ       Arabic (Algeria)
ar-EG       Arabic (Egypt)
ar-IQ       Arabic (Iraq)
ar-JO       Arabic (Jordan)
ar-KW       Arabic (Kuwait)
ar-LB       Arabic (Lebanon)
ar-LY       Arabic (Libya)
ar-MA       Arabic (Morocco)
ar-OM       Arabic (Oman)
ar-QA       Arabic (Qatar)
ar-SA       Arabic (Saudi Arabia)
ar-SY       Arabic (Syria)
ar-TN       Arabic (Tunisia)
ar-YE       Arabic (Yemen)
az          Azeri (Latin)
az-AZ       Azeri (Latin) (Azerbaijan)
az-AZ       Azeri (Cyrillic) (Azerbaijan)
be          Belarusian
be-BY       Belarusian (Belarus)
bg          Bulgarian
bg-BG       Bulgarian (Bulgaria)
bs-BA       Bosnian (Bosnia and Herzegovina)
ca          Catalan
ca-ES       Catalan (Spain)
cs          Czech
cs-CZ       Czech (Czech Republic)
cy          Welsh
cy-GB       Welsh (United Kingdom)
da          Danish
da-DK       Danish (Denmark)
de          German
de-AT       German (Austria)
de-CH       German (Switzerland)
de-DE       German (Germany)
de-LI       German (Liechtenstein)
de-LU       German (Luxembourg)
dv          Divehi
dv-MV       Divehi (Maldives)
el          Greek
el-GR       Greek (Greece)
en          English
en-AU       English (Australia)
en-BZ       English (Belize)
en-CA       English (Canada)
en-CB       English (Caribbean)
en-GB       English (United Kingdom)
en-IE       English (Ireland)
en-JM       English (Jamaica)
en-NZ       English (New Zealand)
en-PH       English (Republic of the Philippines)
en-TT       English (Trinidad and Tobago)
en-US       English (United States)
en-ZA       English (South Africa)
en-ZW       English (Zimbabwe)
eo          Esperanto
es          Spanish
es-AR       Spanish (Argentina)
es-BO       Spanish (Bolivia)
es-CL       Spanish (Chile)
es-CO       Spanish (Colombia)
es-CR       Spanish (Costa Rica)
es-DO       Spanish (Dominican Republic)
es-EC       Spanish (Ecuador)
es-ES       Spanish (Castilian)
es-ES       Spanish (Spain)
es-GT       Spanish (Guatemala)
es-HN       Spanish (Honduras)
es-MX       Spanish (Mexico)
es-NI       Spanish (Nicaragua)
es-PA       Spanish (Panama)
es-PE       Spanish (Peru)
es-PR       Spanish (Puerto Rico)
es-PY       Spanish (Paraguay)
es-SV       Spanish (El Salvador)
es-UY       Spanish (Uruguay)
es-VE       Spanish (Venezuela)
et          Estonian
et-EE       Estonian (Estonia)
eu          Basque
eu-ES       Basque (Spain)
fa          Farsi
fa-IR       Farsi (Iran)
fi          Finnish
fi-FI       Finnish (Finland)
fo          Faroese
fo-FO       Faroese (Faroe Islands)
fr          French
fr-BE       French (Belgium)
fr-CA       French (Canada)
fr-CH       French (Switzerland)
fr-FR       French (France)
fr-LU       French (Luxembourg)
fr-MC       French (Principality of Monaco)
gl          Galician
gl-ES       Galician (Spain)
gu          Gujarati
gu-IN       Gujarati (India)
he          Hebrew
he-IL       Hebrew (Israel)
hi          Hindi
hi-IN       Hindi (India)
hr          Croatian
hr-BA       Croatian (Bosnia and Herzegovina)
hr-HR       Croatian (Croatia)
hu          Hungarian
hu-HU       Hungarian (Hungary)
hy          Armenian
hy-AM       Armenian (Armenia)
id          Indonesian
id-ID       Indonesian (Indonesia)
is          Icelandic
is-IS       Icelandic (Iceland)
it          Italian
it-CH       Italian (Switzerland)
it-IT       Italian (Italy)
ja          Japanese
ja-JP       Japanese (Japan)
ka          Georgian
ka-GE       Georgian (Georgia)
kk          Kazakh
kk-KZ       Kazakh (Kazakhstan)
kn          Kannada
kn-IN       Kannada (India)
ko          Korean
ko-KR       Korean (Korea)
kok         Konkani
kok-IN      Konkani (India)
ky          Kyrgyz
ky-KG       Kyrgyz (Kyrgyzstan)
lt          Lithuanian
lt-LT       Lithuanian (Lithuania)
lv          Latvian
lv-LV       Latvian (Latvia)
mi          Maori
mi-NZ       Maori (New Zealand)
mk          FYRO Macedonian
mk-MK       FYRO Macedonian (Former Yugoslav Republic of Macedonia)
mn          Mongolian
mn-MN       Mongolian (Mongolia)
mr          Marathi
mr-IN       Marathi (India)
ms          Malay
ms-BN       Malay (Brunei Darussalam)
ms-MY       Malay (Malaysia)
mt          Maltese
mt-MT       Maltese (Malta)
nb          Norwegian (Bokmål)
nb-NO       Norwegian (Bokmål) (Norway)
nl          Dutch
nl-BE       Dutch (Belgium)
nl-NL       Dutch (Netherlands)
nn-NO       Norwegian (Nynorsk) (Norway)
ns          Northern Sotho
ns-ZA       Northern Sotho (South Africa)
pa          Punjabi
pa-IN       Punjabi (India)
pl          Polish
pl-PL       Polish (Poland)
ps          Pashto
ps-AR       Pashto (Afghanistan)
pt          Portuguese
pt-BR       Portuguese (Brazil)
pt-PT       Portuguese (Portugal)
qu          Quechua
qu-BO       Quechua (Bolivia)
qu-EC       Quechua (Ecuador)
qu-PE       Quechua (Peru)
ro          Romanian
ro-RO       Romanian (Romania)
ru          Russian
ru-RU       Russian (Russia)
sa          Sanskrit
sa-IN       Sanskrit (India)
se          Sami (Northern)
se-FI       Sami (Northern) (Finland)
se-FI       Sami (Skolt) (Finland)
se-FI       Sami (Inari) (Finland)
se-NO       Sami (Northern) (Norway)
se-NO       Sami (Lule) (Norway)
se-NO       Sami (Southern) (Norway)
se-SE       Sami (Northern) (Sweden)
se-SE       Sami (Lule) (Sweden)
se-SE       Sami (Southern) (Sweden)
sk          Slovak
sk-SK       Slovak (Slovakia)
sl          Slovenian
sl-SI       Slovenian (Slovenia)
sq          Albanian
sq-AL       Albanian (Albania)
sr-BA       Serbian (Latin) (Bosnia and Herzegovina)
sr-BA       Serbian (Cyrillic) (Bosnia and Herzegovina)
sr-SP       Serbian (Latin) (Serbia and Montenegro)
sr-SP       Serbian (Cyrillic) (Serbia and Montenegro)
sv          Swedish
sv-FI       Swedish (Finland)
sv-SE       Swedish (Sweden)
sw          Swahili
sw-KE       Swahili (Kenya)
syr         Syriac
syr-SY      Syriac (Syria)
ta          Tamil
ta-IN       Tamil (India)
te          Telugu
te-IN       Telugu (India)
th          Thai
th-TH       Thai (Thailand)
tl          Tagalog
tl-PH       Tagalog (Philippines)
tn          Tswana
tn-ZA       Tswana (South Africa)
tr          Turkish
tr-TR       Turkish (Turkey)
tt          Tatar
tt-RU       Tatar (Russia)
ts          Tsonga
uk          Ukrainian
uk-UA       Ukrainian (Ukraine)
ur          Urdu
ur-PK       Urdu (Islamic Republic of Pakistan)
uz          Uzbek (Latin)
uz-UZ       Uzbek (Latin) (Uzbekistan)
uz-UZ       Uzbek (Cyrillic) (Uzbekistan)
vi          Vietnamese
vi-VN       Vietnamese (Viet Nam)
xh          Xhosa
xh-ZA       Xhosa (South Africa)
zh          Chinese
zh-CN       Chinese (State)
zh-Hans     Chinese (Simplified Han Script)
zh-Hant     Chinese (Traditional Han Script)
zh-HK       Chinese (Hong Kong)
zh-MO       Chinese (Macau)
zh-SG       Chinese (Singapore)
zh-TW       Chinese (Taiwan)
zu          Zulu
zu-ZA       Zulu (South Africa)

This new regex version matches all the possible language codes :

SEARCH / MARK (?-i)(?<=\x20lang=")(?:zh\-Han(s|t)|\l{2,3}(-\u{2})?)(?=">?)

Now, in order to skip only the two language codes "fr" and "fr-FR", prefer the regex below :

SEARCH / MARK (?-i)(?<=\x20lang=")(?:zh\-Han(s|t)|(?!fr(-FR)?">?)\l{2,3}(-\u{2})?)(?=">?)

You may test these two regexes against the list above !
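
Such a test can also be scripted. The sketch below is a Python translation made under a few assumptions: \l and \u do not exist in Python's re module, so [a-z] and [A-Z] replace them; (?-i) is dropped because Python is case-sensitive by default; the groups are made non-capturing; and only a handful of codes from the list are tried, each wrapped in an invented <p lang="..."> tag.

```python
import re

ALL_CODES = re.compile(
    r'(?<=\x20lang=")(?:zh-Han(?:s|t)|[a-z]{2,3}(?:-[A-Z]{2})?)(?=">?)'
)
EXCEPT_FR = re.compile(
    r'(?<=\x20lang=")'
    r'(?:zh-Han(?:s|t)|(?!fr(?:-FR)?">?)[a-z]{2,3}(?:-[A-Z]{2})?)'
    r'(?=">?)'
)

# A few codes taken from the list above.
codes = ["af", "ar-AE", "fr", "fr-FR", "fr-CA", "kok", "kok-IN",
         "syr-SY", "zh-Hans", "zh-Hant", "zh-CN"]

def matched(regex, code):
    """True if the regex finds exactly this code inside a sample tag."""
    line = f'<p lang="{code}">text</p>'  # hypothetical sample line
    m = regex.search(line)
    return m is not None and m.group(0) == code

all_hits = [c for c in codes if matched(ALL_CODES, c)]
non_fr_hits = [c for c in codes if matched(EXCEPT_FR, c)]

print(all_hits == codes)  # True
print(non_fr_hits)        # every code except 'fr' and 'fr-FR'
```

Note that "fr-CA" is still found by the second regex: the negative look-ahead only rejects "fr" and "fr-FR" when they are immediately followed by the closing quote.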

Best Regards,

guy038

Pouemes44

thanks to all
Yes, I am trying to find in my pages the two-letter ISO codes, which generally refer to ISO 639-1,

because I think I have some mistakes.

Example: Lingala
ISO 639-1 ln
ISO 639-2 lin
ISO 639-3 lin
IETF ln

It is not easy to know which language code to use.

http://www.language-archives.org/language/lin
and
https://www.ethnologue.com/language/lin

use ISO 639-3
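
One way to spot such mix-ups is to tally every lang value found in a page, so that unexpected three-letter codes stand out for manual review. This is only a rough Python sketch: the function name tally_lang_values and the page fragment are hypothetical, and it does not decide which code is the right one.

```python
import re
from collections import Counter

# Captures whatever sits between the quotes of a lang attribute.
LANG_VALUE = re.compile(r'\slang="([^"]*)"')

def tally_lang_values(html):
    """Count each distinct lang value so that odd ones stand out."""
    return Counter(LANG_VALUE.findall(html))

# Hypothetical page fragment; "ln" and "lin" are the Lingala codes quoted above.
page = '''<p lang="fr">Bonjour</p>
<p lang="ln">Mbote</p>
<p lang="lin">Mbote</p>
<p lang="fr">Merci</p>'''

counts = tally_lang_values(page)
three_letter = sorted(v for v in counts if re.fullmatch(r'[a-z]{3}', v))

print(counts["fr"])   # 2
print(three_letter)   # ['lin']
```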

Alan Kilborn

@guy038

Well… You can read into a poster’s request as much as you want, and go off and research a poster’s problem, again, as much as you want. :-)
I’m sure there might be some interesting “finds” along such a journey.

I don’t mind helping with regex requests (except from the “takers”), but I’m sticking to what is asked for, and I’m not going to infer a bunch of stuff. My goal is “get them on their way” quickly. Just my take on it.

Also, if we solve the problem they ask for, and it isn’t the problem they have, perhaps they learn to be better askers?

But, I didn’t exactly solve the problem that was asked for: “find all lines…”. Really then the hit should have consisted of a full line, right? Well, we have some wiggle room here, as a “Find All…” search provides the whole line data requested.

guy038

Hi, @pouemes44, @alan-kilborn, @astrosofista and All,

@pouemes44, the last regex of my previous post finds any language code of 2 or 3 lowercase letters, optionally followed with a dash and a country code of two uppercase letters, other than "fr" and "fr-FR", OR finds the specific zh-Hans and zh-Hant Chinese syntaxes

Now, if we assume, as a matter of principle, that the language codes are all correct in your files, searching for these language codes is simpler ! Indeed, as no syntax check is needed, the regex below should be enough ( The two language codes "fr" and "fr-FR" are not taken into account ! )

SEARCH / MARK (?-is)(?<=\x20lang=")(?!fr"|fr-FR").+?(?=">?)
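
A rough Python equivalent of this simpler regex is sketched below. The (?-is) prefix is dropped, since Python is case-sensitive and its dot does not match line breaks by default, and the sample line is invented.

```python
import re

# Lazy .+? grows one character at a time until the closing quote is next.
SIMPLE = re.compile(r'(?<=\x20lang=")(?!fr"|fr-FR").+?(?=">?)')

# Hypothetical sample line, made up for this illustration.
html = ('<html lang="fr"><p lang="fr-FR">a</p>'
        '<p lang="fr-CA">b</p><p lang="sr-Latn-BA">c</p>')

print(SIMPLE.findall(html))  # ['fr-CA', 'sr-Latn-BA']
```

Because no syntax check is done, any value is accepted as long as it is not exactly "fr" or "fr-FR", including shapes that the stricter regex would not match.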

And, as @alan-kilborn said, if you prefer to highlight the entire lines containing a language code, with their EOL chars, use this regex :

MARK (?-is)^.*\x20lang="(?!fr"|fr-FR").+\R?

Which looks for entire lines, EOL included, containing, at least, a space char, followed by the string lang=", with this exact case, and then by a language code, assumed valid, other than "fr" and "fr-FR"

Finally, if you just need to bookmark the lines containing a global HTML attribute lang="..........", with a language code, assumed valid, other than "fr" and "fr-FR", use this final regex :

MARK (?-i)\x20lang="(?!fr"|fr-FR")
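
The whole-line regex and the bookmark regex can be sketched in Python as follows. Two assumptions apply: Python's re module has no \R, so \n? is used for the line break, and re.MULTILINE is needed for ^ to match at every line start. The sample text is invented.

```python
import re

WHOLE_LINE = re.compile(r'^.*\x20lang="(?!fr"|fr-FR").+\n?', re.MULTILINE)
HAS_LANG = re.compile(r'\x20lang="(?!fr"|fr-FR")')

# Hypothetical sample text, made up for this illustration.
text = (
    '<p lang="fr">un</p>\n'
    '<p lang="en">two</p>\n'
    '<p>no attribute</p>\n'
    '<p lang="fr-FR">trois</p> <b lang="de">vier</b>\n'
)

# Entire lines, line break included.
whole_lines = WHOLE_LINE.findall(text)

# 1-based numbers of the lines that would be bookmarked.
bookmarks = [n for n, line in enumerate(text.splitlines(), start=1)
             if HAS_LANG.search(line)]

print(len(whole_lines))  # 2
print(bookmarks)         # [2, 4]
```

The fourth line is kept even though it starts with lang="fr-FR", because it also contains lang="de".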

BR

guy038

Pouemes44

Great thanks, Guy, for all your explanations, which are precious and could be precious for the next search
Thanks Alan too

Is there someone here who could make a working extension like “toolbucket”, able to search and replace in folders… it would be a great extension

Alan Kilborn

@Pouemes44 said in find two letters between quotes in lang tag:

Is there someone here who could make a working extension like “toolbucket”, able to search and replace in folders… it would be a great extension

Have you tried Replace in Files on the Find in Files tab of the Find window?

Pouemes44

Hello Alan, yes, of course,
but when I must search and replace multiple lines, it is not very easy with such a small window, and I must always use regular expressions for line breaks, so with an extension it would be super
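
In the meantime, a multi-line search and replace over a folder can be sketched with a short Python script. This is not a Notepad++ feature or plugin, only an illustration: the function name replace_in_folder, its parameters and the *.html filter are all hypothetical, files are assumed to be UTF-8, and the demo runs on a temporary folder so that no real file is touched. Back up real files before trying anything like this.

```python
import re
import tempfile
from pathlib import Path

def replace_in_folder(folder, pattern, replacement, glob="*.html"):
    """Apply a regex replacement to every matching file under folder.

    Returns the list of files that were changed.
    """
    regex = re.compile(pattern)
    changed = []
    for path in sorted(Path(folder).rglob(glob)):
        old = path.read_text(encoding="utf-8")
        new = regex.sub(replacement, old)
        if new != old:
            path.write_text(new, encoding="utf-8")
            changed.append(path)
    return changed

# Demo on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text('<p lang="fr">\nold text\n</p>\n', encoding="utf-8")
    # \n in the pattern matches the line break between the two lines.
    changed = replace_in_folder(
        tmp, r'<p lang="fr">\nold text', '<p lang="fr">\nnew text'
    )
    result = page.read_text(encoding="utf-8")

print(len(changed))  # 1
print(repr(result))  # '<p lang="fr">\nnew text\n</p>\n'
```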