Regex for dictionary entries
-
Hi
I’m looking for two regexes. The first would take the main entry at the beginning of the line, substitute it for the tilde character following any numeral, and put the result on a separate line together with its definition(s). The second would delete all the usage examples and their translations, set in either bold or italic, while keeping the rest intact.
Hence:
bankrupt [tab] 1. blah blah 2. blah blah 3. blah blah
bankruptly [tab] müflisane, iflas ederek/etmişçesine
(Assuming each main entry is followed by a tab.) Could someone provide me with one?
Many thanks in advance!
-
I can’t edit my previous post. I meant “headword” instead of “main entry”.
-
Hello, @glossar,
Regular expressions are very powerful, indeed ! But, unfortunately, they cannot detect bold and italic variations of a font :-(( So, we’ll have to find common boundaries for these zones !
Now, as I cannot really exploit your two pictures ( not true text ! ), I advise you to post an example of your initial text and of the resulting text that you expect, after one or several consecutive regex S/R !
Simply, use this syntax, to get raw text, not processed in any way :
~~~diff
From the INITIAL text :

Your
…
text
…
here

I would like this EXPECTED text :

Your
…
changed
…
text
~~~

which will be displayed as :
From the INITIAL text : Your ... text ... here I would like this EXPECTED text : Your ... changed ... text
If you prefer, just send me some of your examples, by e-mail, to :
But, please, associate each exact AFTER text, that you want, to its exact BEFORE text, that you get ;-))
Remember that sometimes, an additional or a missing single character may cause regular expressions to fail !
See you later,
Best regards,
guy038
-
Hi guy,
Thank you for your reply. Below I’ve posted the texts the way you told me. It is my bad that I didn’t post a workable text, but screenshots for visual convenience. The original file is a PDF, scanned from a hard copy of a bilingual dictionary, which was then converted to a Word file, which was in turn saved as an html/txt file for processing. There are still repeated patterns, to a usable degree, in the converted Word file and the html/txt one, although the said conversion introduced wrongly OCR-ed characters and distorted the format a bit. I’m aware that formatting gets lost in plain text and that regex has nothing to do with formatting. In my previous posts I forgot to mention that the first screenshot was taken from the Word file, and I would welcome any (sort of) regex that I could also use within Word, in combination with format selection, hence my mentioning the formatting.
BTW, the number before the tilde character for suffixes (-ly, -ness, -ship, etc.) has either one or two digits.
From the INITIAL text:
bankrupt [tab] 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, to go ~ : batmak, iflâs etmek, to be ~ : yoksun olmak. He seems to be ~ of all kind feelings : Her türlü asil duygulardan yoksun görünüyor. fraudulent/negli-gent - : kötü niyetli batkın, hileli müflis, 2. meteliksiz, 3. yoksun, mahrum, düşkün, an ~ intellectual: fikir yoksunu. ~ of intelligence : akılsız, a moral ~ : ahlâk düşkünü, to be - of manners : terbiyeden yoksun olmak, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. His embezzlement ~ed the company : Zimmetine para geçirmesi şirketi batırdı/iflâs ettirdi. 5. ~ly : müflisane, iflâs ede-rek/etmişçesine.
I would like this EXPECTED text preferably: :)
bankrupt [tab] 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, 2. meteliksiz, 3. yoksun, mahrum, düşkün 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek.
bankruptly [tab] müflisane, iflâs ede-rek/etmişçesine.
If the above one is not possible, then I would like this EXPECTED text: :)
bankrupt [tab] 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, to go ~ : batmak, iflâs etmek, to be ~ : yoksun olmak. He seems to be ~ of all kind feelings : Her türlü asil duygulardan yoksun görünüyor. fraudulent/negli-gent - : kötü niyetli batkın, hileli müflis, 2. meteliksiz, 3. yoksun, mahrum, düşkün, an ~ intellectual: fikir yoksunu. ~ of intelligence : akılsız, a moral ~ : ahlâk düşkünü, to be - of manners : terbiyeden yoksun olmak, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. His embezzlement ~ed the company : Zimmetine para geçirmesi şirketi batırdı/iflâs ettirdi.
bankruptly [tab] müflisane, iflâs ede-rek/etmişçesine.
-
Hi, @glossar and All,
Thanks for following my advice which allows anyone of us to grab your plain text !
I’ve already figured out how to do it, with two regexes, but I still need some pieces of information :
-
A) : Are all the definitions of a given headword placed one after another, without any line-break between them ?
-
B) : Do the headwords always begin the lines or could there be some blank characters before each headword ?
-
C) : Right after each headword, is the present sequence of characters always :
-
`TABULATION` char + the string `1.` + a `SPACE` char + definition(s)… ( case of bankrupt header )
-
`TABULATION` char + definition(s)… ( case of bankruptly header )
-
or anything else ? For instance, spaces before and/or after the tabulation character ?
BR
guy038
-
-
Hi guy,
Thank you.
To answer your questions:-
A) : No, but you can assume that they are, since the majority of them are placed so. I can gladly ignore and delete the ones in the end that don’t follow the pattern in question, or I might go through them visually and fix them manually if it is worth it. But just in case you could work some magic with regex and fix the ones with a line-break in between as well, there are some entries like the one below (again, due to the distortion/loss arising from the conversion):
bankrupt[tab]1. huk. batkın, müflis, batmış, iflâs etmiş, bor
çlarını ödeyemeyen kimse, to go ~ : batmak, iflâs etmek, to be ~ : yoksun olmak. He seems to be ~ of all kind feelings : Her türlü asil duygulardan yoksun görünüyor. fraudulent/negli-gent - : kötü niyetli batkın, hileli müflis,
2. meteliksiz, 3. yoksun, mahrum, düşkün, an ~ intellectual: fikir
yoksunu. ~ of intelligence : akılsız, a moral ~ : ahlâk
düşkünü, to be - of manners : terbiyeden yoksun olmak, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. His embezzlement ~ed the company : Zimmetine para geçirmesi şirketi batırdı/iflâs ettirdi. 5. ~ly : müflisane, iflâs ede-rek/etmişçesine. -
B) : Yes, the headwords always begin the lines.
-
C) :
TABULATION + (0 space/char) + 1. + (0 or 1 or more chars/spaces) + definition(s)… ( case of bankrupt header )
TABULATION + (0 space/char) + (0 or 1 number followed by a dot) + (0 or 1 or more chars/spaces) + definition(s)… ( case of bankruptly header )
-
-
Hi, @glossar and All,
Thanks for your additional hints ! So, here is my first try ! I will consider the TEST text, below :
bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, bor çlarını ödeye meyen kimse, to go ~ : batmak, iflâs etmek, to be ~ : yoksun olmak. He seems to be ~ of all kind feelings : Her türlü asil duygulardan yoksun görünüyor. fraudulent/negli-gent - : kötü niyetli batkın, hileli müflis, 2. meteliksiz, 3. yoksun, mahrum, düşkün, an ~ intellectual: fikir yoksunu. ~ of intelligence : akılsız, a moral ~ : ahlâk düşkünü, to be - of manners : terbiyeden yoksun olmak, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. His embezzlement ~ed the company : Zimmetine para geçirmesi şirketi batırdı/iflâs ettirdi. 5. ~ly : müflisane, iflâs ede-rek/etmişçesine. 6. ~able : müflisane, iflâs ede-rek/etmişçesine. bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, to go ~ : batmak, iflâs etmek, to be ~ : yoksun olmak. He seems to be ~ of all kind fee lings : Her türlü asil duygulardan yoksun görünüyor. fraudulent/negli-gent - : kötü niyetli batkın, hileli müflis, 2. meteliksiz, 3. yoksun, mahrum, düşkün, an ~ intellectual: fikir yoksunu. ~ of intelligence : akılsız, a moral ~ : ahlâk düşkünü, to be - of manners : terbiyeden yoksun olmak, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. His embezzlement ~ed the company : Zimmetine para geçirmesi şirketi batırdı/iflâs ettirdi. 5. ~ly : müflisane, iflâs ede-rek/etmişçesine, to be ~ : yoksun olmak. 6. ~able : müflisane, iflâs ede-rek/etmişçesine, to be ~ : yoksun olmak.
In that TEST text, @glossar, you’ll notice several particularities :
-
I duplicated the `bankrupt` header word, with some line-breaks in between, to simulate a second header word, below the first one !
-
In the first `bankrupt` header word, I decided to split the text that you want to keep, twice. So you get the text :
bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, bor
çlarını ödeye
meyen kimse, to go ~ : batmak, iflâs etmek,.......
-
In the second `bankrupt` header word, I decided to split the text that you want to get rid of, once. So you get the text :
bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, to go ~ : batmak, iflâs etmek, to be ~ : yoksun olmak. He seems to be ~ of all kind fee
lings : Her türlü asil duygulardan yoksun görünüyor........
-
In the two `bankrupt` header words, I added, at the end of the definitions, the part :
6. ~able : müflisane, iflâs ede-rek/etmişçesine.
to simulate a third header word `bankruptable` ( BTW, according to DSpellcheck, it’s not a correct English word ! )
-
Finally, in the second `bankrupt` header word, I also added, at the end of the `5.` and new `6.` definitions, the following rubbish text :
to be ~ : yoksun olmak.
to simulate a part of text which we must get rid of !
Now, let’s go, modifying that text, correctly ;-))
-
Move back to the very beginning of your words list ( `Ctrl + Home` )
-
Open the Replace dialog ( `Ctrl + H` )
-
Uncheck the `Wrap around` option
-
SEARCH `(\R)\R*(?=\w+\t)|\R(?=[^\t\r\n]+\R)`
-
REPLACE `?1\1`
-
Click ONCE on the `Replace All` button
This first regex S/R will perform two things :
-
It will reduce any run of consecutive line-breaks, located before a header word, to a single line-break
-
It will delete any additional line-break, wrongly added during the conversion phase
So, you get the following text :
bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, to go ~ : batmak, iflâs etmek, to be ~ : yoksun olmak. He seems to be ~ of all kind feelings : Her türlü asil duygulardan yoksun görünüyor. fraudulent/negli-gent - : kötü niyetli batkın, hileli müflis, 2. meteliksiz, 3. yoksun, mahrum, düşkün, an ~ intellectual: fikir yoksunu. ~ of intelligence : akılsız, a moral ~ : ahlâk düşkünü, to be - of manners : terbiyeden yoksun olmak, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. His embezzlement ~ed the company : Zimmetine para geçirmesi şirketi batırdı/iflâs ettirdi. 5. ~ly : müflisane, iflâs ede-rek/etmişçesine. 6. ~able : müflisane, iflâs ede-rek/etmişçesine. bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, to go ~ : batmak, iflâs etmek, to be ~ : yoksun olmak. He seems to be ~ of all kind feelings : Her türlü asil duygulardan yoksun görünüyor. fraudulent/negli-gent - : kötü niyetli batkın, hileli müflis, 2. meteliksiz, 3. yoksun, mahrum, düşkün, an ~ intellectual: fikir yoksunu. ~ of intelligence : akılsız, a moral ~ : ahlâk düşkünü, to be - of manners : terbiyeden yoksun olmak, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. His embezzlement ~ed the company : Zimmetine para geçirmesi şirketi batırdı/iflâs ettirdi. 5. ~ly : müflisane, iflâs ede-rek/etmişçesine, to be ~ : yoksun olmak. 6. ~able : müflisane, iflâs ede-rek/etmişçesine, to be ~ : yoksun olmak.
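For readers who want to try the first S/R outside Notepad++, here is a minimal Python sketch of the same idea. It is an approximation, not the thread's exact regex: Python's `re` module has neither `\R` nor the conditional replacement `?1\1`, so the sketch assumes `\R` can be emulated with an explicit alternation and the replacement with a small function. The names `NL`, `JOIN_RE` and `join_broken_lines` are made up for this illustration.

```python
import re

# One line break in any common convention; stands in for Boost's \R.
NL = r"(?:\r\n|\n|\r)"

# Alternative 1: a run of line breaks right before "headword<TAB>";
# group 1 captures the first break of the run.
# Alternative 2: a single break followed by a tab-free line that itself
# ends with a break (a stray break inside a definition).
JOIN_RE = re.compile(rf"({NL}){NL}*(?=\w+\t)|{NL}(?=[^\t\r\n]+{NL})")


def join_broken_lines(text: str) -> str:
    # Emulates the replacement ?1\1: keep group 1 when it took part in the
    # match, otherwise replace the match with nothing.
    return JOIN_RE.sub(lambda m: m.group(1) or "", text)


# Hypothetical mini-sample: one stray break inside a definition and three
# breaks before the next headword.
sample = "bankrupt\t1. huk. bor\nçlarını kimse,\n\n\nbankruptly\tmüflisane\n"
print(repr(join_broken_lines(sample)))
```

As in the Notepad++ version, the stray break is removed without inserting a space, so `bor` and `çlarını` are glued back into one word.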
The second regex S/R, below, will create the two new headers `bankruptly` and `bankruptable`, after each `bankrupt` header word :
-
SEARCH `(?s)(\w+)\t[^\t]+\K\x20\d+\.\x20~(\w+)\x20:`
-
REPLACE `\r\n\1\2\t1.`
-
Click, SEVERAL times, on the `Replace All` button exclusively ( Do not use the `Replace` button ) till the message `Replace All: 0 occurrence were replaced` occurs ! ( In this example, you’ll need to click `3` times )
You’ll obtain :
bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, to go ~ : batmak, iflâs etmek, to be ~ : yoksun olmak. He seems to be ~ of all kind feelings : Her türlü asil duygulardan yoksun görünüyor. fraudulent/negli-gent - : kötü niyetli batkın, hileli müflis, 2. meteliksiz, 3. yoksun, mahrum, düşkün, an ~ intellectual: fikir yoksunu. ~ of intelligence : akılsız, a moral ~ : ahlâk düşkünü, to be - of manners : terbiyeden yoksun olmak, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. His embezzlement ~ed the company : Zimmetine para geçirmesi şirketi batırdı/iflâs ettirdi. bankruptly 1. müflisane, iflâs ede-rek/etmişçesine. bankruptable 1. müflisane, iflâs ede-rek/etmişçesine. bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, to go ~ : batmak, iflâs etmek, to be ~ : yoksun olmak. He seems to be ~ of all kind feelings : Her türlü asil duygulardan yoksun görünüyor. fraudulent/negli-gent - : kötü niyetli batkın, hileli müflis, 2. meteliksiz, 3. yoksun, mahrum, düşkün, an ~ intellectual: fikir yoksunu. ~ of intelligence : akılsız, a moral ~ : ahlâk düşkünü, to be - of manners : terbiyeden yoksun olmak, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. His embezzlement ~ed the company : Zimmetine para geçirmesi şirketi batırdı/iflâs ettirdi. bankruptly 1. müflisane, iflâs ede-rek/etmişçesine, to be ~ : yoksun olmak. bankruptable 1. müflisane, iflâs ede-rek/etmişçesine, to be ~ : yoksun olmak.
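The repeated clicks on `Replace All` can be pictured as a loop that runs until nothing is left to replace. Below is a rough Python sketch under a few assumptions: Python has no `\K`, so the text before the suffix is captured and written back instead; the new line uses `\n` by default rather than `\r\n`; and the names `SUFFIX_RE` and `split_suffix_entries` are invented for the example.

```python
import re

# Group 1: headword + TAB + everything up to the LAST numbered "~suffix :"
#          before the next TAB (re-emitted, since Python has no \K).
# Group 2: the headword itself. Group 3: the suffix after the tilde.
SUFFIX_RE = re.compile(r"((\w+)\t[^\t]+) \d+\. ~(\w+) :")


def split_suffix_entries(text: str, newline: str = "\n") -> str:
    # Each pass moves at most one suffix per entry onto its own line,
    # so repeat until a pass replaces nothing.
    while True:
        text, count = SUFFIX_RE.subn(
            lambda m: f"{m.group(1)}{newline}{m.group(2)}{m.group(3)}\t1.",
            text,
        )
        if count == 0:
            return text


# Hypothetical shortened entry with two suffixes.
sample = "bankrupt\t1. batkın, 2. meteliksiz. 5. ~ly : müflisane. 6. ~able : x.\n"
print(split_suffix_entries(sample))
```

Because `[^\t]+` is greedy, each pass picks the last remaining suffix of an entry first, which is why several passes are needed and why the new entries still end up in their original order.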
Finally, the third regex S/R, below, should get rid of all text, containing bold/italic sections :
-
SEARCH `(?<=[,.])[\w\x20]+?~.+?(?=\x20\d+|\R|\z)`
-
REPLACE `Leave EMPTY`
-
Click, ONCE, on the `Replace All` button, exclusively ( Again, do not use the `Replace` button )
And… here is your expected text :
bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, 2. meteliksiz, 3. yoksun, mahrum, düşkün, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. bankruptly 1. müflisane, iflâs ede-rek/etmişçesine. bankruptable 1. müflisane, iflâs ede-rek/etmişçesine. bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, 2. meteliksiz, 3. yoksun, mahrum, düşkün, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. bankruptly 1. müflisane, iflâs ede-rek/etmişçesine, bankruptable 1. müflisane, iflâs ede-rek/etmişçesine,
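The third S/R can be sketched in Python as well. This is only an approximation of the Notepad++ regex: `\R` inside the look-ahead is replaced by `[\r\n]` and `\z` by Python's `\Z`, and the names `EXAMPLE_RE` and `strip_usage_examples` are hypothetical. The idea stays the same: starting right after a comma or a full stop, a run of word characters and spaces that reaches a `~` marks the beginning of a usage example, and everything up to the next numbered definition or the end of the line is dropped.

```python
import re

# Start right after "," or ".", take the shortest run of word chars/spaces
# that reaches a tilde, then the shortest run of text up to (but excluding)
# " <digits>", a line break, or the very end of the text.
EXAMPLE_RE = re.compile(r"(?<=[,.])[\w ]+?~.+?(?= \d+|[\r\n]|\Z)")


def strip_usage_examples(text: str) -> str:
    return EXAMPLE_RE.sub("", text)


# Hypothetical shortened entries, already split by the previous step.
sample = (
    "bankrupt\t1. batkın, to go ~ : batmak. He is ~ of it : yok, "
    "2. meteliksiz, an ~ idea: fikir.\n"
    "bankruptly\t1. müflisane, to be ~ : yoksun.\n"
)
print(strip_usage_examples(sample))
```

As in the expected text above, a definition that was followed by examples keeps its trailing comma. Examples that contain no tilde at all (such as the `-` variants produced by the OCR) are only removed when they sit between a tilde example and the next numeral.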
Now, give these `3` regexes a try, against your real text, and verify if some problems still remain ;-))
See you later,
Cheers,
guy038
-
-
Hi guy,
Thank you so much! For the bankrupt entry, we almost got there! I tried the three regexes several times; I might still have missed something, but there seems to be a tiny problem with the second bankrupt entry. The first regex won’t join the “fee” and “lings…” together. I then introduced a second line-break in the second bankrupt entry; this time it fixed the first line-break and joined “fee” and “lings…” together, but didn’t touch the second one. Hence I got the following results, respectively:
bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, 2. meteliksiz, 3. yoksun, mahrum, düşkün, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. bankruptly 1. müflisane, iflâs ede-rek/etmişçesine. bankruptable 1. müflisane, iflâs ede-rek/etmişçesine. bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, lings : Her türlü asil duygulardan yoksun görünüyor. fraudulent/negli-gent - : kötü niyetli batkın, hileli müflis, 2. meteliksiz, 3. yoksun, mahrum, düşkün, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. bankruptly 1. müflisane, iflâs ede-rek/etmişçesine, bankruptable 1. müflisane, iflâs ede-rek/etmişçesine,
bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, 2. meteliksiz, 3. yoksun, mahrum, düşkün, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. bankruptly 1. müflisane, iflâs ede-rek/etmişçesine. bankruptable 1. müflisane, iflâs ede-rek/etmişçesine. bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, niyetli batkın, hileli müflis, 2. meteliksiz, 3. yoksun, mahrum, düşkün, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. bankruptly 1. müflisane, iflâs ede-rek/etmişçesine, bankruptable 1. müflisane, iflâs ede-rek/etmişçesine,
I also applied the regexes to a few other entries. They don’t seem to get the job done there. Two things that I could point out:
-
There may be only 1, or 2 or more suffixes (up to 5) within an entry consecutively, e.g “7. ~ly: (definition(s)), 8 ~able: (definition(s)), 9. ~ness: definition(s)).”
-
A colon (:) may or may not follow the respective suffix, with or without one or more spaces before or after it; only the numerals are consistent, i.e. each suffix is preceded by a numeral. Below are some possibilities, not all, because you will get the idea:
[number+dot]+(0 space)+(~suffix)+(0 space)+(0 colon)+(0 space)+definition(s)
[number+dot]+(0 space)+(~suffix)+(0 space)+(1 colon)+(0 space)+definition(s)
[number+dot]+(1 space)+(~suffix)+(0 space)+(0 colon)+(1 space)+definition(s)
[number+dot]+(1 space)+(~suffix)+(0 space)+(1 colon)+(1 space)+definition(s)
[number+dot]+(1 space)+(~suffix)+(1 space)+(0 colon)+(1 space)+definition(s)
[number+dot]+(1 space)+(~suffix)+(1 space)+(1 colon)+(1 space)+definition(s)
[number+dot]+(0 space)+(~suffix)+(1 space)+(0 colon)+(1 space)+definition(s)
[number+dot]+(0 space)+(~suffix)+(1 space)+(1 colon)+(1 space)+definition(s)
[number+dot]+(2 or more spaces)+(~suffix)+(0 space)+(0 colon)+(0 space)+definition(s)
[number+dot]+(2 or more spaces)+(~suffix)+(1 space)+(0 colon)+(0 space)+definition(s)
[number+dot]+(2 or more spaces)+(~suffix)+(1 space)+(1 colon)+(0 space)+definition(s)
[number+dot]+(2 or more spaces)+(~suffix)+(0 space)+(0 colon)+(1 space)+definition(s)
[number+dot]+(2 or more spaces)+(~suffix)+(1 space)+(0 colon)+(1 space)+definition(s)
[number+dot]+(2 or more spaces)+(~suffix)+(0 space)+(1 colon)+(1 space)+definition(s)
[number+dot]+(2 or more spaces)+(~suffix)+(1 space)+(1 colon)+(1 space)+definition(s)
[number+dot]+(2 or more spaces)+(~suffix)+(0 space)+(0 colon)+(2 or more spaces)+definition(s)
[number+dot]+(2 or more spaces)+(~suffix)+(1 space)+(0 colon)+(2 or more spaces)+definition(s)
[number+dot]+(2 or more spaces)+(~suffix)+(1 space)+(1 colon)+(2 or more spaces)+definition(s)
[number+dot]+(2 or more spaces)+(~suffix)+(1 space)+(0 colon)+(2 or more spaces)+definition(s)
…
…
and all other permutations :(
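One possible way to cover these permutations, shown here only as an untested-on-real-data Python sketch and not as a regex given in this thread, is to make every space run and the colon optional. The names `TOLERANT_RE` and `split_suffix_entries_tolerant` are hypothetical, and the sketch assumes that a space still precedes the numeral and that the numeral has one or two digits. One case cannot be solved this way: when neither a space nor a colon separates the suffix from its definition, a regex has no means to tell where the suffix ends, so `\w+` would swallow the first word of the definition.

```python
import re

# " <1-2 digits>." + optional spaces + "~suffix" + optional spaces
# + optional colon + optional spaces. Group 1 is re-emitted (no \K in Python).
TOLERANT_RE = re.compile(r"((\w+)\t[^\t]+) \d{1,2}\. *~(\w+) *:? *")


def split_suffix_entries_tolerant(text: str) -> str:
    # Repeat until a pass replaces nothing, one suffix per entry per pass.
    while True:
        text, count = TOLERANT_RE.subn(
            lambda m: f"{m.group(1)}\n{m.group(2)}{m.group(3)}\t1. ",
            text,
        )
        if count == 0:
            return text


# Hypothetical entry mixing "no space" and "several spaces" variants.
sample = "bankrupt\t1. batkın. 5.~ly: müflisane. 6.  ~able  :  x.\n"
print(split_suffix_entries_tolerant(sample))
```

The replacement always writes a single space after `1.`, which normalises the spacing of the new entries regardless of how the source was spaced.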
-
-
@glossar, and All,
OK ! So, let’s just split the problem into smaller pieces and focus on the first S/R ;-))
You said :
there seems to be a tiny problem with the second bankrupt entry. The first regex won’t join the “fee” and “lings…” together
From my regex, it should !! Of course, I assume that the TAB character ( `\t` ) exists after each dictionary header word, only !
So, first, could you verify that the `\t` char occurs right after each entry, only, and never occurs elsewhere ?
Now, from this TEST_2 text, below, with some line-breaks between header words and additional line-breaks added inside definitions ( `5` in the definition `#1`, `1` in the definition `#2` and `2` in the definition `#3` ) :
bankrupt 1. huk. bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, bor çlarını ödeye meyen kimse, to go ~ : batmak, iflâs etmek, to be ~ : yoksun olmak. He seems to be ~ of all kind fee lings : Her türlü asil duygu lardan yoksun görünüyor. fraudulent/negli-gent - : kötü niyetli batkın, hileli müflis, 2. meteli ksiz, 3. yoksun, mahrum, düşkün, an ~ intellectual: fikir yoksunu. ~ of intelligence : akılsız, a moral ~ : ahlâk düşkünü, to be - of manners : terbiyeden yoksun olmak, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. His embezzlement ~ed the company : Zimmetine para geçirmesi şirketi batırdı/iflâs ettirdi. 5. ~ly : müflisane, iflâs ede-rek/etmişçesine. 6. ~able : müflisane, iflâs ede-rek/etmişçesine. bankrupt 1. huk.
With my first regex :
-
SEARCH `(\R)\R*(?=\w+\t)|\R(?=[^\t\r\n]+\R)`
-
REPLACE `?1\1`
After clicking on the `Replace All` button, you should get this text :
bankrupt 1. huk. bankrupt 1. huk. batkın, müflis, batmış, iflâs etmiş, borçlarını ödeyemeyen kimse, to go ~ : batmak, iflâs etmek, to be ~ : yoksun olmak. He seems to be ~ of all kind feelings : Her türlü asil duygulardan yoksun görünüyor. fraudulent/negli-gent - : kötü niyetli batkın, hileli müflis, 2. meteliksiz, 3. yoksun, mahrum, düşkün, an ~ intellectual: fikir yoksunu. ~ of intelligence : akılsız, a moral ~ : ahlâk düşkünü, to be - of manners : terbiyeden yoksun olmak, 4. batırmak, iflâs ettirmek, yoksun bırakmak, mahvetmek. His embezzlement ~ed the company : Zimmetine para geçirmesi şirketi batırdı/iflâs ettirdi. 5. ~ly : müflisane, iflâs ede-rek/etmişçesine. 6. ~able : müflisane, iflâs ede-rek/etmişçesine. bankrupt 1. huk.
Which proves that the unnecessary line-breaks have been removed ! Could you confirm that this is the text you obtain, after the regex S/R ?
BR
guy038
-
-
Hi guy,
Just a quick confirmation: I’ve reproduced the same results with the TEST_2 text and the previous ones. After copying and pasting, I simply added a line-break at the end of the very last line by hitting Enter. This last line-break fixed the problem. I’ll continue to apply the regexes to several other entries and I’ll report any problems I encounter.
Thank you so much for your time and effort! I do much appreciate it!
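The effect of that final line-break follows from the second alternative of the first regex, `\R(?=[^\t\r\n]+\R)`: a stray break is only removed when the continuation line after it ends with a line-break itself, which the very last line of a file may lack. The Python sketch below illustrates this on the second alternative alone, with `\R` emulated by an explicit alternation. The `LENIENT_RE` variant, which also accepts the end of the text, is an assumption of this sketch and was not proposed in the thread; in Notepad++ the analogous (untested) change would presumably be `\R(?=[^\t\r\n]+(?:\R|\z))`.

```python
import re

# One line break in any common convention; stands in for Boost's \R.
NL = r"(?:\r\n|\n|\r)"

# Second alternative of the first regex: the continuation line must END
# with a line break for the stray break before it to be removed.
STRICT_RE = re.compile(rf"{NL}(?=[^\t\r\n]+{NL})")

# Variant (an assumption, not from the thread): the continuation line may
# also be terminated by the end of the text.
LENIENT_RE = re.compile(rf"{NL}(?=[^\t\r\n]+(?:{NL}|\Z))")

# Hypothetical last entry of a file, without a final line break.
last_entry = "bankrupt\t1. all kind fee\nlings : x"

print(repr(STRICT_RE.sub("", last_entry)))         # stray break is kept
print(repr(STRICT_RE.sub("", last_entry + "\n")))  # stray break is removed
print(repr(LENIENT_RE.sub("", last_entry)))        # stray break is removed
```

Adding a final line-break to the file, as described above, is the simplest fix and leaves the original regex untouched.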