multi-word expressions across lines
-
Hello everybody,
I hope you are well. I was wondering whether somebody could help me finalise a regular expression I’ve been bogged down with for a while.
I am trying to capture those expressions where gay.* and homosexual.* appear with HIV and AIDS (the order of the two limits can be reversed). The regex I came up with is the following:
\b(gay|homosex|hiv|aids|disease|virus|condition) ([A-Z0-9.,-;/]+ ){1,500}(gay|homosex|hiv|aids|disease|virus|condition)\b
Unfortunately, a few expressions where the virus and the sexuality appear in the text are missed by the regex. In the small text attached below, only two expressions are captured: ‘AIDS and HIV’ in the title (I don’t understand why this is the case, as I wanted the sexuality and the virus to appear together) and AIDS…homosexual (in the penultimate sentence).
Thank you so much for your help and contributions. Looking forward to them,
Ivan
Latest AIDS and HIV figures for Scotland
During April to June 1996, 35 cases of HIV infection were reported to the Scottish Centre for Infection and Environmental Health.Twenty of these were in homosexual/bisexual males, six in injecting drug users and four in persons who were probably infected heterosexually; five cases are as yet undetermined.Thirteen of the cases were from Lothian and thirteen from Greater Glasgow; 29 cases were male.The cumulative total for HIV infected cases to June 30, 1996 is 2,452; 750 (30.6%) are homosexual/bisexual males, 1,102 (44.9%) are injecting drug users and 391 (15.9%) are thought to have been infected heterosexually.The cumulative total of HIV infection for the United Kingdom is now 27,033 of which 23,001 are male and 4,032 female. The majority, 16,542 are homosexual/bisexual males but 5,060 are in the heterosexual non-injecting drug user category.During the second quarter of 1996, 20 cases of AIDS were reported to SCIEH by clinicians, of which 18 were males; ten were homosexual/bisexual males, four injecting drug users and five in the heterosexual category. Ten of the 20 cases were from Lothian, six from Grampian and four from Greater Glasgow. The cumulative total for AIDS cases to June 30, 1996 is 772 of whom 571 have died.309 cases have been in homosexual/bisexual males, 279 in injecting drug users and 114 in heterosexuals. For the UK as a whole, 12,976 cases of AIDS have now been registered, of which 9,344 have been in homosexual/bisexual males.
-
Hello Ivan,
Probably, a nicer regex to achieve what you’re looking for would be:
(?i)(HIV|AIDS).+?(gay|homosexual)|(?2).+?(?1)
Notes:
- First of all, the modifier (?i) forces the search to be case insensitive.
Then:
- The part (HIV|AIDS) looks for one of the two possible names of that disease
- The part (gay|homosexual) tries to match one of the names for that sexuality
- The part .+? represents the shortest non-null amount of text between the strings HIV or AIDS AND gay or homosexual!
Else:
- The subroutine call (?2) refers to group 2, (gay|homosexual)
- The subroutine call (?1) refers to group 1, (HIV|AIDS)
- Again, the part .+? represents the shortest non-null amount of text between the strings gay or homosexual AND the strings HIV or AIDS!
With your given text, to which I added the simple sentence, below, in order to test the second case :
The cumulative total is 750 homosexual/bisexual males, infected by the HIV, on June 30, 1996
I obtained the 7 captures of text below:
1 HIV infection were reported to the Scottish Centre for Infection and Environmental Health.Twenty of these were in homosexual
2 HIV infected cases to June 30, 1996 is 2,452; 750 (30.6%) are homosexual
3 HIV infection for the United Kingdom is now 27,033 of which 23,001 are male and 4,032 female. The majority, 16,542 are homosexual
4 AIDS were reported to SCIEH by clinicians, of which 18 were males; ten were homosexual
5 AIDS cases to June 30, 1996 is 772 of whom 571 have died.309 cases have been in homosexual
6 AIDS have now been registered, of which 9,344 have been in homosexual
7 homosexual/bisexual males, infected by the HIV
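For anyone who wants to try the same two-way alternation outside Notepad++, here is a minimal Python sketch. It rests on one assumption worth stating loudly: Python's standard re module has no (?1)/(?2) subroutine calls, so the second alternative spells both groups out again instead of calling them. The sample is the first sentence of the Scottish text plus the added test sentence; the variable names are only illustrative.

```python
import re

# Python's re module has no (?1)/(?2) subroutine calls, so each group is
# written out again in the second alternative (an adaptation, not the
# original Notepad++ regex).
virus = r"(?:HIV|AIDS)"
sexuality = r"(?:gay|homosexual)"
pattern = re.compile(
    rf"{virus}.+?{sexuality}|{sexuality}.+?{virus}",
    re.IGNORECASE,
)

sample = (
    "During April to June 1996, 35 cases of HIV infection were reported to "
    "the Scottish Centre for Infection and Environmental Health.Twenty of "
    "these were in homosexual/bisexual males\n"
    "The cumulative total is 750 homosexual/bisexual males, infected by the "
    "HIV, on June 30, 1996"
)

# Without re.DOTALL the dot stops at a line break, so each capture stays
# inside a single line.
captures = [m.group(0) for m in pattern.finditer(sample)]
for number, capture in enumerate(captures, start=1):
    print(number, capture)
```

The first capture runs from HIV to the first following homosexual, and the second runs the other way round, from homosexual to HIV.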
Hope that this slicing is what you’re looking for :-)
Best regards,
guy038
P.S.:
- The quantifiers *, +, ?, {n,m} and {n,} are considered greedy quantifiers
- The quantifiers *?, +?, ??, {n,m}? and {n,}? are considered lazy quantifiers
Some examples:
Given the subject string: aaaaaaa333aaaaaaa33333aaaaaaa33333aaaaaaaa333aaaaaaa
- The regex a.+3 captures the string aaaaaaa333aaaaaaa33333aaaaaaa33333aaaaaaaa333
- The regex a.+?3 captures the string aaaaaaa3
- The regex a.+333 captures the string aaaaaaa333aaaaaaa33333aaaaaaa33333aaaaaaaa333
- The regex a.+?333 captures the string aaaaaaa333
- The regex a.+33333 captures the string aaaaaaa333aaaaaaa33333aaaaaaa33333
- The regex a.+?33333 captures the string aaaaaaa333aaaaaaa33333
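The greedy/lazy examples above can be checked with a short script. This is only a sketch using Python's re module rather than Notepad++'s engine; for these simple patterns the two should behave alike. The names subject, patterns and results are illustrative.

```python
import re

subject = "aaaaaaa333aaaaaaa33333aaaaaaa33333aaaaaaaa333aaaaaaa"

# Greedy (.+) versus lazy (.+?) versions of the same three endings.
patterns = ["a.+3", "a.+?3", "a.+333", "a.+?333", "a.+33333", "a.+?33333"]

# re.search returns the first (leftmost) match for each pattern.
results = {p: re.search(p, subject).group(0) for p in patterns}

for p, matched in results.items():
    print(f"{p:<10} {matched}")
```

Each greedy pattern stretches to the last possible ending in the subject, while its lazy twin stops at the first one.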
-
-
Hello guy038,
Thank you so much for taking the trouble to reply to my query. This is very helpful and kind of you.
I found your expression very useful, as there is no doubt yours is nicer and more powerful.
I ran it across a sample of my corpus and I noticed that the regex captures all the text lying between HIV/AIDS and gay or homosexual. This means that if the text contains a first reference to HIV/AIDS and, halfway through, a second reference to gay/homosexual (the references can be inverted, as your regex allows), the entire chunk of text is captured. This poses a problem, however, because the captures may not identify a real connection between gay and HIV. That is why I was thinking that maybe I should set some parameters (number of words), but then it definitely doesn’t capture everything.
I’ve attached a text where the above occurs.
Would you recommend using wild cards to capture also gays or homosexuals?
Thank you so much for your help and attention. You are really helping me get out of this horrible cul-de-sac!
Best wishes,
Ivan
He was just three months early.’
Eric ‘Eazy-E’ Wright died on March 26, 1995, from complications following Aids -
a combination of a collapsed lung causing heart failure and pneumonia. He was
just 31 and had checked himself into hospital with chronic breathing problems
only a month before, completely unaware he was carrying a fatal virus. He left a
wife, Tomica, whom he married in hospital. They were together for four years and
had two children, Dominic, two, and Deijah, born six months after her father’s
death. All have tested negative for HIV.

Long before he passed away, Eric Wright had indeed put Compton on the map. And
as the founding father of gangsta rap, he was arguably one of popular culture’s
most influential figures of the last quarter of the century. As an entrepreneur,
he was an inspiration to millions, and with his death he metamorphosed swiftly
from mogul to martyr. Rarely can such a short life have been so symbolic.

Wright failed to stay the distance at Dominguez High School in Compton, and by
the early Eighties was a regular hustler, a dealer in crack and pot. It must
have been the lure of money or excitement that enticed him, for the home
provided for him and his younger brother and sister was a stable one. His mother
was a Montessori teacher, his father, a retired post office worker and sometime
musician who had a big hit with the 103rd Street Rhythm Band back in the early
Seventies with Express Yourself, later covered by NWA. (Both are still alive,
though his father recently suffered a stroke.) Some of the streets in Compton
are strangely quaint: rows of polite bungalows fronted by porches and lawns
enriched by sub-tropical weather. Some are unremittingly grim, and the main
thoroughfare, Long Beach Boulevard, is a sorry strip of boarded-up businesses,
dishevelled lots, soiled fast-food joints and two-bit stores. Wright came from a
pleasanter part, and it may seem he was self-consciously ‘dropping down’ by his
first choice of career. But with US ghettos it is wise to recall the phrase, ‘it
takes a village to raise a child’. A kid is part of the street environment, like
it or not. And Eric Wright obviously did.

A childhood friend, Big Man (he declined to give his real name), who bears some
resemblance to the splendidly sonorous and rotund singer Barry White - though
his voice is a few notes higher and his girth a couple of feet narrower -
remembers: ‘Unlike Eric, I went all the way through school.’ His smile broadens:
‘Right in the front, right out the back, I never stayed for a class, though I
did have a dice game third period every day. I didn’t run into him on campus too
much.’ As an adult, Wright was so small - a slender 5ft 4in - that everyone
presumed he was a drop-out kid. Most didn’t know his real age until he died.
But he had presence - the presence of money, even when he didn’t have that much.
People noticed him strutting down the street. He wore the same trademark clothes
as the other home boys, but less baggy, with more style. To rib him, close
friends called him ‘casual’. Perhaps his first vehicle, a
psychedelically-painted truck, wasn’t so hip, but Eric - he was always Eric or Little Man,
never Eazy, to his friends - had respect.

He started organising parties with a friend, Andre Young, aka Dr Dre, who had
real talent, and was a member of the World Class Wreckin’ Cru with Antoine
Carraby, aka DJ Yella. Dre began telling Wright what he knew of the music
business. Smart enough to know drug dealing couldn’t last forever, Eric wanted
out. Towards the end of 1985 he took, and passed, the test to join the post
office. But he had also seen that music, like drugs, offered power, and decided
to act.

At the time west coast rap was lame, party stuff about the good times - where
people wanted to be, where they would be soon. It came a poor second to the
east coast, where hip-hop had begun in the late Seventies and burst across the
world with the Sugar Hill Gang’s Rapper’s Delight.

In the spring of 1986, aged 22, Wright assembled the best talent in the 'hood:
Dre, Yella, MC Ren (Lorenzo Patterson) and Ice Cube (O’Shea Jackson). He paid
for studio time and urged Cube, the best rapper and songwriter, to write
something ‘real’, something about the gangs, about the life he’d been leading.
The result was Boyz 'N The Hood (later the title of a film) and the birth of
NWA. Wright paid $ 7,000 for 10,000 12-inch records, and he and Dre would drive
around - by now in a burgundy Suzuki Samurai - selling the discs at swap-meets,
a cross between a flea market and a car-boot sale. By word of mouth alone, the
record sold 500,000 copies.

The next year, Wright hooked up with Jerry Heller, a music business veteran, who
recognised this hardcore stuff as the next big thing. He took Wright to Priority
Records, where the boy from the 'hood talked his way into a unique deal:
Priority would distribute records on Wright’s own label, Ruthless, and Wright
would have, for a newcomer, an unheard of piece of the action.

In 1988, Eazy-E’s first solo album, Eazy Duz It, and NWA’s Straight Outta
Compton, sold more than 5.5 million copies between them - without a single play
on radio or television. Suddenly they were stars - a lifestyle that’s easy to
slip into in LA. Eazy bought a $ 1 million house in a cul-de-sac in Westlake, a
suburb predominated by retired people. Dre and Yella had a house next door and
they shared a party house over the road. By night, they were the people everyone
wanted to know.

As a rapper, Wright had a distinctive, high-pitched, brattish voice, but was not
one of the very best. It could take him hours to get his part down right. Nor
was he a great songwriter. But he cemented NWA and determined their direction.
MC Ren recalls: ‘It just wouldn’t have happened without him. Even though he
wasn’t writing - it was his voice that grabbed everybody. And he had the idea of
putting that shit together, that all-star group. He saw something I didn’t see
and that shit just clicked. Others were doing something like it but not to the
full depth of what we were doing. It shocked a lot of motherfuckers.’ Even the
group’s full name - Niggaz With Attitude - was a challenge to a liberal
mainstream that had shunned the n-word in the post-Martin Luther King era. Now
it was thrust back in their white faces, worn as a badge of honour, a viciously
ironic statement that, dammit, young blacks still felt they were treated like
niggers. Gangsta rap, like all black American music, was rooted in the Blues -
with its lyrical expressiveness and rhythmic foundations - but added to the
plaintiveness of old was an angry, cussing criticism of authority that
represented a new departure.

NWA, and the genre they spawned, represented a nightmare for the respectable
world - and many feminists: a live and kicking validation of foul language,
violence and misogyny. They were reprimanded by the FBI over their song Fuck Tha
Police on Straight Outta Compton, and joined a line of demonised performers that
began with Elvis and continued through Jimi Hendrix, Sly Stone, the Rolling
Stones and the Sex Pistols. With their baseball caps, baggies and snow-white
trainers, they influenced fashion across the world. Their tales of drugs, police
intimidation, gang violence and alienation jump-started a new brand of American
film-making and told the world of black urban angst long before the Rodney King
beating and subsequent LA riots. America’s political leaders had to listen.
Attitude had arrived.

By the time NWA’s 1991 album Efil4Zaggin (Niggaz4Life spelt backwards) became
the first rap album to top the US Billboard charts, middle-class white kids
across America, and indeed Europe, were draping themselves in polyester cotton,
greeting each other with high fives and ‘Yo!’, and were fully acquainted
vicariously with the hip Hades of motherfuckers, bitches and hoes.

Ruthless Records artists have sold more than 28 million records, with 21
recordings reaching gold or platinum status. After NWA split, Ice Cube and Dr
Dre soared. Eazy-E’s final album, Str8 Off Tha Streets Of Muthaphu**in Compton,
which is released on Monday, is sure to hit the top back home. Priority Records
(now separate from Ruthless) has meanwhile produced a greatest hits package,
Eternal E, also available here.

Wright once quoted his personal wealth at $ 60 million and Ruthless’s value at $
20 million. For a time, it was one of the most successful independent labels of
all time, and inspired many imitators, many rivals. For black Americans, Wright
became a beacon of entrepreneurship, even more so than Russell Simmons of
Def-Jam Records on the east coast, because no one, but no one, had ever come off the
streets, turned their back on dealing dope and made it big.

But within days of his death, a struggle for his legacy was underway. Michael
Klein, the business manager at Ruthless, disputed Wright’s will which left the
company to Tomica, claiming Wright handed half of it over to him in 1992.
Tomica, a former assistant to the chairman of Motown, promised her dying husband
she would keep the company alive. Klein is also questioning Wright’s state of
mind when he married and the validity of making his wife a trustee of his
estate.

‘Tomica has the ambition and the ability to run the company. She has worked in
the business and knows the ins and outs and she is a bright woman, she’s no
airhead,’ says Ernie Singleton, formerly head of MCA’s black music division who
was brought in as acting president of Ruthless by the California Superior
Courts.

‘It was a real love affair,’ says a family friend. ‘She certainly loved him. And
Eric, he was obviously no goody-two shoes but he was extremely devoted. He
trusted her judgment.’ Ruthless is now back up on its feet - the doors were
padlocked for several weeks last spring after things started disappearing - but
the legal grind could, according to music industry experts, take another three
years. Ruthless may be whittled away in lawyers’ fees, taking Wright’s dream of
a large-scale, multi-media black company with it.

In America, there isn’t the pressure there is here on artists to produce fresh
sounds. Fans stay loyal. And from Eazy they expected the familiar rough stuff,
which largely explains why the subject matter on the first Eazy-E album and the
last is so similar, though musically the latter is richer, more layered, bumping
along in a smooth groove. Alongside the usual bravado and gang tales, there are
tracks such as Nuts On Ya Chin and Lickin, Suckin And Fuckin. The heavy sex
content of Eazy’s raps, the brazen assertion of sexual prowess, the arrogant
expectation of female submission, lend a grim irony to his death.

It almost goes without saying that he came from an environment where Aids is
very largely viewed as a gay disease. ‘It does have that image,’ says Cassandra
Ware, vice-president of Ruthless Records and a friend of Wright’s before she
arrived at the company. 'Because of that heavy machismo, like, ‘never will I
take it in the butt, so I’ll be all right’. And Eric was one of those machismos - ‘I’m a star, killer, handsome buck of a man. Any woman who comes my way I’ll
ride her’ - and it happened.’
-
@guy038
Hello guy038,
Thank you so much for taking the trouble to reply to my query. This is very helpful and kind of you.
I found your expression very useful, as there is no doubt yours is nicer and more powerful.
I ran it across a sample of my corpus and I noticed that the regex captures all the text lying between HIV/AIDS and gay or homosexual. This means that if the text contains a first reference to HIV/AIDS and, halfway through, a second reference to gay/homosexual (the references can be inverted, as your regex allows), the entire chunk of text is captured. This poses a problem, however, because the captures may not identify a real connection between gay and HIV. That is why I was thinking that maybe I should set some parameters (number of words), but then it definitely doesn’t capture everything.
I’ve attached a text where the above occurs.
Would you recommend using wild cards to capture also gays or homosexuals?
Thank you so much for your help and attention. You are really helping me get out of this horrible cul-de-sac!
Best wishes,
Ivan
(PS: the text is in my previous post. Apologies if this message is a repetition of the previous one. I don’t mean to harass you with lots of questions. I just thought that you might not have seen the previous reply because I didn’t tag you. Many thanks again for your generous help!)
-
Hi Ivan,
Sorry for my late reply. I tried to think deeply about your problem :-)
I’m wondering… Looking for any significant word in your second text, with the simple regex
gay|homosexual|hiv|aids|disease|virus|condition
I only found:
- The three words Aids, virus and HIV, in that order, inside the first paragraph
- The words Aids, gay and disease, in that order, inside the last paragraph
So, I’m a bit confused. Which relation would you like to detect with the help of regexes?
To my mind, the true problem is:
- How many pairs of words do you consider worth searching for?
- Which minimum/maximum distance must separate the two words of a pair?
It’s quite a difficult question, which seems to be more a linguistic matter than a regex matter! Even worse, if we consider that you might want a maximum distance d1 between two words and another maximum distance d2 between two other words!
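To make the distance question concrete, here is a hedged Python sketch of one possible reading: the two terms count as related only when at most max_words words separate them. Everything in it is an assumption made for illustration, not something settled in this thread: the helper name proximity_pattern, the optional plural s on gay and homosexual, and the choice to count distance in words. The sample is the sentence from the last paragraph of the second text, with its line break kept.

```python
import re

def proximity_pattern(max_words: int) -> re.Pattern:
    """Match a virus term and a sexuality term at most max_words words apart.

    Hypothetical helper: term lists and the word-based gap are assumptions.
    """
    virus = r"(?:HIV|AIDS)"
    sexuality = r"(?:gays?|homosexuals?)"
    # Up to max_words intervening words. \W+ also matches line breaks,
    # so a pair split across two lines is still found.
    gap = rf"(?:\W+\w+){{0,{max_words}}}?\W+"
    return re.compile(
        rf"\b{virus}\b{gap}{sexuality}\b|\b{sexuality}\b{gap}{virus}\b",
        re.IGNORECASE,
    )

sample = (
    "he came from an environment where Aids is\n"
    "very largely viewed as a gay disease."
)

near = proximity_pattern(10).search(sample)  # six words lie between the terms
far = proximity_pattern(3).search(sample)    # window too small for this pair

print(near.group(0) if near else None)
print(far)
```

Using a separate pattern per pair of terms would allow a different maximum distance for each pair, at the cost of one search pass per pattern.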
I’m quite optimistic about building regexes for specific purposes. That’s not the main problem. As usual, you just need to define exactly what you expect :-))
To fully understand the complexity of the simple search for TWO words inside any text (although it could have seemed easy at first sight!), let’s consider the example text below, in one line:
1----A----1----B----1----C----9----D----9----E----1----F----9----G----1----H----1----I----9----J----
Well. Suppose you’re looking for ranges of characters which either:
- Begin with the digit 1 and end with the digit 9
- Begin with the digit 9 and end with the digit 1
Between these two limits 1 and 9, there are ten ranges of characters, from zone A to zone J
Several interpretations are possible:
- With the regex
1\K.*?(?=9)|9\K.*?(?=1)
we obtain 5 ranges of characters, with their digit limits, below:
  - Zone 1----A----1----B----1----C----9
  - Zone 9----D----9----E----1
  - Zone 1----F----9
  - Zone 9----G----1
  - Zone 1----H----1----I----9
- With the regex
1\K(?=.*?9)|9\K(?=.*?1)
we obtain 9 ranges of characters, with their digit limits, below:
  - Zone 1----A----1----B----1----C----9
  - Zone 1----B----1----C----9
  - Zone 1----C----9
  - Zone 9----D----9----E----1
  - Zone 9----E----1
  - Zone 1----F----9
  - Zone 9----G----1
  - Zone 1----H----1----I----9
  - Zone 1----I----9
- With the regex
1\K[^1\r\n]*?(?=9)|9\K[^9\r\n]*?(?=1)
we obtain 5 ranges of characters, with their digit limits, below:
  - Zone 1----C----9
  - Zone 9----E----1
  - Zone 1----F----9
  - Zone 9----G----1
  - Zone 1----I----9
Notes:
- With the first regex, the regex engine matches, from the cursor location, either:
  - The shortest next range 1…9
  - The shortest next range 9…1
  So the regex engine alternately finds a 1…9 zone, then a 9…1 zone, then a 1…9 zone, and so on…
- With the second regex, the regex engine matches, from the cursor location, either:
  - The zero-length location just after the next limit 1, which begins a 1…9 zone
  - The zero-length location just after the next limit 9, which begins a 9…1 zone
- With the third regex, the regex engine matches, from the cursor location, either:
  - The shortest next range 1…9 which does NOT contain any limit 1 inside that range
  - The shortest next range 9…1 which does NOT contain any limit 9 inside that range
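The three interpretations can be compared side by side with a small Python sketch. One adaptation is an assumption on my part: Python's re module has no \K, so a lookbehind such as (?<=1) stands in for 1\K. On this sample text the lookbehind versions find the same zones, but the two constructs are not interchangeable in general. The variable names are illustrative.

```python
import re

text = ("1----A----1----B----1----C----9----D----9----E----"
        "1----F----9----G----1----H----1----I----9----J----")

# Lookbehinds replace \K here (assumption: equivalent on this sample only).
shortest = r"(?<=1).*?(?=9)|(?<=9).*?(?=1)"                 # first regex
starts = r"(?<=1)(?=.*?9)|(?<=9)(?=.*?1)"                   # second regex
innermost = r"(?<=1)[^1\r\n]*?(?=9)|(?<=9)[^9\r\n]*?(?=1)"  # third regex

first = re.findall(shortest, text)
# The second regex only yields zero-length matches, so record positions.
second = [m.start() for m in re.finditer(starts, text)]
third = re.findall(innermost, text)

print(len(first), first)
print(len(second), second)
print(len(third), third)
```

The matches exclude the limiting digits themselves, just as the \K and lookahead do in the original regexes; text[start - 1] gives the opening digit of each zone if it is needed.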
So, in that example, which kind of search would be pertinent ( or a new one, different from the three above ! ), to your mind ?!
See you later !
Best regards,
guy038
P.S. :
If the example text is split in several lines, as below :
1----A----1---
-B----1----C-
---9----D----9--
--E----1--
--F----9----G---
-1----H--
--1----I---
-9----J----
The 3 regexes, described above, must be rewritten, as below :
1\K(?s).*?(?=9)|9\K(?s).*?(?=1)
1\K(?s)(?=.*?9)|9\K(?s)(?=.*?1)
1\K(?s)[^1]*?(?=9)|9\K(?s)[^9]*?(?=1)
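The effect of the (?s) modifier can be sketched in Python as well, under two assumptions: re.DOTALL plays the role of (?s), and a lookbehind again replaces \K, which Python's re module lacks. Only the first of the three regexes is shown, and the names are illustrative.

```python
import re

lines = [
    "1----A----1---", "-B----1----C-", "---9----D----9--", "--E----1--",
    "--F----9----G---", "-1----H--", "--1----I---", "-9----J----",
]
multiline = "\n".join(lines)

# First regex, with lookbehinds standing in for \K (an adaptation).
shortest = r"(?<=1).*?(?=9)|(?<=9).*?(?=1)"

# Without DOTALL the dot cannot cross a line break.
without_s = re.findall(shortest, multiline)
# With DOTALL (the counterpart of (?s)) the dot also matches "\n".
with_s = re.findall(shortest, multiline, flags=re.DOTALL)

# Drop the line breaks to compare with the one-line zones.
zones = [m.replace("\n", "") for m in with_s]
print(len(without_s), len(with_s))
print(zones)
```

The third regex needs no dot at all: once \r\n is dropped from its negated classes, [^1] and [^9] already match line breaks.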
-
-
Hi @guy038,
Only now did I notice that my answer thanking you for your help didn’t go through for some reason. I very much appreciated your help and your advice on my issue.
In the end, I applied a slightly tweaked version of the first regex. I manually disambiguated all the possible combinations and chunks of text that contained the parameters ‘gay, gays, homosexual, Aids, HIV…’ and assessed the relevance and correctness of the results, text by text.
Thank you again for your help.
Keep in touch and have a good day,
Ivan