Replacing in specific columns (More difficult than you'd think)
-
I have some twitter data in the form of TXT files that I’m using for academic purposes. For example, here’s some tweets by Senator Chuck Schumer:
As you can see there’s a tweet code, date and time, time zone code, twitter handle, and the content of the tweet, and the sections are delimited by space, which is obviously an issue as the contents of the tweets contain many spaces. I tried column select but replace in selection is greyed out
Replace all works fine here but I can’t select the specific columns (option is not greyed out if normal select is used). I was thinking about using Regex for this but I’m not sure how to get it to do what I want. Ideally code where I could simply replace the first five spaces in a line into commas would be ideal, since tweet code lengths can vary a little as you go back, but I don’t know how to do that. Maybe a Macro? -
@LingEd ,
data:
123456786 1999-12-31 23:59:57 -0400 <username> Three! with more spaces 123456787 1999-12-31 23:59:58 -0400 <username> Two! with more spaces 123456788 1999-12-31 23:59:59 -0400 <username> One! with more spaces 123456789 1900-01-01 00:00:00 -0400 <username> Happy Y2K Bug!
- FIND =
^(\S+) (\S+) (\S+) (\S+)\x20
(I used a\x20
, which is equivalent to a space character at the end to make it obvious that there’s something at the end, so you will get it when you copy/paste; if typing the regex, you could just use a space after the last parentheses) - REPLACE =
$1,$2,$3,$4,
- Search Mode = Regular expression
123456786,1999-12-31,23:59:57,-0400,<username> Three! with more spaces 123456787,1999-12-31,23:59:58,-0400,<username> Two! with more spaces 123456788,1999-12-31,23:59:59,-0400,<username> One! with more spaces 123456789,1900-01-01,00:00:00,-0400,<username> Happy Y2K Bug!
- FIND =
-
@PeterJones Thanks! Works like a charm
-
@PeterJones OK, sorry to bother, but new problem. When I import the altered files into Excel with comma as delimiter it creates too many columns when the tweets contain commas themselves. I was thinking what I could do is replace the 6th+ instances of commas (I altered the regex code to replace the first 5 commas instead of the first 4) in a line with a very uncommon character like “ɤ” and then search/replace that character in excel after the fact. The thing is, I don’t know how to write the Regex code to replace not just the 6th instance, but the 7th, 8th, 9th, etc instances. Thanks for the help!
-
@LingEd said in Replacing in specific columns (More difficult than you'd think):
I altered the regex code to replace the first 5 commas instead of the first 4
Congratulations. That means you understood what was going on. Knowing that people learn from what I write, rather than just copy/pasting and moving on, is always a good feeling.
@PeterJones OK, sorry to bother, but new problem. When I import the altered files into Excel with comma as delimiter it creates too many columns when the tweets contain commas themselves. I was thinking what I could do is replace the 6th+ instances of commas (…) in a line with a very uncommon character like “ɤ” and then search/replace that character in excel after the fact. The thing is, I don’t know how to write the Regex code to replace not just the 6th instance, but the 7th, 8th, 9th, etc instances. Thanks for the help!
That’s one good idea. If I were to do it that way, step 1 would be to just replace all commas with
ɤ
. Step 2 would be your 5-space-to-comma replacement from above.But since you’re trying to make valid CSV to open in Excel, CSV has a way of putting quotes around a field so that any commas inside will be treated as part of the text, not as a field separator. But that would mean that if you have any text with quotes in it, that will get messed up. But there’s a way around that by escaping the quote by changing any
"
to""
. So my procedure for what I think what you want with your data:- Search Mode = regular expression for all of this
- FIND =
"
REPLACE =""
to escape the quotes - FIND =
^(\S+) (\S+) (\S+) (\S+) (\S+) (.*$)
REPLACE =$1,$2,$3,$4,$5,"$6"
to change spaces to commas and to put quotes around the text.
-
@PeterJones can’t believe I didn’t think of that. Thanks!