Community
    • Login

    Replacing in specific columns (More difficult than you'd think)

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    regexcolumnregex
    6 Posts 2 Posters 490 Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • LingEdL
      LingEd
      last edited by

      I have some twitter data in the form of TXT files that I’m using for academic purposes. For example, here’s some tweets by Senator Chuck Schumer:
      Chuck Schumer
      As you can see there’s a tweet code, date and time, time zone code, twitter handle, and the content of the tweet, and the sections are delimited by space, which is obviously an issue as the contents of the tweets contain many spaces. I tried column select but replace in selection is greyed out
      ChuckSchumer2.png
      Replace all works fine here but I can’t select the specific columns (option is not greyed out if normal select is used). I was thinking about using Regex for this but I’m not sure how to get it to do what I want. Ideally code where I could simply replace the first five spaces in a line into commas would be ideal, since tweet code lengths can vary a little as you go back, but I don’t know how to do that. Maybe a Macro?

      PeterJonesP 1 Reply Last reply Reply Quote 0
      • PeterJonesP
        PeterJones @LingEd
        last edited by

        @LingEd ,

        data:

        123456786 1999-12-31 23:59:57 -0400 <username> Three! with more spaces
        123456787 1999-12-31 23:59:58 -0400 <username> Two! with more spaces
        123456788 1999-12-31 23:59:59 -0400 <username> One! with more spaces
        123456789 1900-01-01 00:00:00 -0400 <username> Happy Y2K Bug!
        
        • FIND = ^(\S+) (\S+) (\S+) (\S+)\x20
          (I used a \x20, which is equivalent to a space character at the end to make it obvious that there’s something at the end, so you will get it when you copy/paste; if typing the regex, you could just use a space after the last parentheses)
        • REPLACE = $1,$2,$3,$4,
        • Search Mode = Regular expression
        123456786,1999-12-31,23:59:57,-0400,<username> Three! with more spaces
        123456787,1999-12-31,23:59:58,-0400,<username> Two! with more spaces
        123456788,1999-12-31,23:59:59,-0400,<username> One! with more spaces
        123456789,1900-01-01,00:00:00,-0400,<username> Happy Y2K Bug!
        
        LingEdL 1 Reply Last reply Reply Quote 1
        • LingEdL
          LingEd
          last edited by

          @PeterJones Thanks! Works like a charm

          1 Reply Last reply Reply Quote 0
          • LingEdL
            LingEd @PeterJones
            last edited by

            @PeterJones OK, sorry to bother, but new problem. When I import the altered files into Excel with comma as delimiter it creates too many columns when the tweets contain commas themselves. I was thinking what I could do is replace the 6th+ instances of commas (I altered the regex code to replace the first 5 commas instead of the first 4) in a line with a very uncommon character like “ɤ” and then search/replace that character in excel after the fact. The thing is, I don’t know how to write the Regex code to replace not just the 6th instance, but the 7th, 8th, 9th, etc instances. Thanks for the help!

            PeterJonesP 1 Reply Last reply Reply Quote 0
            • PeterJonesP
              PeterJones @LingEd
              last edited by PeterJones

              @LingEd said in Replacing in specific columns (More difficult than you'd think):

              I altered the regex code to replace the first 5 commas instead of the first 4

              Congratulations. That means you understood what was going on. Knowing that people learn from what I write, rather than just copy/pasting and moving on, is always a good feeling.

              @PeterJones OK, sorry to bother, but new problem. When I import the altered files into Excel with comma as delimiter it creates too many columns when the tweets contain commas themselves. I was thinking what I could do is replace the 6th+ instances of commas (…) in a line with a very uncommon character like “ɤ” and then search/replace that character in excel after the fact. The thing is, I don’t know how to write the Regex code to replace not just the 6th instance, but the 7th, 8th, 9th, etc instances. Thanks for the help!

              That’s one good idea. If I were to do it that way, step 1 would be to just replace all commas with ɤ. Step 2 would be your 5-space-to-comma replacement from above.

              But since you’re trying to make valid CSV to open in Excel, CSV has a way of putting quotes around a field so that any commas inside will be treated as part of the text, not as a field separator. But that would mean that if you have any text with quotes in it, that will get messed up. But there’s a way around that by escaping the quote by changing any " to "". So my procedure for what I think what you want with your data:

              1. Search Mode = regular expression for all of this
              2. FIND = "
                REPLACE = "" to escape the quotes
              3. FIND = ^(\S+) (\S+) (\S+) (\S+) (\S+) (.*$)
                REPLACE = $1,$2,$3,$4,$5,"$6" to change spaces to commas and to put quotes around the text.
              LingEdL 1 Reply Last reply Reply Quote 1
              • LingEdL
                LingEd @PeterJones
                last edited by

                @PeterJones can’t believe I didn’t think of that. Thanks!

                1 Reply Last reply Reply Quote 0
                • First post
                  Last post
                The Community of users of the Notepad++ text editor.
                Powered by NodeBB | Contributors