    Editing 600 mega XML file

    • Ricardo @Maor Bachar

      @Maor-Bachar said:

      seem like it doesn’t have XML style?

      I think they unfortunately restrict extra coloring schemes to their paid version: http://www.editpadpro.com/cscs.html

      • AJ Baxter

        What size of XML file can you load in N++?

        • Ricardo

          I don’t think file type matters, but…
          With no plugins and no other file open, I can open files of up to about 362 MB before the “File is too big to be opened” error appears.
          With my plugins enabled, the limit drops to about 150 MB.

          • Ricardo

            Hello, for people needing an x64 build of Notepad++, I made one here.

            Notes:

            1. This is an easy unofficial build – not tested by devs.
            2. You need 64-bit OS.
            3. It is not compatible with 32-bit plugins (i.e. all available plugins). You need to rename your /plugins folder or move it elsewhere.
            4. It can open very large files. Important: make sure “Word Wrap” is disabled before opening such files.
            5. The 7z package above contains only the binaries you need to replace. Please back up your current 32-bit files first (rename or move them).

            • David Bailey

              I really think that anyone who needs to edit a 600MB XML file should be thinking of alternatives! I mean, those files contain structures, and you really need to use something that will respect that structure - which isn’t a text editor.

              I don’t know what this file contains, but it might be useful to arrange for it to be stored as a number of much smaller files. Alternatively, I guess you could load its structure into Mathematica, manipulate it, and write it out again!
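
              As a rough illustration of a tool that "respects the structure" without loading the whole file, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree.iterparse`. The function name `count_elements` and the tag `record` are made up for the example; nothing is known about the original poster's file, so treat this as a sketch under those assumptions rather than a recipe.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count elements named `tag` while streaming through an XML document.

    `source` may be a filename or a binary file object. Matching elements
    are cleared once counted, so their content does not pile up in memory
    (their empty shells do stay attached to the parent element).
    """
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # discard the text, attributes and children
    return count


if __name__ == "__main__":
    # Tiny in-memory document standing in for a very large file.
    sample = b"<data><record id='1'/><record id='2'/><other/></data>"
    print(count_elements(io.BytesIO(sample), "record"))
```

              The same loop could write each record out to its own smaller file, which is one way to get the "many smaller files" layout mentioned above.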

              I don’t like to see a tool like NP++ being pushed to perform basically ridiculous tasks.

              David

              • Klaus Lehmann

                Hi David,
                Excuse me.
                Be happy that you never had to edit a 10 GB XML file.
                For that, Notepad++ is NOT your friend ;-)
                But I want to give Ricardo’s 64-bit edition a try! It’s new to me ;-)
                Yours, Klaus

                • David Bailey

                  Klaus,

                  “be happy, that You never had to edit a 10GB-xml-file.”

                  I am indeed - but I am also happy that I have never tried to solder electronic components with a blow torch, or fry an egg on a smoothing iron! Even if I had, I wouldn’t try to make suggestions on a blow torch forum regarding improvements to blow torches that might make that more feasible!

                  I think a lot of software gradually bloats out - both in bytes and in terms of complexity. Ultimately valuable software can die that way! I think NP++ has remained consistently focussed on providing solutions for those who need to edit normally sized text files - particularly program source - and trying to add on outlandish capabilities would be a severe distraction.

                  Switching to 64-bits would probably only be the first step in a project to edit 10 GB data files, because I am sure there must be many processes inside NP++ that depend on being able to scan across an entire file in a sensible amount of time.

                  It might be more constructive if you described how you got into this mess, and someone might be able to offer some constructive suggestions!

                  As a preliminary suggestion, you could write a C program that reads the file and recognises the data you want to change. You could then modify the program to actually perform the change (but I would back up your file before you start :) ).
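
                  The read, recognise, then change approach can be sketched in a few lines. The post suggests C; the sketch below uses Python only for brevity. The function name `replace_in_file` and the file names are hypothetical. It assumes the text to change never spans a line break and that each individual line fits in memory (an XML file written as one giant line would defeat it).

```python
import os
import tempfile


def replace_in_file(src_path, dst_path, old, new):
    """Copy src_path to dst_path line by line, replacing `old` with `new`.

    Works on bytes and returns the number of lines that contained `old`.
    The source file is never modified, so it doubles as the backup.
    """
    changed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:  # only one line is held in memory at a time
            if old in line:
                line = line.replace(old, new)
                changed += 1
            dst.write(line)
    return changed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.xml")
        dst = os.path.join(tmp, "out.xml")
        with open(src, "wb") as f:
            f.write(b"<a>old</a>\n<b>keep</b>\n<c>old</c>\n")
        print(replace_in_file(src, dst, b"old", b"new"))
```

                  Writing to a second file rather than editing in place means a mistake in the pattern costs nothing but disk space.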

                  David

                  • Ricardo

                    woot
                    For editing a 10GB file, I guess the best option would be an editor that doesn’t load the entire file into memory but reads it directly from disk. Otherwise, you would need a lot of RAM.
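
                    One way a program can work on a file without reading it all into memory is memory mapping, where the operating system pages in only the parts that are touched. The Python sketch below illustrates the idea with the standard `mmap` module; it is not a claim about how any particular editor works. The function name `find_offsets` is made up, and a 64-bit interpreter is assumed for multi-gigabyte files, since a 32-bit address space cannot map them.

```python
import mmap
import os
import tempfile


def find_offsets(path, needle, limit=10):
    """Return up to `limit` byte offsets at which `needle` occurs in the file.

    The file is memory-mapped read-only, so the OS loads pages on demand
    instead of the program reading the whole file. Empty files cannot be
    mapped and raise ValueError.
    """
    offsets = []
    with open(path, "rb") as f:
        # Length 0 means "map the whole file".
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1 and len(offsets) < limit:
                offsets.append(pos)
                pos = mm.find(needle, pos + 1)
    return offsets


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "big.txt")
        with open(path, "wb") as f:
            f.write(b"<x/>\n" * 3)
        print(find_offsets(path, b"<x/>"))
```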

                    • guy038

                      Hello Maor Bachar and All,

                      Why not give the old but excellent scripting program GAWK.exe a try? It can be used on both Unix and Windows machines (with the appropriate executable file, of course).

                      AFAIK, you can get the latest Windows version of gawk.exe (v4.1.0) at the address below:

                      https://code.google.com/p/gnu-on-windows/downloads/detail?name=gawk-4.1.0-bin.zip&can=2&q=

                      For an overview of the main features of gawk and what’s new in v4.1.0, follow this link:

                      http://www.drdobbs.com/open-source/gnu-awk-this-is-not-your-fathers-awk/240158351

                      And here is the link to download the latest reference manual for GAWK v4.1.x, in various formats:

                      http://www.gnu.org/software/gawk/manual/

                      The PDF version is quite recent: April 2015!


                      To test it, I created a 1 GB file. (I didn’t create a 10 GB file, as my old Win XP laptop, with only two 40 GB partitions, would not accept it!) But when I go back to work (I’m currently on holiday), I’ll be able to build bigger files!

                      Briefly, I first created, in N++, a file of about 200 MB named xxx.txt. Then I copied this file five times into the All.txt file, in a DOS window, with the commands:

                      type xxx.txt>All.txt
                      type xxx.txt>>All.txt
                      type xxx.txt>>All.txt
                      type xxx.txt>>All.txt
                      type xxx.txt>>All.txt
                      

                      I ended the All.txt file with a last line

                      echo END of the FILE>>All.txt
                      

                      Line lengths range from 0 to 150 characters. In the end, the All.txt file contains about 18 498 000 lines, for 1026 MB!


                      I used the simple gawk script below, named Script.txt. It’s IMPORTANT to note that this file is ANSI encoded:

                      #----------------------------------------------------------------------------------------------------------------#
                      #  SYNTAX to RUN, in a CMD windows, from the FOLDER, where are the 3 files GAWK.exe, Script.txt and All.txt      #
                      #            ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯                        ¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯¯¯     ¯¯¯¯¯¯¯      #
                      #                                                                                                                #
                      #  SYNTAX :                                                                                                      #
                      #  ¯¯¯¯¯¯                                                                                                        #
                      #                                                                                                                #
                      #    echo [Regex|String]|gawk -f Script.txt - All.txt[ File2.txt[ File3[ ...]]][ >[>]See.txt]                    #
                      #----------------------------------------------------------------------------------------------------------------#
                      
                      {
                        if ( pattern == "" )
                          {
                      
                            #---------------------------------------------------------------------------#
                            #  The STANDARD INPUT -  is read, BEFORE the USER FILE(S), so ALL the TEXT  #
                            #    of the DOS COMMAND 'echo' will be STORED in the VARIABLE 'pattern'     #
                            #                                                                           #
                            #   Note : No BLANK, at END of "echo" command, unless part of the REGEX     #
                            #                                                                           #
                            #  IF the PATTERN is NOT initialized  :                                     #
                            #    SET the VARIABLE 'total' to the value 0                                #
                            #    SET the VARIABLE 'pattern' to the value of the LINE field $0           #
                            #      = ALL the TEXT of the DOS COMMAND 'echo'                             #
                            #    SKIP to the NEXT line to READ                                          #
                            #---------------------------------------------------------------------------#
                      
                            total = 0 ; pattern = $0
                      
                            next
                          }
                      }
                      
                      #------------------------------------------------------------------------------#
                      #    IF the CURRENT line MATCHES the PATTERN of the VARIABLE 'pattern' :       #
                      #      INCREMENT, by ONE, the VARIABLE 'total' [ and PRINT the CURRENT line ]  #
                      #    SKIP to the NEXT line to READ, in ALL cases                               #
                      #------------------------------------------------------------------------------#
                      
                      {
                        if ($0 ~ pattern) { ++total }
                      
                      #   OR     if ($0 ~ pattern) { ++total ; print }
                      
                        next
                      }
                      
                      
                      #-------------------------------------------------------------------------------------------------------------------#
                      #  AFTER ALL the USER files are READ, DISPLAYS the NUMBER of lines, MATCHING the PATTERN of the VARIABLE 'pattern'  #
                      #-------------------------------------------------------------------------------------------------------------------#
                      
                      END { print "\n  Number of lines MATCHING \x22" pattern "\x22  : " , total }
                      

                      The command echo .*|gawk -f Script.txt - all.txt gives me 18 498 231, which is the number of lines matching the regex .*

                      The command echo END of the FILE|gawk -f Script.txt - all.txt gives me 1, showing that it correctly read the last line, the 18 498 231st, of the 1 GB file All.txt!

                      On my old Win XP laptop, with only 1 GB of RAM (2 × 512 MB), I got the result from each command in about 1 min 30 s! Not too bad, is it?

                      Of course, my script just counts the matches but, with GAWK, you can manipulate files in multiple ways: do calculations, searches/replacements, or … It’s a very powerful tool, although tiny in size: 223 246 bytes for the v4.1 version.

                      You’ll just have to skim through the gawk documentation a bit!

                      And, as Ricardo suggested, it seems to read files directly from disk :-) => seemingly no limit in size
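
                      For readers without gawk at hand, the counting logic of the script can be approximated in Python. This is only a rough counterpart, not a translation: Python's `re` syntax differs from awk's in places, and the function name `count_matching_lines` is made up for the example.

```python
import io
import re


def count_matching_lines(lines, pattern):
    """Count the lines in `lines` (any iterable of str) that match `pattern`.

    Like the `$0 ~ pattern` test in awk, the regex may match anywhere in
    the line. Lines are consumed one at a time, so a file object can be
    passed in without loading the whole file.
    """
    regex = re.compile(pattern)
    total = 0
    for line in lines:
        # Drop the line terminator so the regex only sees the line's text.
        if regex.search(line.rstrip("\r\n")):
            total += 1
    return total


if __name__ == "__main__":
    sample = "first\n\nEND of the FILE\n"
    print(count_matching_lines(io.StringIO(sample), r".*"))
    print(count_matching_lines(io.StringIO(sample), r"END of the FILE"))
```

                      For a real file, something like `with open("All.txt", encoding="latin-1") as f:` followed by `count_matching_lines(f, r".*")` should work; `latin-1` is chosen here only because it can decode any byte sequence.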

                      Best Regards,

                      guy038

                      • Klaus Lehmann @Ricardo

                        @Ricardo
                        I think crisp (64-bit) can do it!
                        But it costs 100-250 per year!

                        And for my biggest files (with rubbish XML gore), around 10 GB, I have to use VEDIT Pro (64-bit).
                        Warning: high price! $240 (no joke!)

                        Yours klaus
