    Editing 600 mega XML file

    17 Posts 7 Posters 76.1k Views
    • RicardoR
      Ricardo @Maor Bachar
      last edited by

      @Maor-Bachar said:

      seem like it doesn’t have XML style?

      I think they unfortunately restrict extra coloring schemes to their paid version: http://www.editpadpro.com/cscs.html

      • AJ BaxterA
        AJ Baxter
        last edited by

        What size XML can you load in N++ ?

        • RicardoR
          Ricardo
          last edited by Ricardo

          I don’t think file type matters, but…
          With no plugins and no other file open, I can open files up to about 362 MB before the “File is too big to be opened” error appears.
          With my plugins enabled, the limit drops to about 150 MB.

          • RicardoR
            Ricardo
            last edited by Ricardo

            Hello, for people needing a x64 build of Notepad++ I made one here.

            Notes:

            1. This is an easy unofficial build – not tested by devs.
            2. You need 64-bit OS.
            3. It is not compatible with 32-bit plugins (i.e. all available plugins). You need to rename your /plugins folder or move it somewhere else.
            4. It can open very large files. Important: make sure “Word Wrap” is disabled before opening such files.
            5. The 7z package above contains only the binaries you need to replace. Please back up your current 32-bit files first (rename or move them).
            • David BaileyD
              David Bailey
              last edited by

              I really think that anyone who needs to edit a 600MB XML file should be thinking of alternatives! I mean, those files contain structures, and you really need to use something that will respect that structure, which a text editor does not.

              I don’t know what this file contains, but it might be useful to arrange for it to be stored as a number of much smaller files. Alternatively, I guess you could load its structure into Mathematica, manipulate it, and write it out again!
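
As a rough illustration of a structure-aware alternative, here is a minimal Python sketch that walks a large XML file element by element with the standard library's `xml.etree.ElementTree.iterparse`, so the whole tree is never held in memory. The function name `count_elements` and the `item` tag are made up for the example; nothing in this thread says what the real file contains.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count elements named `tag` while streaming through an XML source.

    `source` may be a file name or a binary file object.
    """
    count = 0
    # An "end" event fires once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Free the element's text and children. Empty shells stay attached
        # to their parent until the parent itself has been read.
        elem.clear()
    return count


if __name__ == "__main__":
    # Tiny in-memory stand-in for a huge file (hypothetical sample data).
    sample = b"<root><item id='1'><name>a</name></item><item id='2'/></root>"
    print(count_elements(io.BytesIO(sample), "item"))
```

For a real file, pass its path instead of the `io.BytesIO` object. The same loop could write each record to its own smaller file instead of counting it.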

              I don’t like to see a tool like NP++ being pushed to perform basically ridiculous tasks.

              David

              • Klaus LehmannK
                Klaus Lehmann
                last edited by

                hi david
                excuse me.
                be happy, that You never had to edit a 10GB-xml-file.
                for that, notepad++ is NOT Your friend ;-)
                but I want to give ricardo’s 64-bit edition a try! it’s new to me ;-)
                Yours klaus

                • David BaileyD
                  David Bailey
                  last edited by David Bailey

                  Klaus,

                  “be happy, that You never had to edit a 10GB-xml-file.”

                  I am indeed - but I am also happy that I have never tried to solder electronic components with a blow torch, or fry an egg on a smoothing iron! Even if I had, I wouldn’t try to make suggestions on a blow torch forum regarding improvements to blow torches that might make that more feasible!

                  I think a lot of software gradually bloats out - both in bytes and in terms of complexity. Ultimately valuable software can die that way! I think NP++ has remained consistently focussed on providing solutions for those who need to edit normally sized text files - particularly program source - and trying to add on outlandish capabilities would be a severe distraction.

                  Switching to 64-bits would probably only be the first step in a project to edit 10 GB data files, because I am sure there must be many processes inside NP++ that depend on being able to scan across an entire file in a sensible amount of time.

                  It might be more constructive if you described how you got into this mess, and someone might be able to offer some constructive suggestions!

                  As a preliminary suggestion, you could write a C program to read the file and recognise the data you want to change. You could then modify the program to actually perform the change (but I would back up your file before you start :) ).
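
The same read, recognise, then change approach can be sketched in a few lines. This version is in Python rather than C purely for brevity, and the function name `replace_stream` and the sample tags are invented for the example. It assumes the file has line breaks, because only one line is held in memory at a time.

```python
import io
import re


def replace_stream(src, dst, pattern, replacement):
    """Copy text from `src` to `dst` line by line, applying a regex substitution.

    Returns the number of substitutions made.
    """
    regex = re.compile(pattern)
    total = 0
    for line in src:  # only the current line is held in memory
        new_line, n = regex.subn(replacement, line)
        total += n
        dst.write(new_line)
    return total


if __name__ == "__main__":
    # Tiny in-memory stand-in for a large file (hypothetical sample data).
    src = io.StringIO("<price>10</price>\n<price>20</price>\n<name>x</name>\n")
    dst = io.StringIO()
    changed = replace_stream(src, dst, r"<(/?)price>", r"<\1cost>")
    print(changed)
    print(dst.getvalue(), end="")
```

With real files, open the input and a separate output file with `open(path, encoding=..., newline="")` and pass both in. Writing to a new file leaves the original untouched as the backup.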

                  David

                  • RicardoR
                    Ricardo
                    last edited by

                    woot
                    For editing a 10GB file, I guess the best would be an editor that doesn’t load the entire file into memory, but reads it directly from disk. Otherwise, you would need a lot of RAM.
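
To illustrate the idea of working from disk with a fixed amount of memory, here is a minimal Python sketch that scans a file in fixed-size chunks and counts occurrences of a byte string. It is not how any particular editor works; the function name `count_occurrences` and the 1 MiB default chunk size are assumptions made for the example.

```python
import io


def count_occurrences(stream, needle, chunk_size=1 << 20):
    """Count occurrences of `needle` (overlapping ones included) in a binary stream.

    Memory use is bounded by roughly chunk_size + len(needle) bytes.
    """
    if not needle:
        raise ValueError("needle must not be empty")
    keep = len(needle) - 1  # bytes carried over to catch matches split across chunks
    tail = b""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        pos = buf.find(needle)
        while pos != -1:
            total += 1
            pos = buf.find(needle, pos + 1)
        # The tail is shorter than the needle, so no match is counted twice.
        tail = buf[-keep:] if keep else b""
    return total


if __name__ == "__main__":
    # In-memory stand-in for a file opened with open(path, "rb").
    data = io.BytesIO(b"<item/>" * 1000)
    print(count_occurrences(data, b"<item/>", chunk_size=64))
```

Only one chunk plus a short tail is in memory at any time, so the size of the file does not affect how much RAM is needed.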

                    • guy038G
                      guy038
                      last edited by guy038

                      Hello Maor Bachar and All,

                      Why don’t you try the old but excellent scripting program gawk, which runs on both Unix and Windows machines (with the appropriate executable, of course)?

                      AFAIK, the latest Windows version of gawk.exe (v4.1.0) is available at the address below:

                      https://code.google.com/p/gnu-on-windows/downloads/detail?name=gawk-4.1.0-bin.zip&can=2&q=

                      To get an overview of the main features of gawk and what’s new in the v4.1.0 version, follow the link :

                      http://www.drdobbs.com/open-source/gnu-awk-this-is-not-your-fathers-awk/240158351

                      And here is the link to download the latest reference manual for GAWK v4.1.x, in various formats:

                      http://www.gnu.org/software/gawk/manual/

                      The PDF version is quite recent: April 2015!


                      To test it, I created a 1 GB file. (I didn’t create a 10 GB file, as my old Win XP laptop, with only two 40 GB partitions, would not accept it!) But when I go back to work (I’m presently on holiday), I’ll be able to build bigger files!

                      Briefly, I first created, in N++, a file of about 200 MB, named xxx.txt. Then I copied this file, five times, into the All.txt file, in a command prompt window, with the commands:

                      type xxx.txt>All.txt
                      type xxx.txt>>All.txt
                      type xxx.txt>>All.txt
                      type xxx.txt>>All.txt
                      type xxx.txt>>All.txt
                      

                      I ended the All.txt file with a last line:

                      echo END of the FILE>>All.txt
                      

                      Line lengths range from 0 to 150 characters. In the end, the All.txt file contains about 18 498 000 lines, for 1026 MB!


                      I used the simple gawk script below, named Script.txt. It’s IMPORTANT to note that this file is ANSI encoded

                      #----------------------------------------------------------------------------------------------------------------#
                      #  SYNTAX to RUN, in a CMD windows, from the FOLDER, where are the 3 files GAWK.exe, Script.txt and All.txt      #
                      #            ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯                        ¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯¯¯     ¯¯¯¯¯¯¯      #
                      #                                                                                                                #
                      #  SYNTAX :                                                                                                      #
                      #  ¯¯¯¯¯¯                                                                                                        #
                      #                                                                                                                #
                      #    echo [Regex|String]|gawk -f Script.txt - All.txt[ File2.txt[ File3[ ...]]][ >[>]See.txt]                    #
                      #----------------------------------------------------------------------------------------------------------------#
                      
                      {
                        if ( pattern == "" )
                          {
                      
                            #---------------------------------------------------------------------------#
                            #  The STANDARD INPUT -  is read, BEFORE the USER FILE(S), so ALL the TEXT  #
                            #    of the DOS COMMAND 'echo' will be STORED in the VARIABLE 'pattern'     #
                            #                                                                           #
                            #   Note : No BLANK, at END of "echo" command, unless part of the REGEX     #
                            #                                                                           #
                            #  IF the PATTERN is NOT initialized  :                                     #
                            #    SET the VARIABLE 'total' to the value 0                                #
                            #    SET the VARIABLE 'pattern' to the value of the LINE field $0           #
                            #      = ALL the TEXT of the DOS COMMAND 'echo'                             #
                            #    SKIP to the NEXT line to READ                                          #
                            #---------------------------------------------------------------------------#
                      
                            total = 0 ; pattern = $0
                      
                            next
                          }
                      }
                      
                      #------------------------------------------------------------------------------#
                      #    IF the CURRENT line MATCHES the PATTERN of the VARIABLE 'pattern' :       #
                      #      INCREMENT, by ONE, the VARIABLE 'total' [ and PRINT the CURRENT line ]  #
                      #    SKIP to the NEXT line to READ, in ALL cases                               #
                      #------------------------------------------------------------------------------#
                      
                      {
                        if ($0 ~ pattern) { ++total }
                      
                      #   OR     if ($0 ~ pattern) { ++total ; print }
                      
                        next
                      }
                      
                      
                      #-------------------------------------------------------------------------------------------------------------------#
                      #  AFTER ALL the USER files are READ, DISPLAYS the NUMBER of lines, MATCHING the PATTERN of the VARIABLE 'pattern'  #
                      #-------------------------------------------------------------------------------------------------------------------#
                      
                      END { print "\n  Number of lines MATCHING \x22" pattern "\x22  : " , total }
                      

                      The command echo .*|gawk -f Script.txt - all.txt gives me the number 18 498 231, which is the number of lines matching the regex .*

                      The command echo END of the FILE|gawk -f Script.txt - all.txt gives me the number 1, which shows that it correctly read the last line, the 18 498 231st, of the 1 GB file All.txt!
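
For readers without gawk at hand, the counting logic of the script can be approximated in Python. This is only a sketch: the function name `count_matching_lines` is made up, and Python's `re` syntax is not identical to gawk's regex dialect, so patterns may need adjusting.

```python
import io
import re


def count_matching_lines(lines, pattern):
    """Count the lines in which `pattern` is found, reading one line at a time."""
    regex = re.compile(pattern)
    total = 0
    for line in lines:
        # Like $0 in awk, the text that is matched excludes the line terminator.
        if regex.search(line.rstrip("\r\n")):
            total += 1
    return total


if __name__ == "__main__":
    # Small in-memory stand-in for All.txt (hypothetical sample data).
    sample = io.StringIO("alpha\n\nbeta\nEND of the FILE\n")
    print(count_matching_lines(sample, r"END of the FILE"))
```

With a real file, pass the object returned by `open("All.txt", encoding=...)`, using whatever encoding the file actually has. Iterating over it reads one line at a time, so the file size is not limited by RAM.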

                      On my old Win XP laptop, with only 1 GB of RAM (2 × 512 MB), I got the result from each command in about 1 min 30 s! Not too bad, is it?

                      Of course, my script just counts the matches but, with GAWK, you can manipulate files in multiple ways, do calculations or searches/replacements or … It’s a very powerful tool, although tiny in size: 223 246 bytes for the v4.1 version.

                      You’ll just have to skim the gawk documentation a bit!

                      And, as Ricardo suggested, it seems to read files directly from disk :-) => seemingly no limit in size

                      Best Regards,

                      guy038

                      • Klaus LehmannK
                        Klaus Lehmann @Ricardo
                        last edited by

                        @Ricardo
                        I think crisp (64bit) can do it!
                        but it costs 100-250 per Year!

                        and for my biggest files (10GB files with rubbish-xml-gore), I have to use: VEDIT Pro (64bit)
                        attention: high price! $240 (no joke!)

                        Yours klaus
