xbeta
Large dataset processing

Tips

  • Process files line by line whenever possible. Avoid loading the entire file into memory.

  • If disk space is an issue, keep the raw data compressed and process the compressed files in place using pipes, so they never have to be uncompressed on disk. In most languages, it is possible to read from a pipe (e.g. gzip output) just as one reads from a file.

    • In Perl, you can read from a pipe just as you would a file:

      # '-|' opens the command for reading; each handle then behaves like a file handle
      open(my $zip,  '-|', 'unzip -p file.zip file.csv') or die $!;
      open(my $bzip, '-|', 'bzip2 -dck file.csv.bz2')    or die $!;
      open(my $gzip, '-|', 'gzip -dc file.csv.gz')       or die $!;

    • SAS also supports pipes. Suppose data.zip and data.csv.bz2 each contain a single CSV file:

      filename fh1 pipe 'unzip -p data.zip "*.csv"';
      filename fh2 pipe 'bzip2 -dc data.csv.bz2';

      In the data step, pass the fileref to the infile statement:

      data fh1;
          infile fh1 dsd firstobs=2 lrecl=8192;
          [...]

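      You can also read several archives through a single pipe by looping in the shell command. The outputs are concatenated, so if every CSV has a header line, firstobs=2 skips only the first one: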

      filename fh pipe 'for i in *.zip; do
          unzip -p "$i" "*.csv";
      done';
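
The same two tips (read line by line, and read from a decompression pipe) can be combined in other languages too. Below is a minimal sketch in Python, assuming the decompressor is available on the PATH. The names stream_lines and count_rows are hypothetical helpers introduced for illustration; they are not part of any library.

```python
import subprocess


def stream_lines(command):
    """Yield the lines written to stdout by `command`, one at a time.

    Assumption: `command` is an argument list, for example
    ["gzip", "-dc", "file.csv.gz"], whose output is text.
    """
    proc = subprocess.Popen(command, stdout=subprocess.PIPE, text=True)
    try:
        # Iterating over the pipe reads one line at a time, so the
        # uncompressed data is never held in memory all at once.
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    # Reached only when the output was read to the end.
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, command)


def count_rows(command):
    """Count output lines without storing them."""
    return sum(1 for _ in stream_lines(command))
```

For example, count_rows(["bzip2", "-dc", "data.csv.bz2"]) would count the lines of the compressed CSV without writing an uncompressed copy to disk. Passing the command as a list avoids going through a shell, so file names with spaces need no extra quoting.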