Reads data from a foreign file format, and loads it or converts it to a spreadsheet file (D.B. Baird).
Options
PRINT = string token |
What information to print (catalogue , summary ); default cata |
---|---|
OUTTYPE = string token |
Output file type (GEN , GSH , GWB , XLS , XLSX , TXT , CSV , SHEETS ); default GWB |
METHOD = string token |
Whether to load data into the Genstat server after creating the file, or whether merely to create the file (create , load ); default load |
IMETHOD = string token |
How identifiers are to be specified for the columns (read , supply , none ); default supply if COLUMNS is set (and specifies names rather than just types), otherwise read |
ENDSTATEMENT = string token |
Ending statement for a type GEN output file (return , endbreak ); default retu |
SPSSMV = string token |
What to do with SPSS missing value codes (ignore , convert ); default conv |
MISSING = text |
What labels represent missing values in Excel, Quattro or Lotus files; default '*' |
FORDER = string token |
The order in which to define the labels or levels of a factor read in (sorted , unsorted ); default sort |
TEXTCONVERSION = string token |
How to convert text to numbers for the columns (strict , single , common , standard , lax ); default stan |
KEEPEMPTY = string tokens |
Whether to retain any empty rows or columns found in the data (rows , columns , none ); default none |
NAMEROW = scalar |
The row number within an Excel or Quattro spreadsheet which contains the column names (IMETHOD must be unset or set to read ); default, the first row in CELLRANGE |
EMETHOD = string token |
Whether to read column descriptions/extra from Excel, SigmaPlot or Quattro spreadsheets (read , none ); default none |
EXTRAROW = scalar |
The row number within an Excel or Quattro spreadsheet which contains the column descriptions (EMETHOD must be set to read ); default, the second row in CELLRANGE |
PREFIX = text |
The string with which to prefix numerical column names; default '%' |
TEMPMISSING = string token |
Whether to read in temporarily missing values as missing (yes , no ); default no |
INOPTIONS = text |
Optional input file arguments to be passed to the Dataload.dll |
OUTOPTIONS = text |
Optional output file arguments to be passed to the Dataload.dll |
RGBMETHOD = string token |
How to read colour values (combined , separate , matrix ); default sepa |
SEPARATORS = text |
Alternative separators to use in text or csv files |
SCOPE = string token |
Whether to create the data locally in a procedure that is using IMPORT , or globally in the whole program (local , global ); default loca |
IPREFIX = text |
Prefix to use with unnamed columns, default 'C' |
TRANSPOSE = string token |
Whether to transpose the rows and columns of the input file (yes , no ); default no |
UNICODE = string token |
What to do with Unicode characters found e.g. in an Excel XLSX input file (utf8 , typeset , ascii , remove ); default utf8 |
COLUNICODENAMES = string token |
How to convert Unicode column names (suffix , extra , ignore ) default suff |
UNINAME = text |
Name of the pointer for Unicode column names used as suffixes; default ‘C ‘ |
Parameters
FILE = texts |
Input file or URL to be read |
---|---|
OUTFILE = texts |
Name of the output file to be created; if this is not provided a temporary file will be created, and then deleted if the data is loaded |
SHEETNAME = texts or scalars |
Name of a spreadsheet worksheet or named range, or number of a worksheet within the file; default is the first sheet in the file |
CELLRANGE = texts |
Cell range within a worksheet, giving the top left and bottom right cell in the format XXNN:XXNN where XX = A – IV , NN = 1 – 64384; default * requests all data on the sheet |
COLUMNS = texts |
Names and/or type codes for the columns read (the type of column can be forced by ending the column name, if supplied, with the code ! for a factor, # for a variate, and $ for a text), using a name of '*' will cause a column to be dropped |
ISAVE = pointers |
Saves the identifiers of the columns |
START = texts |
Contents of a cell in a spreadsheet file or a line in a text file from which to start reading |
END = texts |
Contents of a cell in a spreadsheet file or a line in a text file at which to end reading |
ANCILLARY = texts |
Extra information returned by some file formats (currently only population type from QTL location files) |
ROWSELECTION = variates |
Numbers of the rows to import; if unset, all rows are imported |
COLSELECTION = variates or texts |
Numbers or names of the columns to import; if unset, all the columns are imported |
Description
The name of the file, containing the data values to be imported, is specified by the FILE
parameter. This can also be an internet URL prefixed with http://, https:// ftp:// or file://. The data source is then downloaded and imported.
Data in the supported file formats are extracted and saved in the specified file format, depending on the extension of OUTFILE
. If this is not provided, the type is indicated by the OUTTYPE
option, as either GEN
(Genstat Command file), GSH
(Genstat Spreadsheet), GWB
(Genstat Spreadsheet Book), XLS
(Excel 5 Spreadsheet), XLSX
(Excel 2007 Spreadsheet), TXT
(ASCII Text file) or CSV
(comma-delimited file); the default is GSH
. Setting OUTTYPE=SHEETS
reads in the worksheet names in a spreadsheet file (Excel/Quattro/Sigmaplot or SAS Transport) into a text named Worksheets
. The ENDSTATEMENT
option specifies the ending statement type for a type GEN
output file: either RETURN
(the default) or ENDBREAK
). You can set ENDSTATEMENT=*
if you do not want an ending statement.
The PRINT
option controls printed output, with the following settings:
catalogue |
lists the contents of the file (default); and |
---|---|
summary |
prints a summary of the values in each data structure in the file. |
If METHOD=load
, the resulting file is read in to Genstat data structures. When IMPORT
is used within a procedure, the SCOPE
option controls whether the structures are created locally in the procedure (default), or globally in the main program.
In spreadsheet files (Excel, Quattro, 123, SigmaPlot), the SHEETNAME
and CELLRANGE
parameters can be used to read in just a specified section of the data in the file. If CELLRANGE
specifies only the starting cell, IMPORT
reads all columns the from the given column onwards, and all rows from the given row downwards. For example, CELLRANGE
='C8'
reads columns C, D… onwards, and rows 8, 9… downwards, until the end of the data in the sheet. The COLUMNS
parameter can be used to set the names and types of the structures (see below).
In spreadsheet files, the data that are extracted are labels, numerical values and the results from formulae. A label of *
in an otherwise numerical column is taken as a missing value, unless one or more different missing value markers are specified with the MISSING
option. Empty cells are taken as missing values. Empty rows at the start, middle and end of a block are removed. Empty columns are ignored by default; you can set the option KEEPEMPTY=rows
or KEEPEMPTY=columns
to retain empty rows or columns respectively, or KEEPEMPTY=rows,columns
to keep both.
The IMETHOD
option indicates how identifiers are to be specified for the columns, with the following settings:
read |
assumes that the names are in the first non-empty row of data; |
---|---|
supply |
assumes that the names are supplied by the COLUMNS parameter, but uses default names if they are not; and |
none |
uses default names; and |
overlay |
uses the names from the COLUMNS parameter, or from the first non-empty row of data for any of those names that is blank. |
If IMETHOD=read
, and a column name cell contains a numerical value rather than a label, the column name is set to the numerical value prefixed with a % character. The prefix can be changed using the PREFIX
option: a column named '15'
is given the name %15
by default but, if PREFIX='X'
, the name would be X15
.
The default for IMETHOD
is to take the names from the COLUMNS
parameter, if this is set and it contains names. Otherwise IMPORT
looks for names in the data file (as with the read
setting).
The default column names have the prefix C and an integer number (i.e. C1
, C2
etc.), but you can supply your own prefix using the IPREFIX
option.
Using the COLUMNS
parameter, the type of a column can be forced by providing a !
, #
or $
character on the end of the text items provided for the column names. A string '*'
can be given as a name in COLUMNS
, to remove a column from the data read in. If only a single type character is given, only the types of the columns (and not their names) are changed. The extension :D
on a column name specifies that the values are to be read as dates. Similarly, when the column names are being read from a spreadsheet, their types can be specified by using !
for a factor, #
for a variate, $
for a text and :D
for a date.
The option FORDER
controls the order in which that the labels or levels of a factors are stored. with the default, FORDER=sorted
, the levels are stored in ascending numerical order, and the labels are stored in alphabetical order. Alternatively, if FORDER=unsorted
the levels and labels are stored in the order in which they are first met in the column.
The TEXTCONVERSION
option controls how labels are converted to numbers in a column marked as a variate:
strict |
only labels that contain numeric data only are converted (e.g. '10' becomes 10; '1O' becomes * ) |
---|---|
single |
a single character substitution is read as a number (o or O become 0; i , I , l or L become 1; s or S become 2; z or Z become 5; comma becomes decimal point) |
common |
multiple substitutions as in single are made (e.g. 'Io' becomes 10; '23X' becomes * ) |
standard |
as in common but extra text is ignored at the end of the number (e.g. '23X' becomes 23; 'A2X3' becomes * ) |
lax |
any digits are read from the text (e.g. 'A2X3' becomes 23). |
You can set option EMETHOD=read
, to read a row of column descriptions/extra from a spreadsheet file. By default, this row is taken as the second row in CELLRANGE
. The EXTRAROW
option can be used to modify the row form which the description is read. The row number is relative to the start of the cell range, unless a negative row number is provided; the descriptions are then read from the row in the spreadsheet, corresponding to the absolute value of the specified row number. If EXTRAROW=1
, the column names are read from the second row.
The START
parameter can supply a text to indicate where to start reading within a spreadsheet or text file. In a spreadsheet file (Excel, Quattro, Lotus), the cells from A1 are searched row by row, until a label is found that matches the text. Only cells below and to the right of this cell are then imported. The text could thus be the name of the first variable to be read. Note that the text must not contain spaces or the division symbol (/
). Similarly the END
parameter can supply a text to indicate where to stop reading a spreadsheet or text file.
The TEMPMISSING
option controls the input of temporarily missing values. These are values that have been set to missing temporarily in the spreadsheet, and for which the original (non-missing) values are still available. The default is to read the original values, but you can set TEMPMISSING=yes
to read them as missing values instead.
The INOPTIONS
and OUTOPTIONS
options allow extra options to be passed to Dataload.dll
. For example: setting INOPTIONS='/k'
keeps leading and trailing and doubled blanks in strings, OUTOPTIONS='/u'
creates undecorated names in a CSV file (i.e. 'Factor'
, rather than 'Factor!'
), OUTOPTIONS='/c'
combines the three columns Red
, Green
and Blue
in a BMP file into a single column RGB
, and INOPTIONS='/m'
loads the data as a matrix rather than as separate columns.
The RGBMETHOD
option controls how to represent colour values from image files (JPG, GIF, TIF or PNG). The default setting, combined
, stores an RGB value in a single column in the same form as generated by the RGB
function. The separate
setting creates three columns containing the red, green and blue values, respectively. Finally, the matrix
setting puts the RGB values into a matrix.
The ROWSELECTION
and COLSELECTION
parameters allow you to import only a subset of the rows or columns, respectively, in the file. They can be set to a variate containing the numbers of the rows or columns. With COLSELECTION
, you can also supply a text containing column names. So, for example, to import only rows where the variate X
is greater than zero, you could put
ROWSELECTION = WHERE(X.GT.0)
(the WHERE
function gives the unit numbers where a logical expression has the value one i.e. true). Note that the variate X
must already have been imported into Genstat, but you could import this column on its own using COLSELECTION
. If ROWSELECTION
(or COLSELECTION
) are unset, all the rows (or columns) are imported.
The UNICODE
option controls what happens to Unicode characters that are not part of the extended ASCII character set. These may occur, for example, in Excel XLSX files. The default setting, utf8
, converts them into the UTF-8 format. In this format, the ASCII characters are stored in the usual way, in a single byte (of eight binary bits). More complicated characters, such as Chinese and Thai characters, require up to four bytes. UTF-8 characters cause no problems with most of the Genstat commands. The commands that cannot handle them, for example EDIT
, issue a VA-43
fault. The remove
setting removes these from the input. The ascii
option converts them to the nearest matching ASCII character. The typeset
option converts those that can be represented by Genstat typesetting strings by these strings: for example, α would be converted to ~{alpha}, and √ would be converted to ~{sqrt}. The correspondence between the Greek, ASCII and type-setting commands is shown in the table below. The capitals have the same correspondence. Extended Greek and Latin letters have their accents removed, as there are no type-setting commands for these. Some symbols like ♂ and ♀ are converted to their text equivalent (male and female).
α |
a |
~{alpha} |
ι |
I |
~{iota} |
ρ |
r |
~{rho} |
β |
b |
~{beta} |
κ |
k |
~{kappa} |
σ |
s |
~{sigma} |
γ |
g |
~{gamma} |
λ |
l |
~{lambda} |
τ |
t |
~{tau} |
δ |
d |
~{delta} |
μ |
m |
~{mu} |
υ |
u |
~{upsilon} |
ε |
e |
~{epsilon} |
ν |
n |
~{nu} |
φ |
f |
~{phi} |
ζ |
z |
~{zeta} |
ξ |
c |
~{xi} |
χ |
x |
~{chi} |
η |
h |
~{eta} |
ο |
o |
~{omicron} |
ψ |
y |
~{psi} |
θ |
q |
~{theta} |
π |
p |
~{pi} |
ω |
w |
~{omega} |
The COLUNICODENAMES
option controls how column names that contain Unicode characters are used. With the default setting, suffix
, a pointer is defined to hold any columns with Unicode in their names, and the column names provide its suffix labels. The name of the pointer is specified in a text by the UNINAME
option (default ‘C
‘). The extra
setting uses the default names for the columns, and the column names from the file are used as extra texts. It also sets the IPRINT
attribute of the columns to extra
, so that these are printed instead of the default identifiers. (You can modify this to print the default idendifiers instead, by using the Identifying information used in output list box for those columns in the spreadsheet Column Attributes/Format menu.) The ignore
setting removes the Unicode characters from the name.
(Note: IMPORT
replaces the procedure DATALOAD
from earlier editions of Genstat.)
Options: PRINT
, OUTTYPE
, METHOD
, IMETHOD
, ENDSTATEMENT
, SPSSMV
, MISSING
, FORDER
, TEXTCONVERSION
, KEEPEMPTY
, NAMEROW
, EMETHOD
, EXTRAROW
, PREFIX
, TEMPMISSING
, INOPTIONS
, OUTOPTIONS
, RGBMETHOD
, SEPARATORS
, SCOPE
, IPREFIX
, TRANSPOSE
, UNICODE
, COLUNICODENAMES
, UNINAME
.
Parameters: FILE
, OUTFILE
, SHEETNAME
, CELLRANGE
, COLUMNS
, ISAVE
, START
, END
, ANCILLARY
, ROWSELECTION
, COLSELECTION
.
Method
The request is passed to the DATALOAD.DLL
library which reads the foreign file and returns any valid data found in a temporary GEN
or GSH
file. The following file types are supported: Excel 2-5, 95, 97, 2000, XP, 2003, 2007-2013, Open Office, Lotus WK1, Quattro (WQ1, WB*, QPW), dBase 2-5, Paradox 3-9, Genstat GSH and GWB, SAS PC 6.03-12, 7-9, SAS Transport, SAS JMP, Minitab 8-17, Statistica 5 and 6, Systat, MStat, Instat, Epi-Info, SPSS/Win, Gauss Data/Matrix (PC/Win/Unix), MatLab, S+ (PC/Unix), Stata 4-8, StatGraphics, R data frames, Weka Attribute files, SigmaPlot 7-9, OSIRIS, Limdep, Comma delimited text files (*.CSV), Cornell Ecology format, MapQTL trait files (.QUA), ArcView/Info Shapefiles, MapInfo Exchange files, Windows Bitmap (*.BMP), Windows Sound (*.WAV), NMR Binary files and image files (JPG, GIF, TIF, PNG). The file type is worked out from the file contents, so the usual extension need not be used with the exception of the following file types which do not contain a unique signature: Epi-Info (.REC), S+ (.SDD) and Paradox (.DB). Any files not containing a unique file signature, but ending in these extensions will be classified as above. Any other file extensions will attempted to be read as a comma, space or tab delimited text file.
There is a known problem that using the OUTTYPE=GEN
inside a FOR
loop (or another other procedure) ties up input channels until exiting the FOR
loop. Thus it may exhaust the available input channels. Either use the OUTTYPE=GSH
or set LOAD=no
and write code to input the files created outside the loop (you will need to provide an output file name to do this).
Action with RESTRICT
Restrictions are not applicable to any of the parameters.
See also
Directive: SPLOAD
.
Procedures: EXPORT
, GRIBIMPORT
.
Commands for: Input and output.
Example
CAPTION 'AUDISPLAY example',\ 'Data from Genstat 5 Release 1 Reference Manual, page 340.';\ STYLE=meta,plain FACTOR [NVALUES=36; LEVELS=3; VALUES=12(1...3)] Block FACTOR [NVALUES=36; LABELS=!t(baresoil,emerald,emergo)] Leachate & [LABELS=!t('1','1/4','1/16','1/64')] Dilution VARIATE [NVALUES=36] Nhatch,Nnohatch READ Leachate,Dilution,Nhatch,Nnohatch 1 2 109 318 3 4 54 350 3 1 * 415 2 2 783 212 3 3 652 1375 2 4 490 816 1 3 95 1219 2 1 1012 66 1 4 166 943 3 2 1059 313 1 1 257 1006 2 3 1058 234 2 4 507 1119 1 2 194 840 1 3 175 1707 1 1 326 609 3 4 142 980 2 3 286 230 3 2 546 313 2 2 * 301 2 1 2471 112 3 3 76 489 1 4 208 503 3 1 * 325 1 1 322 913 1 2 255 2246 3 2 1774 1446 2 2 999 193 2 4 388 1836 3 4 221 1800 1 3 220 1902 2 1 2821 187 3 1 1486 463 3 3 717 1473 1 4 143 941 2 3 968 550 : CALCULATE Logit%h = LOG(Nhatch/Nnohatch) BLOCKSTRUCTURE Block TREATMENTSTRUCTURE Leachate*Dilution AUNBALANCED [PRINT=*] Logit%h AUDISPLAY