Reads data from a file (P.W. Lane).
Options
PRINT = string tokens | What output to display (summary, groups, comments, firstline); default summ, grou, comm, firs |
---|---|
NAME = text | External name of the data file; no default in batch mode, name is prompted for in interactive mode |
END = text | What string terminates data; default ':' (the end of file also terminates data for any setting); the setting END=* is not allowed |
MISSING = text | What character represents missing values; default '*' |
SKIP = scalar or text | Number of lines to skip at the start of the file, or string to indicate the record before the first record of data; default 0 |
MAXCATEGORY = number | The maximum number of categories for which a structure is defined to be a factor unless otherwise specified by FGROUPS; default 10 |
COMMENTSYMBOLS = text | What characters to treat as introducing comments if found in the first column at the start of the file; default double-quote character (") |
IMETHOD = string token | How identifiers are to be specified for the data structures to be read (supply, read, none); default supp |
ISAVE = pointer | To store the identifiers, whether read or supplied, and to provide suffixed identifiers for data with no specified identifiers |
SEPARATOR = text | What (single) character separates successive values; default is the space character |
Parameters
IDENTIFIER = identifiers | Names for the data structures that are to be read; these are prompted for if this is unset when running interactively with IMETHOD=supply; identifiers are redefined if they have been used previously |
---|---|
FGROUPS = string tokens | Whether to form each data structure into a factor (check, form, leave); default chec, which causes FILEREAD when running interactively to ask about any structure whose number of distinct values is less than or equal to MAXCATEGORY, and when running in batch to define as factors all structures with MAXCATEGORY or fewer distinct values (note: for compatibility with earlier releases, yes and no can be used as synonyms of form and leave) |
REPRESENTATION = string tokens | What representation to assume for each data structure (numbers, characters); default unset, in which case the representation is determined by whether the first value is a number; if set for one structure, this parameter must be set for all structures |
Description
FILEREAD reads data from a file into suitable structures determined from the data. It can deal with values laid out as follows.
(1) A character file: that is, a normal readable file, or flat file.
(2) Maximum record length of 200 characters.
(3) Contents consist of values for one or more data structures – usually presented as a single rectangular data matrix.
(4) The values for the data structures are recorded in parallel – that is, the first values of all the structures, followed by the second values of all, and so on; usually, each record of the file contains one value of each structure, but multiple values per record and multiple records for each unit can also be dealt with.
(5) Values in a record are separated from each other by the same separator – usually one or more spaces.
(6) Text values must be enclosed in single quotes if they contain a space, comma, backslash, or double quote; single quotes must be used only to enclose textual values, or be duplicated as part of a value that is itself enclosed in single quotes.
(7) Comments are allowed at the start of the file only if every record to be treated as a comment starts with a double quote, or other specified symbol. Alternatively, a specified number of records at the start of the file can be skipped, or any number of records up to and including a specified string.
(8) Identifiers for the columns of the matrix can be read from the first row of data, as long as they are valid, unsuffixed, Genstat identifiers. An exclamation mark after an identifier signals that the structure is to be set up as a factor.
Information may be numerical or textual. Numerical values are read as variates, and textual values as texts, as determined by the values in the first complete record or by the REPRESENTATION parameter. If this parameter is unset, FILEREAD searches for the first record in the file with no missing values, and fails if there is no such record. If the REPRESENTATION parameter is set, it determines whether the values of each structure are to be treated as numbers or characters; if set for any structure, this parameter must be set for all of them.
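The detection rule described above can be illustrated outside Genstat. The following Python sketch is a model of that rule, not the procedure's actual code. The names `is_number` and `detect_representation` are hypothetical, records are assumed to be already split into values, and a value is assumed to count as a number whenever Python's `float` accepts it (Genstat's own number syntax may differ).

```python
def is_number(value):
    """Return True if the value parses as a number (assumption: float() syntax)."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def detect_representation(records, missing="*"):
    """Classify each column from the first record with no missing values.

    Raises ValueError when every record contains a missing value,
    mirroring the failure described in the text.
    """
    for record in records:
        if missing not in record:
            return ["numbers" if is_number(v) else "characters" for v in record]
    raise ValueError("no record without missing values")


# The first record has a missing value, so the second one decides the types.
records = [["1.5", "*", "a"], ["2.0", "7", "b"]]
print(detect_representation(records))
```

Setting REPRESENTATION corresponds to bypassing this search altogether, which is why it must then be given for every structure.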
The NAME option of the procedure supplies the name of the file, enclosed in single quotes. In batch mode the name must be supplied, but in interactive mode, FILEREAD will prompt for the name if it is not supplied.
The IMETHOD option controls the specification of identifiers for the structures to be read. With the default, IMETHOD=supply, the identifiers can be listed using the IDENTIFIER parameter, one for each column of the data matrix. If IDENTIFIER is not set when running in interactive mode, FILEREAD will prompt for identifiers; if it is unset when running in batch mode, FILEREAD just reports on the contents of the file, unless option ISAVE is set (see below). If IMETHOD=read, FILEREAD will attempt to read identifiers for the data structures from the first complete record in the file (and the IDENTIFIER parameter is ignored). They must be valid Genstat identifiers, and must not include suffixes. If an exclamation mark is found after (or in) an identifier, then the structure will be set up as a factor unless the FGROUPS parameter is set to leave. (This convention matches that used when data is read into a Genstat spreadsheet using menus.) If IMETHOD=none, FILEREAD just reports on the contents of the file without assigning identifiers unless option ISAVE is set.
The ISAVE option can be set to a pointer to store the identifiers read from the file (if IMETHOD=read) or supplied interactively (if IMETHOD=supply). If IMETHOD=none in either mode, or IMETHOD=supply and the IDENTIFIER parameter is unset in batch mode, the data structures can be referred to using the pointer.
Values on the same record of a file must be separated from each other by at least one space unless the SEPARATOR option is set. This option can nominate any single character to be treated as the data separator. The MISSING and END options specify the missing-value symbol and the string that terminates the data.
If the number of identifiers is not specified, the number of data structures is taken to be the number of values on the first record with no missing values. But if identifiers are supplied in the IDENTIFIER parameter, or read from the data file, it is possible to read several units of data from each record, or each unit from several records. If there are more values on the first record of data than there are identifiers, the type of each data structure can be determined only by its first value: FILEREAD will fail if any first value is missing, unless the REPRESENTATION parameter is set. If there are fewer values on the first record of data than there are identifiers, FILEREAD will fail regardless of the absence of missing values unless the REPRESENTATION parameter is set.
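The parallel layout described above, in which values for the structures follow each other in turn regardless of where records end, can be modelled in a few lines. This Python sketch is illustrative only: `read_parallel` is a hypothetical name, values are kept as strings, and raising an error when the final unit is incomplete is an assumption rather than documented FILEREAD behaviour.

```python
def read_parallel(records, identifiers, separator=None):
    """Deal values out to the identifiers in turn, ignoring record boundaries.

    separator=None splits on runs of whitespace, matching the default of
    one or more spaces; otherwise a single separator character is used.
    """
    values = [v for record in records for v in record.split(separator)]
    n = len(identifiers)
    if len(values) % n != 0:
        # Assumption: an incomplete final unit is treated as an error.
        raise ValueError("number of values is not a multiple of the number of identifiers")
    return {name: values[i::n] for i, name in enumerate(identifiers)}


# Two units on the first record and one on the second.
print(read_parallel(["1 a 2 b", "3 c"], ["A", "B"]))
```

With a comma separator, the same call would be `read_parallel(["1,a,2,b"], ["A", "B"], separator=",")`.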
By default, FILEREAD reports what structures are set up and tabulates the number of values in each category for structures that have MAXCATEGORY or fewer distinct values. It also displays any comments that it identifies before the start of the data, and the first record of data that contains no missing values. These four reports are controlled by the PRINT option.
The FGROUPS parameter allows structures to be formed automatically into factors. The default setting is check: in interactive mode, FILEREAD then prompts for a decision about any structure where the number of distinct values is less than or equal to the setting of the MAXCATEGORY option; in batch mode, all structures with this few distinct values become factors automatically. FGROUPS can also be set to form or leave to specify explicitly whether each structure should or should not be defined automatically as a factor. (The settings form and leave were introduced in Procedure Library PL21 to remove the confusion arising from the fact that other options and parameters with no as a setting use no as their default. However, for compatibility with earlier programs, the settings yes and no are still recognised as synonyms for form and leave.)
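The batch-mode decision described above reduces to a small rule. The Python sketch below is a model of that rule rather than the procedure's own code: `becomes_factor` is a hypothetical name, the interactive prompt is not modelled, and excluding missing values from the count of distinct values is an assumption.

```python
def becomes_factor(values, fgroups="check", maxcategory=10, missing="*"):
    """Decide whether a structure is formed into a factor (batch-mode model)."""
    if fgroups in ("form", "yes"):   # yes is the older synonym of form
        return True
    if fgroups in ("leave", "no"):   # no is the older synonym of leave
        return False
    # fgroups == "check": compare the number of distinct values with MAXCATEGORY.
    distinct = {v for v in values if v != missing}  # assumption: missing not counted
    return len(distinct) <= maxcategory


print(becomes_factor(["a", "b", "a", "*"], maxcategory=2))       # two distinct values
print(becomes_factor([str(i) for i in range(11)]))               # eleven distinct values
print(becomes_factor([str(i) for i in range(11)], fgroups="form"))
```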
The COMMENTSYMBOLS option can be set to a list of single characters, in quotes. If any of these characters is found at the start of a record, before any data has been read, that record will be treated as a comment. By default, the double-quote symbol is the only comment symbol, and it must appear at the start of every record that is to be treated as a comment.
The SKIP option allows records at the start of the file to be skipped altogether. It can be set either to the number of records to be skipped, or to a string, indicating that all records are to be skipped up to and including the first record containing that string.
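The combined effect of SKIP and COMMENTSYMBOLS on the start of a file can be sketched as follows. This is a Python model written under stated assumptions, not FILEREAD's implementation: `data_records` is a hypothetical name, leading blank records are dropped along with comments, and raising an error when the SKIP string is never found is an assumption.

```python
def data_records(lines, skip=0, comment_symbols='"'):
    """Return the records left after skipping and removing leading comments."""
    if isinstance(skip, int):
        # Numeric SKIP: drop that many records from the start.
        remaining = lines[skip:]
    else:
        # Text SKIP: drop everything up to and including the first match.
        index = next((i for i, line in enumerate(lines) if skip in line), None)
        if index is None:
            raise ValueError("skip string not found")  # assumption
        remaining = lines[index + 1:]
    symbols = tuple(comment_symbols)
    start = 0
    # Comments are recognised only before any data has been read.
    while start < len(remaining):
        line = remaining[start]
        if line.strip() == "" or line.startswith(symbols):
            start += 1
        else:
            break
    return remaining[start:]


lines = ["header 1", "header 2", "! a note", "", "1 2", "3 4"]
print(data_records(lines, skip=2, comment_symbols="!"))
```

A record that starts with a comment symbol after data has begun is left in place, since comments are only recognised at the start of the file.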
Options: PRINT, NAME, END, MISSING, SKIP, MAXCATEGORY, COMMENTSYMBOLS, IMETHOD, ISAVE, SEPARATOR.
Parameters: IDENTIFIER, FGROUPS, REPRESENTATION.
Method
The file is opened on the first free input channel. The first record is read as a single string, and then individual items are read from the string into a text. This is tested, and the process repeated until a record has been found that is not blank or a comment, and has no missing items. Items are tested to determine if they are valid numbers, and then the whole file is read into variates and texts as appropriate. Each structure is grouped to provide information about numbers of categories.
See also
Directive: READ.
Procedures: IMPORT, DBIMPORT, SPLOAD.
Commands for: Input and output.
Example
CAPTION 'FILEREAD example',\
        !t('No example can be provided for FILEREAD because',\
        'it needs an external datafile. However, the examples',\
        'below show ways of calling the procedure.'); STYLE=meta,plain
SCALAR Chan
ENQUIRE Chan; FILETYPE=output; OUTSTYLE=Style
OUTPUT [STYLE=plain]
PRINT [SQUASH=yes] '\
FILEREAD
- In interactive mode, the procedure will prompt for a file name and
  identifiers, and whether to turn into factors any structures found
  to have 10 or fewer categories. This statement will fail in batch mode.

FILEREAD [NAME=''abc.dat'']
- In interactive mode, as above for file abc.dat. In batch mode, report
  the contents of file abc.dat; however the data cannot then be referred
  to as no identifiers are given.

FILEREAD [NAME=''abc.dat''; IMETHOD=read; ISAVE=data]
- In either mode, read identifiers from the first complete data record,
  then deal with factors as above. The data can be referred to either by
  the identifiers read, or as data[1], data[2] and so on.

FILEREAD [PRINT=*; NAME=''abc.dat''; SKIP=5; COMMENT=''!''] A,B,C
- Read the data in file abc.dat into variates or text structures called
  A, B and C, without any reports. The first five records are skipped,
  and any subsequent records beginning with exclamation mark until an
  uncommented record with data is found. Formation of factors is dealt
  with as above.

FILEREAD [NAME=''abc.dat''] X,Y,F,Z; FGROUPS=leave,leave,form,leave
- Read the data in file abc.dat into data structures called X, Y, F
  and Z, redefining F to be a factor.

FILEREAD [NAME=''abc.dat''; SEPARATOR='',''] A,B; REPRESENT=characters
- Read the data in file abc.dat into data structures called A and B,
  assuming that values on the same record are separated by commas, and
  that both structures are to be texts. Each record may contain any
  number of values of the structures, as long as they are in parallel:
  that is, one for A, then one for B, and so on.

FILEREAD [NAME=''abc.dat''; SKIP=''DATA3''; END=''DATA''] A,B
- Read the data in file abc.dat into data structures called A and B,
  starting from the first record after the record containing the string
  DATA3, and finishing with the next record containing the string DATA.
'; JUST=left; JUST=left; SKIP=0
OUTPUT [STYLE=#Style]
SKIP [FILETYPE=output] 1