ENH: Add support for reading 110-format Stata dta files #58044

cmjcharlton · 2024-03-28T13:56:30Z

closes ENH: Add support for reading Stata 7 (non-SE) format dta files #47176
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This change adds support for reading 110-format (Stata 7) dta files. A test data file is included in the same style as those for the other supported versions.
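As background, a reader has to determine the format version before it can choose a parsing path. The sketch below is a hypothetical helper (`sniff_dta_version` is not a pandas function). It assumes that pre-117 files store the version number in the first byte and a byte-order flag in the second (1 for big-endian, 2 for little-endian), and that 117 and later files begin with a tagged header.

```python
import re


def sniff_dta_version(head: bytes) -> tuple[int, str]:
    """Return (format_version, byteorder) from the first bytes of a dta file.

    byteorder is "<" for little-endian and ">" for big-endian.
    Hypothetical helper for illustration only.
    """
    if head.startswith(b"<stata_dta>"):
        # Formats 117 and later use a tagged, xml-style header.
        release = re.search(rb"<release>(\d+)</release>", head)
        order = re.search(rb"<byteorder>(MSF|LSF)</byteorder>", head)
        if release is None or order is None:
            raise ValueError("incomplete tagged header")
        byteorder = ">" if order.group(1) == b"MSF" else "<"
        return int(release.group(1)), byteorder
    if len(head) < 2:
        raise ValueError("need at least two bytes")
    # Assumption: byte 0 is the version number and byte 1 the byte-order
    # flag (1 = big-endian, 2 = little-endian) in the older formats.
    return head[0], ">" if head[1] == 1 else "<"
```

With this layout, a little-endian 110-format file would start with the bytes `0x6e 0x02`.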

mroeschke

Could you add a whatsnew note to v3.0.0.rst under Other Enhancements?

cmjcharlton · 2024-03-28T18:39:44Z

I have now added the whatsnew line as requested.

jbrockmendel · 2024-03-28T22:39:10Z

cc @bashtage

bashtage · 2024-03-29T03:18:59Z

Do we have documentation that there are no differences between 110 and 111? This seems to be the assumption here.

cmjcharlton · 2024-03-29T09:14:01Z

There is a difference between 110 and 111: the 110 format uses the older typlist codes, which limit string variables to a maximum of 80 characters. There is official documentation for the 110 format in the Stata 7 manual:

dta_110.txt

However, I have not been able to track down any documentation for the already-supported 111 format to compare against.

bashtage

Small comment. Is it really the case that there are no differences between 108 and 110 aside from the version number? Have you done a diff of the documentation to verify?

bashtage · 2024-03-29T03:17:52Z

pandas/io/stata.py

@@ -1407,7 +1407,7 @@ def _read_old_header(self, first_char: bytes) -> None:
        self._time_stamp = self._get_time_stamp()

        # descriptors
-        if self._format_version > 108:
+        if self._format_version > 110:


Why was this increased?

The check previously applied the newer typlist encoding to every format above 108, whereas the 110 format kept using the older encoding (presumably because string variables in Small and Intercooled Stata 7 were still limited to 80 characters). To the best of my knowledge, this is the only difference between the 110 and 111 formats.

bashtage · 2024-04-09T20:00:31Z

pandas/io/stata.py

@@ -1408,7 +1408,7 @@ def _read_old_header(self, first_char: bytes) -> None:
        self._time_stamp = self._get_time_stamp()

        # descriptors
-        if self._format_version > 108:
+        if self._format_version > 110:


Can we switch this logic to >= and use only versions that have explicit support?

Yes, I'll update it to >= 111.
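For illustration, the inclusive version check agreed on above could look like the following. This is a minimal sketch, not the code in pandas/io/stata.py: the names `decode_typlist` and `OLD_NUMERIC_CODES` are hypothetical, and the specific old type-code values (single characters for numeric types, 127 plus the length for strings) are assumptions that should be checked against dta_110.txt.

```python
# Assumed mapping from the old single-character numeric type codes
# ('b', 'i', 'l', 'f', 'd') to the newer numeric codes 251..255.
OLD_NUMERIC_CODES = {
    ord("b"): 251,  # byte
    ord("i"): 252,  # int
    ord("l"): 253,  # long
    ord("f"): 254,  # float
    ord("d"): 255,  # double
}


def decode_typlist(raw: bytes, format_version: int) -> list[int]:
    """Normalise a typlist record to the newer numeric type codes."""
    if format_version >= 111:
        # Newer encoding: the stored bytes are already the type codes.
        return list(raw)
    decoded = []
    for code in raw:
        if code in OLD_NUMERIC_CODES:
            decoded.append(OLD_NUMERIC_CODES[code])
        elif code >= 128:
            # Assumption: old string types are stored as 127 + length.
            decoded.append(code - 127)
        else:
            raise ValueError(f"unrecognised old type code {code}")
    return decoded
```

Comparing with `>= 111` rather than `> 110` names the first version that is known to use the newer encoding, so an unsupported intermediate version number cannot silently fall into either branch by accident of ordering.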

cmjcharlton · 2024-04-09T20:54:09Z

Small comment. Is it really the case that there are no differences between 108 and 110 aside from the version number? Have you done a diff of the documentation to verify?

The differences between 108 and 110 are that the maximum variable name length was increased from 8 to 32 characters (https://www.stata.com/stata7/language.html#longnames) and the expansion record length field was increased from 2 to 4 bytes. Display formats also began to allow European decimals (https://www.stata.com/stata7/language.html#andmore), but that makes no difference to the file structure.
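The change in the expansion record length field can be sketched as follows. This is a hedged illustration rather than the pandas implementation: `skip_expansion_fields` is a hypothetical name, and the record layout (a 1-byte type, then a 2-byte or 4-byte length, then the payload, terminated by a record whose type is zero) is an assumption based on the format documentation.

```python
import io
import struct


def skip_expansion_fields(buf, format_version: int, byteorder: str = "<") -> int:
    """Skip all expansion fields and return how many were skipped.

    Assumes each field is: 1-byte type, length (2 bytes before format 110,
    4 bytes from 110 onwards), then `length` bytes of payload. A field with
    type 0 terminates the list.
    """
    len_fmt = byteorder + ("i" if format_version >= 110 else "h")
    len_size = struct.calcsize(len_fmt)
    skipped = 0
    while True:
        data_type = buf.read(1)
        raw_len = buf.read(len_size)
        if len(data_type) != 1 or len(raw_len) != len_size:
            raise ValueError("truncated expansion field")
        if data_type == b"\x00":
            return skipped
        (data_len,) = struct.unpack(len_fmt, raw_len)
        buf.seek(data_len, io.SEEK_CUR)  # discard the payload
        skipped += 1
```

Reading a 110-format file with the 2-byte width would misalign everything after the first expansion field, which is why the width has to follow the format version.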

My understanding is that the 110 and 111 formats are the same, other than different typlist encodings, which would make sense as they're both for Stata 7, but for editions with different limits. It appears that Stata 7/SE was released around a year after Stata 7/IC and Small Stata 7 (see 01feb2002 entry in https://www.stata.com/help.cgi?whatsnew7), which might explain the lack of documentation.

Assuming that the 111 format is implemented correctly, it seems that the 113 format is the same as 111, except that the values used to encode missing values were changed to allow 26 additional missing codes (see https://www.stata.com/help.cgi?whatsnew7to8).
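To make the missing-value change concrete, here is a small sketch of how the 27 codes could be recovered for integer types in format 113 and later. The helper name `missing_label` is hypothetical, and the starting values in `FIRST_MISSING` are assumptions (the top 27 values of each integer type are taken to be reserved, which would also explain a reduced valid range); they are not stated in this thread.

```python
import string
from typing import Optional

# Assumed value that encodes "." for each integer type in format 113+.
FIRST_MISSING = {"byte": 101, "int": 32741, "long": 2147483621}

# ".", then ".a" through ".z": 27 labels in total.
MISSING_LABELS = ["."] + ["." + c for c in string.ascii_lowercase]


def missing_label(value: int, var_type: str) -> Optional[str]:
    """Return the missing-value label for `value`, or None if it is valid data."""
    offset = value - FIRST_MISSING[var_type]
    if 0 <= offset < len(MISSING_LABELS):
        return MISSING_LABELS[offset]
    return None
```

Under these assumptions, a stored byte of 101 reads as `.` and 127 reads as `.z`, while 100 is ordinary data.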

cmjcharlton · 2024-04-11T11:49:22Z

In case it helps, here are the changes that I have determined between each format version (excluding changes to the display format codes) from looking at the available documentation:

102 (confirmed as undocumented but can be inferred from the next version, the Stata 1 manual and a "history of Stata" article)

Data is stored in little-endian format
Supports 2-byte and 4-byte integer variables
Supports 4-byte and 8-byte floating-point variables
Number of variables stored in 2-byte integer
Number of observations stored in 2-byte integer
Variable and value label names up to 9 characters (including null terminator)
Data and variable labels up to 32 characters (including null terminator)
Value labels up to 8 characters (null terminator is omitted if label is 8 characters)
Variable format information up to 7 characters (including null terminator)
Valid string characters are ASCII codes 1-127
Single missing value supported for each variable type (.)

103 (documented in Stata 2 manual)

Allow choice of little-endian or big-endian byte ordering
Number of observations stored in 4-byte integer
Added str1 to str80 string variable types

104 (documentation not yet located - probably in Stata 3 manual)

Added byte variable type

105 (documented in Stata 4 and 5 manuals)

Added 0 or 17 character time-stamp record stored in 18 characters (including null terminator)
Added expansion fields with 2-byte integer records (used to store variable characteristics)
Storage for variable format information increased to 12 characters (including null terminator)

108 (documented in Stata 6 manual)

Valid string characters are ASCII codes 1-255
Data and variable label length increased to 81 characters (including null terminator)
Value label length is no longer fixed
Underlying missing value code changed for double type

110 (documented in Stata 7 manual)

Maximum variable and value label name increased to 33 characters (including null terminator)
Expansion record length field size increased from 2-byte to 4-byte integer

111 (documentation not found - maybe in Stata 7/SE manual if this exists)

Variable type codes changed to increase maximum size string variable type to str244

113 (documented in on-line help)

Maximum value range reduced for integer types
Allow multiple missing value codes (., .a, ..., .z) to be supported

114 (documented in on-line help)

Storage for variable format information increased to 49 characters (including null terminator)

115 (documented in on-line help)

Same as 114 (version number increased due to introduction of %tb business date format)

117 (documented in on-line help)

Tagged xml-style structure
Timestamp and dataset labels store their length as first byte, and no longer include null terminator
Stores 8-byte location map for each component of the file structure (after initial release the "varlabs" field was set to zero, but fixed in a subsequent update)
Variable type codes changed to increase maximum fixed string variable type to str2045 and introduce strL types
Removes generic expansion fields and replaces them with "characteristics" component
GSO v field stored as 4-byte integer
GSO o field stored as 4-byte integer
(v,o) for strL variables packed as (4,4) bytes

118 (documented in on-line help)

Number of observations stored in 8-byte integer
Dataset label length stored in 2-byte integer
Storage for variable format information increased to 57 characters (including null terminator)
Strings are now stored as UTF-8, increasing allocation reserved for each character from one byte to four bytes (still null terminated with single character)
GSO o field increased to 8-byte integer
(v,o) for strL variables packed as (2,6) bytes

119 (documented in on-line help)

Number of variables stored in 4-byte integer (as a consequence srtlist records increase from 2-byte to 4-byte integers)
(v,o) for strL variables packed as (3,5) bytes

120 (documented in on-line help)

As 118, but adds alias variable type

121 (documented in on-line help)

As 119, but adds alias variable type
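The list above can be turned into lookups that a reader consults once the format version is known. The following is a minimal sketch covering only a few of the listed fields. The function and table names are hypothetical, and `split_strl_key` additionally assumes a little-endian file in which v occupies the leading bytes of the 8-byte pair.

```python
def header_field_sizes(version: int) -> dict[str, int]:
    """Byte widths for a few header fields, following the list above."""
    sizes = {}
    # Number of variables: 4 bytes in 119 (and 121, described as "as 119").
    sizes["nvar"] = 4 if version in (119, 121) else 2
    # Number of observations: 2 bytes in 102, 4 from 103, 8 from 118.
    if version == 102:
        sizes["nobs"] = 2
    elif version >= 118:
        sizes["nobs"] = 8
    else:
        sizes["nobs"] = 4
    # Variable format information, including the null terminator.
    if version < 105:
        sizes["format"] = 7
    elif version < 114:
        sizes["format"] = 12
    elif version < 118:
        sizes["format"] = 49
    else:
        sizes["format"] = 57
    return sizes


# Bytes used for v in the packed (v, o) pair; o takes the remaining bytes.
# 120 and 121 are taken to match 118 and 119 respectively.
STRL_V_BYTES = {117: 4, 118: 2, 119: 3, 120: 2, 121: 3}


def split_strl_key(packed: bytes, version: int) -> tuple[int, int]:
    """Split an 8-byte strL reference into (v, o).

    Assumption: little-endian file, with v stored before o.
    """
    if len(packed) != 8:
        raise ValueError("expected exactly 8 bytes")
    v_bytes = STRL_V_BYTES[version]
    v = int.from_bytes(packed[:v_bytes], "little")
    o = int.from_bytes(packed[v_bytes:], "little")
    return v, o
```

Keeping the widths in one place makes it easier to see which versions have explicit support, in line with the review request to compare against known versions only.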

bashtage · 2024-05-08T18:03:07Z

Can you rebase and ping on green?

cmjcharlton · 2024-05-08T21:19:08Z

I have now rebased this, and all checks pass.

bashtage · 2024-05-08T22:37:23Z

Thanks. LGTM.

mroeschke reviewed Mar 28, 2024

mroeschke added the IO Stata read_stata, to_stata label Mar 28, 2024

bashtage requested changes Apr 9, 2024

cmjcharlton requested a review from bashtage April 16, 2024 09:56

cmjcharlton added 5 commits May 8, 2024 21:27

ENH: Add support for reading 110-format Stata dta files

3533134

Add whatsnew note to v3.0.0.rst

48f98f0

Add a test data file containing value labels

605924b

Compare version number inclusively when determining whether to use old or new typlist version

524c28b

Add a big-endian version of the test data set

ee3bae8

cmjcharlton force-pushed the stata-read-dta110 branch from 0028ff9 to ee3bae8 May 8, 2024 20:32

bashtage approved these changes May 8, 2024

bashtage merged commit d62d77b into pandas-dev:main May 8, 2024
47 checks passed
