ENH: Add support for reading 110-format Stata dta files #58044

Merged
bashtage merged 5 commits into pandas-dev:main on May 8, 2024

cmjcharlton
Contributor

This change adds support for reading 110-format (Stata 7) dta files. A test data file is included, in the same style as those for the other supported versions.

Member

@mroeschke mroeschke left a comment

Could you add a whatsnew note to v3.0.0.rst under Other Enhancements?

@mroeschke mroeschke added the IO Stata read_stata, to_stata label Mar 28, 2024
@cmjcharlton
Contributor Author

I have now added the whatsnew line as requested.

@jbrockmendel
Member

cc @bashtage

@bashtage
Contributor

Do we have documentation that there are no differences between 110 and 111? This seems to be the assumption here.

@cmjcharlton
Contributor Author

cmjcharlton commented Mar 29, 2024

There is a difference between 110 and 111: the 110 format uses the older typlist codes, which limit string variables to a maximum of 80 characters. Official documentation for the 110 format is in the Stata 7 manual:

dta_110.txt

However, I have not been able to track down any documentation for the already supported 111 format to compare it with.

Contributor

@bashtage bashtage left a comment

Small comment. Is it really the case that there are no differences between 108 and 110 aside from the version number? Have you done a diff of the documentation to verify?

@@ -1407,7 +1407,7 @@ def _read_old_header(self, first_char: bytes) -> None:
         self._time_stamp = self._get_time_stamp()

         # descriptors
-        if self._format_version > 108:
+        if self._format_version > 110:
Contributor

Why was this increased?

Contributor Author

The check previously applied the newer typlist encoding to every format above 108, whereas the 110 format kept using the older encoding (presumably because string variables in Small and Intercooled Stata 7 were still limited to 80 characters). To the best of my knowledge, this is the only difference between the 110 and 111 formats.
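To illustrate the two typlist encodings discussed here, the following is a minimal sketch rather than the pandas implementation. The code values are assumptions based on my reading of the legacy format (ASCII letters for numeric types and 127 + N for strN in the older encoding; 251 to 255 for numeric types and N itself for strN in the newer one) and should be checked against dta_110.txt. The function name `decode_typlist` is hypothetical.

```python
# Older encoding (assumed): numeric types are ASCII letters, strN is 127 + N.
OLD_NUMERIC_CODES = {
    ord("b"): "byte",
    ord("i"): "int",
    ord("l"): "long",
    ord("f"): "float",
    ord("d"): "double",
}

# Newer encoding (assumed): numeric types are 251..255, strN is N itself.
NEW_NUMERIC_CODES = {
    251: "byte",
    252: "int",
    253: "long",
    254: "float",
    255: "double",
}


def decode_typlist(raw: bytes, format_version: int) -> list[str]:
    """Translate raw typlist bytes into readable type names.

    Hypothetical helper covering only the pre-117 (untagged) formats.
    """
    if format_version >= 117:
        raise ValueError("formats 117 and later use different type codes")

    # Only formats with explicit support for the newer codes use them.
    uses_new_codes = format_version >= 111

    names = []
    for code in raw:  # iterating over bytes yields integers
        if uses_new_codes:
            if code in NEW_NUMERIC_CODES:
                names.append(NEW_NUMERIC_CODES[code])
            elif 1 <= code <= 244:
                names.append(f"str{code}")
            else:
                raise ValueError(f"unknown type code {code}")
        else:
            if code in OLD_NUMERIC_CODES:
                names.append(OLD_NUMERIC_CODES[code])
            elif 128 <= code <= 207:
                names.append(f"str{code - 127}")  # str1 .. str80
            else:
                raise ValueError(f"unknown type code {code}")
    return names
```

Under these assumptions, the byte 207 decodes as str80 in a 110-format file, while the same byte in a 111-format file would be read as str207, which is why the version check matters.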

@@ -1408,7 +1408,7 @@ def _read_old_header(self, first_char: bytes) -> None:
         self._time_stamp = self._get_time_stamp()

         # descriptors
-        if self._format_version > 108:
+        if self._format_version > 110:
Contributor

Can we switch this logic to >= and use only versions that have explicit support?

Contributor Author

Yes, I'll update it to >= 111.

@cmjcharlton
Contributor Author

> Small comment. Is it really the case that there are no differences between 108 and 110 aside from the version number? Have you done a diff of the documentation to verify?

The differences between 108 and 110 are that the maximum variable name length was increased from 8 to 32 characters (https://www.stata.com/stata7/language.html#longnames) and the expansion record length field was increased from 2 to 4 bytes. Display formats could also now use European decimals (https://www.stata.com/stata7/language.html#andmore), but that makes no difference to the file structure.
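As a rough illustration of what these two changes mean for a reader, here is a minimal sketch (not the pandas code) that picks the field widths by format version and splits a block of fixed-width, null-terminated variable names. The helper names are hypothetical, the latin-1 decoding is an assumption, and the sketch only covers the pre-117 formats.

```python
def varname_field_width(format_version: int) -> int:
    """Bytes reserved per variable name, including the null terminator."""
    # 8 + 1 bytes before format 110, 32 + 1 bytes from 110 onwards.
    return 33 if format_version >= 110 else 9


def expansion_length_size(format_version: int) -> int:
    """Size in bytes of the expansion record length field."""
    return 4 if format_version >= 110 else 2


def read_varnames(buf: bytes, nvar: int, format_version: int) -> list[str]:
    """Split a block of fixed-width, null-terminated variable names."""
    width = varname_field_width(format_version)
    if len(buf) < nvar * width:
        raise ValueError("buffer is too short for the requested variables")

    names = []
    for i in range(nvar):
        field = buf[i * width:(i + 1) * width]
        # Keep everything before the first null byte.
        # latin-1 is an assumption; it maps every byte to a character.
        names.append(field.split(b"\x00", 1)[0].decode("latin-1"))
    return names
```

Reading a 110-format file with the 108 widths would misalign every name after the first, so the widths have to be chosen before any of the descriptor block is parsed.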

My understanding is that the 110 and 111 formats are the same, other than different typlist encodings, which would make sense as they're both for Stata 7, but for editions with different limits. It appears that Stata 7/SE was released around a year after Stata 7/IC and Small Stata 7 (see 01feb2002 entry in https://www.stata.com/help.cgi?whatsnew7), which might explain the lack of documentation.

Assuming that the 111 format is implemented correctly, it seems that the 113 format is the same as 111, except that the values used to encode missing values were changed to allow 26 additional missing codes (see https://www.stata.com/help.cgi?whatsnew7to8).
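For a concrete picture of the missing value change, here is a small sketch for the byte type only. The numeric values are assumptions from my understanding of the published formats and are not stated in this thread: before 113 a single code (127) marks a missing byte, and from 113 onwards the codes 101 to 127 map to ., .a through .z. The function name is hypothetical.

```python
import string
from typing import Optional


def byte_missing_label(value: int, format_version: int) -> Optional[str]:
    """Return the missing value label for a stored byte, or None if valid."""
    if format_version >= 113:
        # Assumed layout: 101 is ".", 102..127 are ".a" through ".z".
        if 101 <= value <= 127:
            offset = value - 101
            if offset == 0:
                return "."
            return "." + string.ascii_lowercase[offset - 1]
        return None

    # Assumed layout before 113: one missing code at the top of the range.
    return "." if value == 127 else None
```

This also shows why the valid range of the integer types shrank in 113: the upper 27 values of each type are reserved for the missing codes.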

@cmjcharlton
Contributor Author

In case it helps, here are the changes that I have determined between each format version (excluding changes to the display format codes) from looking at the available documentation:

102 (confirmed as undocumented, but can be inferred from the next version, the Stata 1 manual, and a "history of Stata" article)

  • Data is stored in little-endian format
  • Supports 2-byte and 4-byte integer variables
  • Supports 4-byte and 8-byte floating-point variables
  • Number of variables stored in 2-byte integer
  • Number of observations stored in 2-byte integer
  • Variable and value label names up to 9 characters (including null terminator)
  • Data and variable labels up to 32 characters (including null terminator)
  • Value labels up to 8 characters (null terminator is omitted if label is 8 characters)
  • Variable format information up to 7 characters (including null terminator)
  • Valid string characters are ASCII codes 1-127
  • Single missing value supported for each variable type (.)

103 (documented in Stata 2 manual)

  • Allows a choice of little-endian or big-endian byte ordering
  • Number of observations stored in 4-byte integer
  • Added str1 to str80 string variable types

104 (documentation not yet located - probably in Stata 3 manual)

  • Added byte variable type

105 (documented in Stata 4 and 5 manuals)

  • Added 0 or 17 character time-stamp record stored in 18 characters (including null terminator)
  • Added expansion fields with 2-byte integer records (used to store variable characteristics)
  • Storage for variable format information increased to 12 characters (including null terminator)

108 (documented in Stata 6 manual)

  • Valid string characters are ASCII codes 1-255
  • Data and variable label length increased to 81 characters (including null terminator)
  • Value label length is no longer fixed
  • Underlying missing value code changed for double type

110 (documented in Stata 7 manual)

  • Maximum variable and value label name length increased to 33 characters (including null terminator)
  • Expansion record length field size increased from 2-byte to 4-byte integer

111 (documentation not found - maybe in Stata 7/SE manual if this exists)

  • Variable type codes changed to increase maximum size string variable type to str244

113 (documented in on-line help)

  • Maximum value range reduced for integer types
  • Multiple missing value codes (., .a .. .z) are supported

114 (documented in on-line help)

  • Storage for variable format information increased to 49 characters (including null terminator)

115 (documented in on-line help)

  • Same as 114 (version number increased due to introduction of %tb business date format)

117 (documented in on-line help)

  • Tagged xml-style structure
  • Timestamp and dataset labels store their length as first byte, and no longer include null terminator
  • Stores 8-byte location map for each component of the file structure (after initial release the "varlabs" field was set to zero, but fixed in a subsequent update)
  • Variable type codes changed to increase maximum fixed string variable type to str2045 and introduce strL types
  • Removes generic expansion fields and replaces them with "characteristics" component
  • GSO v field stored as 4-byte integer
  • GSO o field stored as 4-byte integer
  • (v,o) for strL variables packed as (4,4) bytes

118 (documented in on-line help)

  • Number of observations stored in 8-byte integer
  • Dataset label length stored in 2-byte integer
  • Storage for variable format information increased to 57 characters (including null terminator)
  • Strings are now stored as UTF-8, increasing allocation reserved for each character from one byte to four bytes (still null terminated with single character)
  • GSO o field increased to 8-byte integer
  • (v,o) for strL variables packed as (2,6) bytes

119 (documented in on-line help)

  • Number of variables stored in 4-byte integer (as a consequence srtlist records increase from 2-byte to 4-byte integers)
  • (v,o) for strL variables packed as (3,5) bytes

120 (documented in on-line help)

  • As 118, but adds alias variable type

121 (documented in on-line help)

  • As 119, but adds alias variable type
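The (v,o) packing entries above can be sketched as follows. This is a minimal illustration rather than the pandas implementation: the function name `unpack_vo` is hypothetical, the layouts for 120 and 121 are taken from the "As 118" and "As 119" notes, and reading v first and then o in the file's byte order is my assumption about how the 8 bytes are arranged.

```python
# (v bytes, o bytes) per format version, as listed above.
VO_LAYOUT = {
    117: (4, 4),
    118: (2, 6),
    119: (3, 5),
    120: (2, 6),  # "As 118"
    121: (3, 5),  # "As 119"
}


def unpack_vo(
    raw: bytes, format_version: int, byteorder: str = "little"
) -> tuple[int, int]:
    """Split the 8-byte strL reference into its (v, o) parts."""
    if format_version not in VO_LAYOUT:
        raise ValueError(f"format {format_version} has no strL variables")
    if len(raw) != 8:
        raise ValueError("a (v,o) reference is always 8 bytes")

    v_size, _o_size = VO_LAYOUT[format_version]
    # Assumption: v comes first, then o, each in the file's byte order.
    v = int.from_bytes(raw[:v_size], byteorder)
    o = int.from_bytes(raw[v_size:], byteorder)
    return v, o
```

The total stays at 8 bytes in every version; only the split between the variable number and the observation number moves, which follows from the larger variable count in 119 and the larger observation count in 118.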

@bashtage
Contributor

bashtage commented May 8, 2024

Can you rebase and ping on green?

@cmjcharlton
Contributor Author

I have now rebased this, and all checks pass.

@bashtage
Contributor

bashtage commented May 8, 2024

Thanks. LGTM.

@bashtage bashtage merged commit d62d77b into pandas-dev:main May 8, 2024
47 checks passed
Labels
IO Stata read_stata, to_stata
Development

Successfully merging this pull request may close these issues.

ENH: Add support for reading Stata 7 (non-SE) format dta files
4 participants