
Releases: mcaceresb/stata-gtools

gtools-1.10.1

09 Dec 02:03

Release update. New commands and functions, several enhancements, and various bug fixes. Remember to run gtools, upgrade to keep up to date between major updates.

Features

  • New function gstats transform (weights, by allowed):
    Applies a transformation to a variable; that is, y_i = f(x_i) with
    y the target and x the source. For example,

    gstats transform (demean) y = x, by(group)
    

    gives

    n_j  = sum_i 1{group_i = j}
    s_j  = sum_i 1{group_i = j} * x_i
    y_ij = x_i - s_j / n_j
    

    available:

      normalize, standardize: f(x)   = (x - mean(x)) / sd(x)
      demean:                 f(x)   = (x - mean(x))
      demedian:               f(x)   = (x - median(x))
      cumsum:                 f(x_i) = sum_{l = 1}^i x_l
      shift:                  f(x_i) = x_{i - lag} or x_{i + lead}
      rank:                   Similar to egen rank; see docs.
      moving:                 Moving statistics; see docs.
      range:                  Similar to rangestat; see docs.
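
The demean formulas above (n_j, s_j, and y_ij) can be sketched in Python. This is only an illustration of the arithmetic, not the plugin's implementation; the function name `demean_by_group` is hypothetical.

```python
from collections import defaultdict

def demean_by_group(x, group):
    """Return y with y_i = x_i - s_j / n_j, where j is the group of observation i."""
    n = defaultdict(int)    # n_j: number of observations in group j
    s = defaultdict(float)  # s_j: sum of x over group j
    for xi, gi in zip(x, group):
        n[gi] += 1
        s[gi] += xi
    # Subtract each observation's own group mean.
    return [xi - s[gi] / n[gi] for xi, gi in zip(x, group)]

print(demean_by_group([1.0, 3.0, 10.0, 20.0], ["a", "a", "b", "b"]))
# prints [-1.0, 1.0, -5.0, 5.0]
```

The other transforms in the list follow the same two-pass pattern: compute a per-group summary first, then apply it observation by observation.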
    
  • gstats range: Alias for gstats transform (range); see below.

  • gstats moving: Alias for gstats transform (moving); see below.

  • gstats hdfe (alias gstats residualize): Residualize variable
    by absorbing high-dimensional fixed effects.

    • Currently in beta! Use with care; see docs for details.

    • Methods cg (Conjugate Gradient), squarem (SQUAREM), it
      (Irons and Tuck), map (Method of Alternating Projections).

    • Parallel execution of select functions can be enabled at compile
      time via GTOOLSOMP

  • gstats transform:

    • gstats transform (demean) ...

    • gstats transform (demedian) ...

    • gstats transform (normalize) ...

    • gstats transform (cumsum [+/- [varname]]) ...: Sums in the current order by default. The user can request the cumulative sum in ascending or descending order; optionally, the order can be determined by another variable.

    • gstats transform (rank) ...: Option ties() specifies how to break ties (field, track, unique, stableunique).

    • gstats transform (shift [+/-]#) ...: Leads (default; e.g. shift 1 or shift +3) and lags (e.g. shift -2).

    • gstats transform (moving stat lower upper) ...: Moving statistic stat from current observation + lower until current observation + upper; see docs for details.

    • gstats transform (range stat lower upper varname) ...: Moving statistic stat for values of varname in range
      varname[_n] - lower to varname[_n] + upper. Can also specify a
      statistic, e.g. range sd -1.0sd 1.0sd varname to get all values
      within a standard deviation of varname[_n]. See docs for details.

    • gstats transform, auto[()] allows automagically naming
      targets based on the source variable's name and the statistic
      requested. Default is #source#_#stat#.
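
The cumsum and shift definitions above can be sketched in Python for a single group. This is a hedged illustration: the function names are hypothetical, and using None for positions that fall outside the data is an assumption standing in for a missing value.

```python
from itertools import accumulate

def cumsum(x):
    """f(x_i) = sum_{l = 1}^i x_l, taken in the current order."""
    return list(accumulate(x))

def shift(x, k):
    """Positive k is a lead (x_{i + k}); negative k is a lag (x_{i - |k|}).

    Positions outside the data become None (assumed stand-in for missing).
    """
    n = len(x)
    return [x[i + k] if 0 <= i + k < n else None for i in range(n)]

x = [1, 2, 3, 4]
print(cumsum(x))     # prints [1, 3, 6, 10]
print(shift(x, 1))   # prints [2, 3, 4, None]
print(shift(x, -2))  # prints [None, None, 1, 2]
```

With by(), the same functions would be applied within each group separately.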

  • greshape

    • Adds option dropmiss to drop missing rows (case-wise) when
      reshaping long (via long or gather).

    • Closes #58; allows uselabels[(varlist, [exclude])] to optionally
      specify which variables to use labels for (default is all
      variables). The user can also specify the option exclude to
      specify which variables not to do this for.

    • Closes #63: greshape wide/gather allows prefix(...) for custom
      output names.

    • Closes #69. greshape wide/spread now allows labelformat() for
      custom variable labels (only when a single variable is passed to
      key()/j()). The default is #keyvalue# #stublabel#. Available
      placeholders are #stubname#, #stublabel#, #keyname#,
      #keylabel#, #keyvalue#, and #keyvaluelabel#
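
The placeholder substitution described for labelformat() can be pictured with a short Python sketch. The function `format_label` and the sample values are hypothetical; only the placeholder names come from the notes above, and how greshape performs the substitution internally is not stated here.

```python
import re

def format_label(template, values):
    """Replace each #name# placeholder with values[name] in a single pass.

    Placeholders without a matching entry are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))
    return re.sub(r"#(\w+)#", substitute, template)

values = {"keyvalue": 2010, "stublabel": "Household income", "stubname": "inc"}
print(format_label("#keyvalue# #stublabel#", values))  # prints 2010 Household income
print(format_label("#stubname#_#keyvalue#", values))   # prints inc_2010
```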

  • gegen:

    • winsor, winsorize call gstats winsor

    • standardize, normalize, demean, demedian call gstats transform

    • Fixes #67; adds gegen x = rank(varname) [wgt], by(varlist) ties(type)
      via gstats transform (rank) [wgt], by() ties(). Weights are optional.

    • gegen x = moving_stat(y), window(lower upper) calls gstats transform

    • gegen x = range_stat(y), interval(lower[stat] upper[stat] varname) calls gstats transform

  • gcollapse, gegen, gstats tab new functions:

    • geomean for geometric mean.

    • gini, gini dropneg, gini keepneg for the Gini coefficient
      (optionally drop or keep negative values).
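
For reference, the two new statistics can be sketched in Python using their textbook definitions. This is an assumption about the formulas, not a description of the plugin: weights, negative values (dropneg/keepneg), and any small-sample adjustment are not handled, and the function names are hypothetical.

```python
import math

def geomean(x):
    """Geometric mean: exp of the mean of logs (requires strictly positive values)."""
    return math.exp(sum(math.log(v) for v in x) / len(x))

def gini(x):
    """Gini coefficient: sum of |x_i - x_j| over all pairs, divided by 2 * n^2 * mean."""
    n = len(x)
    total = sum(x)
    diffs = sum(abs(a - b) for a in x for b in x)
    # 2 * n^2 * mean(x) simplifies to 2 * n * sum(x).
    return diffs / (2 * n * total)

print(gini([0, 0, 0, 4]))  # prints 0.75
print(gini([1, 1, 1]))     # prints 0.0
```

The pairwise loop is quadratic in the number of observations; it is written this way only to mirror the definition.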

  • noinit option for gcollapse, merge, gegen, gstats (selected),
    gregress (and co.) to prevent targets from being emptied out
    with replace. Prints warning!

Beta

  • Regression models are in beta and not recommended at the moment;
    see docs for details.

Enhancements

  • User must now specify global GTOOLS_BETA to use beta features.

  • Typed (direct/non-hashed) radix sort in API internals

  • Allows the user to specify the temporary directory for files via
    global GTOOLS_TEMPDIR

  • gunique, detail now uses gstats sum, detail

  • Modularized the code base so that aliases are assigned to internal functions instead of the copy/paste if/else branching statements.

  • Categorize documentation into "Data manipulation", "Statistics",
    and "Regression models".

  • Move plugin compilation to GitHub rather than Travis.

  • gtop prints the number of levels in Other and Missing rows by
    default. (With missing it only does it if there's more than
    one type of missing value.)

  • greshape tries to detect repeated stubs and suggests this possibility
    to the user when a stub matches multiple variables.

  • Faster excludeself mean and sum without specified range in gstats transform.

Bug Fixes

  • gstats winsor exits with error if replace and if/in are passed
    (the way it's set up it'd be a bit of a hassle to allow init/noinit).

  • gstats transform, gstats hdfe, gregress (and co.) all now
    initialize their targets to be empty (missing values) with
    if/in and replace.

  • gtop no longer incorrectly replaces the display value if the
    numerical variable has a value label and no missing values. If there
    was a single value this would result in an error: gtop would think
    there was always at least one missing value to replace.

  • gcollapse no longer fails when trying to label the collapsed output
    if the source labels are blank (this can happen for example with data
    transformed to .dta from other formats or programs).

  • gcollapse no longer gives incorrect missing variables list when
    part of that list is called with varlist notation (e.g. x* y is
    passed, where x* exists but y does not).

  • gunique no longer ignores if/in with gen and replace

  • Fixed gegen nunique with multiple inputs

  • Fixed bug where the prefix in gstats was stat_ instead of stat|

  • In gquantiles, data was read incorrectly with by() and weights
    if xtile was not requested. In particular, the data was copied as if
    the target had only one column, but since weights need to be included,
    the target has two columns. This was fixed.

  • Fixed bug where a by variable being used as a source but not a
    target got renamed to the target and was no longer available as a by
    variable. Now a new variable should be created and the by variable
    remains unchanged.

  • Fixed memory leak where the C by variables were not cleared from memory
    if st_info->output was allocated because free code was upgraded from 6
    or 7 to 9. Conditional logic in place said that by variables should not
    be cleared if free code was greater than 7, but that was only meant to
    skip free code 8 and free code 9 in some scripts, but not all. Code 8
    logic was deprecated and now by variables are allocated with code 8, so
    they are always cleared if free code is 6 or higher.

  • by: gegen now generates variables using the by prefix.
    Previously, results were incorrect if the expression inside
    egen assumed that it would be evaluated with by. For example,
    by var: gegen x = mean(max(y, y[1]))

  • Closes #64: Removes head command from greshape tests (done a few
    commits ago but someone noticed before the merge).

  • Closes #68. gegen now allows by: prefix when calling a
    gstats transform function (this is only allowed because these calls
    already require single-variable input, so the by: prefix should not
    present an issue when calling the function).

  • Closes #72: Warning for gegen expressions without by group

  • Closes #74: gstats transform parses abbreviated targets

  • Closes #75: gunique returns 0s in r() when there are no obs

  • Closes #78: if now passed raw/in double-quotes throughout the pipeline

  • Closes #79: Adds disclaimer to benchmarks.

  • Closes #82: cw in gcollapse now working.

  • Closes #84, #83, #73: OSX compilation moved to github.

  • Closes #85: Bug in gegen warning message causes errors in some function calls.

  • Closes #87: For OSX, make now compiles x86_64 and arm64 separately
    then combines via lipo.

  • Various fixes to the docs.

gtools-1.5.1

24 Mar 20:28

Release update. New commands, major features, and various bug fixes. Remember to run gtools, upgrade to keep up to date between major updates.

New Commands

  • gstats winsor is a fast, by-able winsor2 alternative for Winsorizing and trimming data (accepts weights).

  • greshape long and greshape wide are a fast alternative to reshape.

  • greshape spread and greshape gather are analogous to the spread and gather commands from R's tidyr.

  • gstats sum and gstats tab (alias gstats summarize and gstats tabstat) are a fast, by-able alternative to sum, detail and tabstat
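
The difference between Winsorizing and trimming can be sketched in Python. The function below is hypothetical and takes the cutpoints as given; how gstats winsor derives them from percentiles (and how it treats weights) is not covered here. None is an assumed stand-in for a missing value.

```python
def winsorize(x, low, high, trim=False):
    """Clamp values outside [low, high] to the cutpoints.

    With trim=True, values outside the cutpoints become None instead.
    """
    out = []
    for v in x:
        if v < low:
            out.append(None if trim else low)
        elif v > high:
            out.append(None if trim else high)
        else:
            out.append(v)
    return out

x = [1, 5, 9, 100]
print(winsorize(x, 2, 10))             # prints [2, 5, 9, 10]
print(winsorize(x, 2, 10, trim=True))  # prints [None, 5, 9, None]
```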

Enhancements and Features

  • gstats sum or gstats tab with option matasave; this stores the output and by levels in GstatsOutput (custom naming via matasave(name)), an object of class GtoolsResults.

  • gcollapse and gegen now allow the stats:

    • select# and select-#, for the #th smallest or largest value, respectively.
    • rawselect# and rawselect-#, as above but ignoring weights.
    • cv, coefficient of variation, sd/mean
    • variance
    • range, max - min
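
The unweighted versions of these statistics can be sketched in Python. This is an illustration under stated assumptions: the function names are hypothetical, weights are ignored, and the sample (n - 1) formulas are assumed for sd and variance.

```python
import statistics

def select(x, k):
    """k > 0: the k-th smallest value; k < 0: the |k|-th largest value."""
    if k == 0:
        raise ValueError("k must be nonzero")
    s = sorted(x)
    return s[k - 1] if k > 0 else s[k]

def cv(x):
    """Coefficient of variation: sd / mean (sample sd assumed)."""
    return statistics.stdev(x) / statistics.mean(x)

def value_range(x):
    """Range: max - min."""
    return max(x) - min(x)

x = [2.0, 4.0, 4.0, 6.0]
print(select(x, 1), select(x, -1))  # prints 2.0 6.0
print(value_range(x))               # prints 4.0
```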
  • greshape features

    • Preferred syntax is by() and keys() instead of i() and j(); the docs and most of the printouts reflect this.
    • greshape tries to save variable labels, notes, and characteristics when reshaping.
    • greshape, uselabels allows the user to save the source variable labels as levels instead of their names.
    • greshape supports @ syntax.
    • greshape wide additionally supports varlist syntax (but the same stub cannot have both @ and a varlist).
    • greshape long does not support varlist syntax, but the user can pass regexes as stubs with the option match(regex). See the documentation for details.
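
As a rough picture of what gather and spread do, here is a Python sketch that operates on a list of dictionaries. The function names mirror the tidyr verbs, but the signatures and data are hypothetical and do not reflect greshape's syntax or internals.

```python
def gather(rows, columns, key, value):
    """Wide to long: one output row per (input row, gathered column)."""
    long_rows = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in columns}
        for col in columns:
            long_rows.append({**kept, key: col, value: row[col]})
    return long_rows

def spread(rows, key, value):
    """Long to wide: one output row per distinct combination of the other columns."""
    wide = {}
    for row in rows:
        ident = tuple((k, v) for k, v in row.items() if k not in (key, value))
        wide.setdefault(ident, dict(ident))[row[key]] = row[value]
    return list(wide.values())

wide = [{"id": 1, "a": 5, "b": 7}, {"id": 2, "a": 6, "b": 8}]
long_rows = gather(wide, ["a", "b"], key="name", value="val")
print(long_rows[0])                             # prints {'id': 1, 'name': 'a', 'val': 5}
print(spread(long_rows, "name", "val") == wide)  # prints True
```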
  • glevelsof and gtop features

    • glevelsof and gtop both take option matasave (or matasave(name)) to save the variable levels in a mata object (default name is GtoolsByLevels).
    • With option matasave[(name)], r(levels) is not returned; the levels are stored in printed as part of the mata return object (e.g. GtoolsByLevels.printed). The user can save only the raw levels by also adding the silent option.
    • With option matasave[(name)], both gtop, numfmt() and glevelsof, numfmt() do the number formatting in mata, so numfmt() must pass a mata print format instead of a C print format (they are very similar, however).
    • With option matasave[(name)], gtop does not return r(toplevels) either. The frequency table is stored in toplevels as part of the mata return object (e.g. GtoolsByLevels.toplevels).
    • gtop, ntop(.) prints all the levels from largest to smallest; gtop, ntop(-.) prints from smallest to largest; gtop, alpha prints the largest/smallest ntop() levels sorted in variable order (e.g. alphabetically or numerically, depending on the variable type).
    • gtop also stores r(ntop), r(nrows), and r(alpha) as return scalars; if ntop(.) or ntop(-.) are passed, r(ntop) will just be r(J).
    • Both gtop and glevelsof should handle embedded characters better. Printing is still a problem but they get copied to the return values properly.
  • gstats is a general-purpose wrapper for misc functions.

  • lgtools.mlib added with some pre-compiled mata functions.

  • Any function that allows results to be saved in mata allows the mata object to call .desc() to get more info on the object.

  • Faster hash sort with integer bijection (two-pass radix sorts for smaller integers; undocumented option _ctolerance() allows the user to force the regular counting sort).

  • Faster index copy when every observation is read (simply assign the index pointer to st_info->index)
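
The idea of sorting through an integer bijection can be sketched in Python: several integer keys are mapped to one integer, which is then ordered with a counting sort. This is a generic illustration of the technique, assuming known minimums and ranges for each key; it is not the plugin's C code, and the function names are hypothetical.

```python
def bijection(keys, mins, ranges):
    """Map a tuple of integers to a single integer (mixed-radix encoding)."""
    code = 0
    for k, lo, r in zip(keys, mins, ranges):
        code = code * r + (k - lo)
    return code

def counting_sort_index(codes, size):
    """Return the stable sort order of codes, all assumed to lie in [0, size)."""
    counts = [0] * (size + 1)
    for c in codes:
        counts[c + 1] += 1
    # Prefix sums: counts[c] becomes the first output slot for code c.
    for i in range(size):
        counts[i + 1] += counts[i]
    order = [0] * len(codes)
    for i, c in enumerate(codes):
        order[counts[c]] = i
        counts[c] += 1
    return order

rows = [(2, 1), (1, 3), (2, 0), (1, 1)]
mins, ranges = (1, 0), (2, 4)  # first key in 1..2, second key in 0..3
codes = [bijection(r, mins, ranges) for r in rows]
order = counting_sort_index(codes, size=2 * 4)
print([rows[i] for i in order])  # prints [(1, 1), (1, 3), (2, 0), (2, 1)]
```

The encoding preserves lexicographic order, so sorting the codes sorts the original key tuples.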

Bug Fixes

  • Stata 14.0 no longer tries to load SPI version 3 (loads version 2).

  • SpookyHash code compiled directly as part of the plugin. Might fix #35 (deleted all ancillary files and code related to spookyhash.dll).

  • gtop, glevelsof, and gcontract parse wildcards before adding any temporary variables, ensuring the latter don't get included in internal function calls.

  • Removed locale as a dependency; comma printing done manually. This fixes a bug where, on certain systems, locale would get reset and cause some internal Stata numbers to interpret decimals via comma, that is, 95.0 would become 95,0 and cause problems down the line.

  • Minor bug fix in gtop; inverted levels were not correctly sorted with weights. The levels themselves were OK, however.

  • gcollapse no longer crashes when rawstat does not match any entries.
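
Manual comma printing, as mentioned in the locale fix above, amounts to grouping digits by hand instead of relying on locale settings. A minimal Python sketch for integers follows; the function name is hypothetical and this is not the plugin's C routine.

```python
def format_with_commas(n):
    """Format an integer with thousands separators without using the locale."""
    digits = str(abs(n))
    groups = []
    while len(digits) > 3:
        groups.append(digits[-3:])  # peel off the last three digits
        digits = digits[:-3]
    groups.append(digits)
    return ("-" if n < 0 else "") + ",".join(reversed(groups))

print(format_with_commas(1234567))  # prints 1,234,567
print(format_with_commas(-1000))    # prints -1,000
```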

gtools-1.1.2

26 Nov 05:32

Release update. Various bug fixes and minor improvements. Remember to run gtools, upgrade to keep up to date between major updates.

Enhancements and Features

  • Improved variable parsing in general (including '-' handling).

  • gcollapse (nmissing) counts the number of missing values (weights allowed).

  • gcollapse (nansum) and gcollapse (rawnansum) preserve missing value (NaN) information: If all entries are missing, the output is also missing (instead of 0). This is a more flexible version of the previous implementation, gcollapse, missing.

  • gcollapse, merge and gegen now accept the undocumented option _subtract to subtract the result from the source variable. This is an option meant for advanced users, so no additional checks have been added. If you use it then you know what you're doing.

  • Added option sumcheck to create sum targets from integer source as the smallest type that is reasonable given the total sum.
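
The difference between a regular sum, nansum, and nmissing described above can be sketched in Python. None is an assumed stand-in for a missing value, and the function names are hypothetical; weights are not handled.

```python
def plain_sum(x):
    """Sum of non-missing entries; all-missing input gives 0."""
    return sum(v for v in x if v is not None)

def nansum(x):
    """Sum of non-missing entries; all-missing input gives missing (None)."""
    present = [v for v in x if v is not None]
    return sum(present) if present else None

def nmissing(x):
    """Number of missing entries."""
    return sum(1 for v in x if v is None)

all_missing = [None, None]
print(plain_sum(all_missing))  # prints 0
print(nansum(all_missing))     # prints None
print(nmissing([1, None, 3]))  # prints 1
```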

Bug fixes

  • gisid no longer gives wrong results when the data is partially ordered. Partially (weakly) sorted data in prior versions was incorrectly counted as totally sorted.

  • gcollapse (rawsum) gives 0 if all entries are missing.

  • gcollapse and gegen correctly parse types with weights for counts and sums. This includes gcollapse, sumcheck

  • Recast upgrade (bug fix from 1.0.5) now done per-variable.

  • gcollapse no longer gives wrong results when count is the first of multiple stats requested for a given source variable. Previous versions wrongly recast the source as long.

  • Gtools exits with error if _N > 2^31-1 and points the user to the pertinent bug report.

gtools-1.0.1

27 Jul 18:13

First official release. Various bug fixes and minor improvements. gtools can now upgrade itself via gtools, upgrade and run test scripts via gtools, test.

gtools-rc3

21 Jul 20:59
Pre-release

Feature freeze! Third release candidate.

  • Added partial strL variable support (not binary data; see options compress and forcestrl)
  • Added gduplicates as a duplicates replacement.
  • Added option mlast to hashsort to recover the default behavior of gsort.
  • fasterxtile and gquantiles now accept weights (including by())
  • gtop (and gtoplevelsof) now accept weights.
  • gduplicates now accepts weights.
  • glevelsof now accepts option nolocal to skip saving the levels to a local variable; the levels can be stored in a variable list via the option gen().

Once tests are passing from this tag I will submit version 1.0 to SSC.

gtools-rc2

25 Apr 04:43
Pre-release

Second release candidate. Added skew (Skewness) and kurt (Kurtosis) to gcollapse and gegen; added rawsum and rawstat() to selectively apply weights to gcollapse targets. Added basic debugging info to the code base and improved the in-code comments. Once I improve the test coverage I should be ready to submit to SSC.

gtools-rc1

06 Mar 20:24
Pre-release

First release candidate for submission to SSC. Now that weights have been added, no major new functionality will appear (though some minor features might be added, and bug fixes will, of course, be incorporated).