IPCA Package Documentation

This package provides a Python (3.6+) implementation of the Instrumented Principal Components Analysis framework by Kelly, Pruitt, Su (2017) [1, 2].

class ipca.InstrumentedPCA(n_factors=1, intercept=False, max_iter=10000, iter_tol=1e-05, alpha=0.0, l1_ratio=1.0, n_jobs=1, backend='loky')

Bases: sklearn.base.BaseEstimator

This class implements the IPCA algorithm by Kelly, Pruitt, Su (2017).

Parameters
  • n_factors (int, default=1) – The total number of factors to estimate. Note that the number of estimated factors is automatically reduced by the number of pre-specified factors. For example, if n_factors = 2 and one pre-specified factor is passed, then InstrumentedPCA will estimate one factor in addition to the pre-specified factor.

  • intercept (boolean, default=False) – Determines whether the model is estimated with or without an intercept

  • max_iter (int, default=10000) – Maximum number of alternating least squares updates before the estimation is stopped

  • iter_tol (float, default=1e-05) – Tolerance threshold for stopping the alternating least squares procedure

  • alpha (scalar, default=0.0) – Regularizing constant for Gamma estimation. If set to zero, the estimation is not regularized.

  • l1_ratio (scalar, default=1.0) – Ratio of l1 and l2 penalties for the elastic net Gamma fit.

  • n_jobs (scalar, default=1) – Number of jobs for the F step estimation in ALS. If set to one, no parallelization is done.

  • backend (str, default='loky') – Label for the joblib backend used for the F step in ALS

BS_Walpha(ndraws=1000, n_jobs=1, backend='loky')

Bootstrap inference on the hypothesis Gamma_alpha = 0

Parameters
  • ndraws (integer, default=1000) – Number of bootstrap draws and re-estimations to be performed

  • backend (optional) – Value is either ‘loky’ or ‘multiprocessing’

  • n_jobs (integer) – Number of workers to be used. If -1, all available workers are used.

Returns

pval – P-value from the hypothesis test H0: Gamma_alpha=0

Return type

float
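
How a bootstrap p-value of this kind is formed can be illustrated with a minimal sketch. It assumes that the test statistic is the sum of squared intercept loadings, Gamma_alpha' Gamma_alpha, and that the p-value is the share of bootstrap statistics exceeding the observed one. The helper names wald_stat and bootstrap_pvalue are hypothetical, and the resampling and re-estimation that produce the bootstrap statistics are not shown.

```python
import numpy as np

def wald_stat(gamma_alpha):
    """Sum of squared intercept loadings, Gamma_alpha' Gamma_alpha."""
    gamma_alpha = np.asarray(gamma_alpha, dtype=float)
    return float(gamma_alpha @ gamma_alpha)

def bootstrap_pvalue(stat_obs, stat_boot):
    """Share of bootstrap statistics strictly larger than the observed one."""
    stat_boot = np.asarray(stat_boot, dtype=float)
    return float(np.mean(stat_boot > stat_obs))

# Toy numbers only: an observed loading vector and four bootstrap statistics.
w_obs = wald_stat([0.6, 0.8])
pval = bootstrap_pvalue(w_obs, [0.1, 0.5, 2.0, 3.0])
```

A small p-value means that few re-estimations under the null produce a statistic as large as the observed one.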

BS_Wbeta(l, ndraws=1000, n_jobs=1, backend='loky')

Test of instrument significance. Bootstrap inference on the hypothesis l-th column of Gamma_beta = 0.

Parameters
  • l (integer) – Position of the characteristics for which the bootstrap is to be carried out. For example, if there are 10 characteristics, l is in the range 0 to 9 (left-/right-inclusive).

  • ndraws (integer, default=1000) – Number of bootstrap draws and re-estimations to be performed

  • n_jobs (integer) – Number of workers to be used for multiprocessing. If -1, all available workers are used.

  • backend (optional) – Value is either ‘loky’ or ‘multiprocessing’

Returns

pval – P-value from the hypothesis test H0: l-th column of Gamma_beta = 0

Return type

float

BS_Wdelta(ndraws=1000, n_jobs=1, backend='loky')

Test of PSF significance. Bootstrap inference on the hypothesis Gamma_delta = 0. Assumes that only one PSF is used and no intercept is in use.

Parameters
  • ndraws (integer, default=1000) – Number of bootstrap draws and re-estimations to be performed

  • n_jobs (integer) – Number of workers to be used for multiprocessing. If -1, all available workers are used.

  • backend (optional) – Value is either ‘loky’ or ‘multiprocessing’

Returns

pval – P-value from the hypothesis test H0: Gamma_delta=0

Return type

float

fit(X, y, indices=None, PSF=None, Gamma=None, Factors=None, data_type='portfolio', label_ind=False, **kwargs)

Fits the regressor to the data using an alternating least squares scheme.

Parameters
  • X (numpy array or pandas DataFrame) – matrix of characteristics where each row corresponds to an entity-time pair in indices. The number of characteristics (columns here) used as instruments is L.

    If given as a DataFrame, we assume that it contains a MultiIndex mapping to each entity-time pair

  • y (numpy array or pandas Series) – dependent variable where indices correspond to those in X

    If given as a Series, we assume that it contains a MultiIndex mapping to each entity-time pair

  • indices (numpy array, optional) – array containing the panel indices. Should consist of two columns:

    • Column 1: entity id (i)

    • Column 2: time index (t)

    The panel may be unbalanced. The number of unique entities is n_samples, the number of unique dates is T, and the number of characteristics used as instruments is L.

  • PSF (numpy array, optional) – Set of pre-specified factors as matrix of dimension (M, T)

  • Gamma (numpy array, optional) – If provided, starting values for Gamma (see Notes)

  • Factors (numpy array, optional) – If provided, starting values for Factors (see Notes)

  • data_type (str) – label for data-type used for ALS estimation, one of the following:

    1. panel

    ALS uses the untransformed X and y for the estimation.

    This is currently marginally slower than the portfolio estimation but is necessary when performing regularized estimation (alpha > 0).

    2. portfolio

    ALS uses a matrix of characteristic weighted portfolios (Q) as well as a matrix of weights (W) and count of non-missing observations for each time period (val_obs) for the estimation.

    See _build_portfolio for details on how these variables are formed from the initial X and y.

    Currently, the bootstrap procedure is only implemented in terms of the portfolio data_type.
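
The portfolio inputs described above can be sketched with NumPy alone. The sketch assumes that, for each period t with N_t observations, Q_t = X_t' y_t / N_t and W_t = X_t' X_t / N_t, and that val_obs holds N_t. The function build_portfolio is a hypothetical stand-in and is not the package's _build_portfolio; it also assumes X and y contain no missing values.

```python
import numpy as np

def build_portfolio(X, y, indices):
    """Form Q (L, T), W (L, L, T) and val_obs (T,) from stacked panel data."""
    dates = np.unique(indices[:, 1])
    L, T = X.shape[1], dates.size
    Q = np.full((L, T), np.nan)
    W = np.full((L, L, T), np.nan)
    val_obs = np.zeros(T)
    for t, date in enumerate(dates):
        rows = indices[:, 1] == date      # observations dated t
        X_t, y_t = X[rows], y[rows]
        val_obs[t] = rows.sum()
        Q[:, t] = X_t.T @ y_t / val_obs[t]
        W[:, :, t] = X_t.T @ X_t / val_obs[t]
    return Q, W, val_obs

# Unbalanced toy panel: entity 2 is missing at time 1.
indices = np.array([[0, 0], [1, 0], [2, 0], [0, 1], [1, 1]])
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
Q, W, val_obs = build_portfolio(X, y, indices)
```

Because every period is reduced to an L-vector and an L-by-L matrix, the size of these inputs does not grow with the number of entities.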

Returns

Return type

self

Notes

Updates InstrumentedPCA instances to include param estimates:

Gamma (numpy array)

Array with dimensions (L, n_factors) containing the mapping between characteristics and factor loadings. If there are M many pre-specified factors in the model then the matrix returned is of dimension (L, (n_factors+M)). If an intercept is included in the model, its loadings are returned in the last column of Gamma.

Factors (numpy array)

Array with dimensions (n_factors, T) containing the estimated factors. If pre-specified factors were passed the returned array is of dimension ((n_factors - M), T), corresponding to the n_factors - M many factors estimated on top of the pre-specified ones.
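
The two updates that an alternating least squares scheme of this kind cycles through can be sketched on portfolio inputs. This is a minimal illustration under stated assumptions, not the package's implementation: it assumes the model Q_t = W_t Gamma F_t, no pre-specified factors, no intercept, and no regularization, and it omits any normalization of Gamma and Factors. The names f_step and gamma_step are hypothetical.

```python
import numpy as np

def f_step(Gamma, Q, W):
    """Factor update: F_t = (Gamma' W_t Gamma)^{-1} Gamma' Q_t for each t."""
    K, T = Gamma.shape[1], Q.shape[1]
    F = np.empty((K, T))
    for t in range(T):
        lhs = Gamma.T @ W[:, :, t] @ Gamma
        F[:, t] = np.linalg.solve(lhs, Gamma.T @ Q[:, t])
    return F

def gamma_step(F, Q, W, val_obs):
    """Gamma update: one stacked least squares solve over all periods."""
    L, T = Q.shape
    K = F.shape[0]
    numer = np.zeros(L * K)
    denom = np.zeros((L * K, L * K))
    for t in range(T):
        numer += np.kron(Q[:, t], F[:, t]) * val_obs[t]
        denom += np.kron(W[:, :, t], np.outer(F[:, t], F[:, t])) * val_obs[t]
    return np.linalg.solve(denom, numer).reshape(L, K)

# Noiseless toy data: Q_t = W_t Gamma F_t, so each step recovers the truth.
rng = np.random.default_rng(0)
L, K, T, N = 3, 2, 6, 20
Gamma_true = rng.normal(size=(L, K))
F_true = rng.normal(size=(K, T))
W = np.empty((L, L, T))
Q = np.empty((L, T))
for t in range(T):
    X_t = rng.normal(size=(N, L))
    W[:, :, t] = X_t.T @ X_t / N
    Q[:, t] = W[:, :, t] @ Gamma_true @ F_true[:, t]
val_obs = np.full(T, N)

F_hat = f_step(Gamma_true, Q, W)
Gamma_hat = gamma_step(F_true, Q, W, val_obs)
```

A full estimation would alternate the two steps from starting values until the change between iterations falls below iter_tol or max_iter updates have been made.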

fit_path(X, y, indices=None, PSF=None, alpha_l=None, n_splits=10, split_method=<class 'sklearn.model_selection._split.GroupKFold'>, n_jobs=1, backend='loky', **kwargs)

Fit a path of elastic net fits for various regularizing constants

Parameters
  • X (numpy array or pandas DataFrame) – matrix of characteristics where each row corresponds to an entity-time pair in indices. The number of characteristics (columns here) used as instruments is L.

    If given as a DataFrame, we assume that it contains a MultiIndex mapping to each entity-time pair

  • y (numpy array or pandas Series) – dependent variable where indices correspond to those in X

    If given as a Series, we assume that it contains a MultiIndex mapping to each entity-time pair

  • indices (numpy array, optional) – array containing the panel indices. Should consist of two columns:

    • Column 1: entity id (i)

    • Column 2: time index (t)

    The panel may be unbalanced. The number of unique entities is n_samples, the number of unique dates is T, and the number of characteristics used as instruments is L.

  • PSF (numpy array, optional) – Set of pre-specified factors as matrix of dimension (M, T)

  • alpha_l (iterable, optional) – list of regularizing constants to use for path

  • n_splits (scalar) – number of CV partitions

  • split_method (sklearn cross-validation generator factory) – method to generate CV partitions

  • n_jobs (scalar) – number of jobs for parallel CV estimation

  • backend (str) – label for joblib backend

Returns

cvmse – array of dimension (P x (C + 1)) where P is the number of regularizing constants and C is the number of CV partitions

Return type

numpy matrix
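
One way to use the returned array is to average the errors across partitions and pick the constant with the lowest mean. The sketch below is hedged: the column layout is an assumption (column 0 holding the regularizing constant and columns 1 to C holding the per-partition mean squared errors), since the layout is not stated here, and select_alpha is a hypothetical helper. The numbers are made up for illustration.

```python
import numpy as np

def select_alpha(cvmse):
    """Pick the regularizing constant with the lowest mean CV error.

    Assumes column 0 holds the constants and columns 1..C hold fold MSEs.
    """
    cvmse = np.asarray(cvmse, dtype=float)
    mean_mse = cvmse[:, 1:].mean(axis=1)
    best = int(np.argmin(mean_mse))
    return cvmse[best, 0], mean_mse

# Toy array with P = 3 constants and C = 2 partitions (made-up numbers).
toy = np.array([[0.0, 1.0, 1.2],
                [0.1, 0.8, 0.9],
                [1.0, 1.5, 1.4]])
best_alpha, mean_mse = select_alpha(toy)
```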

get_factors(label_ind=False)

Returns a tuple containing Gamma and Factors.

Parameters

label_ind (bool) – If True, Gamma and Factors are returned as pandas DataFrames with index information applied

Returns

tuple containing Gamma and Factors

Return type

tuple

predict(X=None, indices=None, W=None, mean_factor=False, data_type='panel', label_ind=False)

Wrapper around the data-type-specific predict methods (see predict_panel and predict_portfolio).

Parameters
  • X (numpy array or pandas DataFrame, optional) – matrix of characteristics where each row corresponds to an entity-time pair in indices. The number of characteristics (columns here) used as instruments is L.

    If given as a DataFrame, we assume that it contains a MultiIndex mapping to each entity-time pair

    If None we use the values associated with the current model

  • indices (numpy array, optional) – array containing the panel indices. Should consist of two columns:

    • Column 1: entity id (i)

    • Column 2: time index (t)

    The panel may be unbalanced. The number of unique entities is n_samples, the number of unique dates is T, and the number of characteristics used as instruments is L.

    If None we use the values associated with the current model

  • W (numpy array, optional) – portfolio weight matrix of dimension (L, L, T)

    If None, we use the values associated with the current model

  • mean_factor (boolean) – If true, the estimated factors are averaged in the time-series before prediction.

  • data_type (str) – label for data-type used for prediction, one of the following:

    1. panel

    Uses the untransformed X and y for the estimation.

    2. portfolio

    Uses a matrix of characteristic weighted portfolios (Q) as well as a matrix of weights (W) and count of non-missing observations for each time period (val_obs) for the estimation.

    See _build_portfolio for details on how these variables are formed from the initial X and y.

  • label_ind (bool) – whether to apply the indices to fitted values and return pandas Series

Returns

The exact value returned depends on two things:

  1. The data_type

    a. If panel data_type is specified, this will be a series of predicted values for the panel y

    b. If portfolio data_type is specified, this will be a matrix of predicted characteristic-formed portfolios (Q)

  2. label_ind

    If label_ind is True, we return pandas variants of the predicted values. If not, we return the underlying numpy arrays.

Return type

numpy array or pandas DataFrame/Series

predictOOS(X=None, y=None, indices=None, mean_factor=False)

Predicts time t+1 observation using an out-of-sample design.

Parameters
  • X (numpy array) – matrix of stacked data. Each row corresponds to an observation (i,t) where i denotes the entity index and t denotes the time index. All data must correspond to time t, i.e. all observations occur on the same date. If an observation contains missing data, NaN will be returned. Note that the number of characteristics (L) passed has to match the number of characteristics used when fitting the regressor. The columns of the panel are organized in the following order:

    • Column 1: entity id (i)

    • Column 2: time index (t)

    • Column 3 to column 3+L: characteristics.

  • y (numpy array) – dependent variable where indices correspond to those in X

  • indices (numpy array) – array containing the panel indices. Should consist of two columns:

    • Column 1: entity id (i)

    • Column 2: time index (t)

    The panel may be unbalanced. The number of unique entities is n_samples, the number of unique dates is T, and the number of characteristics used as instruments is L.

  • mean_factor (boolean) – If true, the estimated factors are averaged in the time-series before prediction.

Returns

ypred – The length of the returned array matches the length of data. A NaN will be returned if there is missing characteristic information.

Return type

numpy array

predict_panel(X, indices, T, mean_factor=False)

Predicts fitted values for a previously fitted regressor + panel data

Parameters
  • X (numpy array) – matrix of characteristics where each row corresponds to an entity-time pair in indices. The number of characteristics (columns here) used as instruments is L.

  • indices (numpy array) – array containing the panel indices. Should consist of two columns:

    • Column 1: entity id (i)

    • Column 2: time index (t)

    The panel may be unbalanced. The number of unique entities is n_samples, the number of unique dates is T, and the number of characteristics used as instruments is L.

  • T (scalar) – number of time periods in X

  • mean_factor (boolean) – If true, the estimated factors are averaged in the time-series before prediction.

Returns

ypred – The length of the returned array matches the length of data. A NaN will be returned if there is missing characteristic information.

Return type

numpy array
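
The panel prediction can be sketched as the fitted value x_it' Gamma f_t for each row. This is a minimal illustration, assuming that the columns of Factors are ordered by the sorted unique dates in indices and that there are no pre-specified factors or intercept. The helper predict_panel_sketch is hypothetical and takes Gamma and Factors explicitly rather than reading them from a fitted estimator.

```python
import numpy as np

def predict_panel_sketch(X, indices, Gamma, Factors, mean_factor=False):
    """Fitted values x_it' Gamma f_t for each row (hypothetical helper)."""
    # Position of each row's date among the sorted unique dates.
    _, t_pos = np.unique(indices[:, 1], return_inverse=True)
    if mean_factor:
        # Use the time-series average factor for every row.
        F_rows = np.tile(Factors.mean(axis=1), (X.shape[0], 1))
    else:
        F_rows = Factors[:, t_pos].T
    beta = X @ Gamma            # (n_obs, K) loadings implied by characteristics
    return np.sum(beta * F_rows, axis=1)

Gamma = np.array([[1.0, 0.0], [0.0, 2.0]])       # (L, K)
Factors = np.array([[1.0, 3.0], [2.0, 4.0]])     # (K, T)
indices = np.array([[0, 0], [1, 0], [0, 1]])
X = np.array([[1.0, 1.0], [np.nan, 1.0], [2.0, 0.5]])
ypred = predict_panel_sketch(X, indices, Gamma, Factors)
ypred_mean = predict_panel_sketch(X, indices, Gamma, Factors, mean_factor=True)
```

The second row has a missing characteristic, so its prediction is NaN while the other rows are unaffected.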

predict_portfolio(W, L, T, mean_factor=False)

Predicts fitted values for a previously fitted regressor + portfolios

Parameters
  • W (numpy array) – portfolio weight matrix of dimension (L, L, T)

  • L (scalar) – number of characteristics

  • T (scalar) – number of time periods

  • mean_factor (boolean) – If true, the estimated factors are averaged in the time-series before prediction.

Returns

Qpred – Same dimensions as the characteristic-formed portfolios (Q)

Return type

numpy array
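
A portfolio-level prediction of this form can be written in one line with einsum. The sketch assumes Qpred[:, t] = W[:, :, t] Gamma f_t with no pre-specified factors or intercept. The helper predict_portfolio_sketch is hypothetical and takes Gamma and Factors as explicit arguments.

```python
import numpy as np

def predict_portfolio_sketch(W, Gamma, Factors, mean_factor=False):
    """Qpred[:, t] = W[:, :, t] @ Gamma @ f_t (hypothetical helper)."""
    T = W.shape[2]
    if mean_factor:
        # Replace every f_t by the time-series average of the factors.
        Factors = np.repeat(Factors.mean(axis=1, keepdims=True), T, axis=1)
    return np.einsum("lmt,mk,kt->lt", W, Gamma, Factors)

rng = np.random.default_rng(1)
L, K, T = 3, 2, 4
W = rng.normal(size=(L, L, T))
Gamma = rng.normal(size=(L, K))
Factors = rng.normal(size=(K, T))
Qpred = predict_portfolio_sketch(W, Gamma, Factors)
Qpred_mean = predict_portfolio_sketch(W, Gamma, Factors, mean_factor=True)
```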

score(X, y=None, indices=None, mean_factor=False, data_type='panel')

Generates the R^2 of the fitted model.

Parameters
  • X (numpy array or pandas DataFrame) – matrix of characteristics where each row corresponds to an entity-time pair in indices. The number of characteristics (columns here) used as instruments is L.

    If given as a DataFrame, we assume that it contains a MultiIndex mapping to each entity-time pair

    If None we use the values associated with the current model

  • y (numpy array or pandas Series, optional) – dependent variable where indices correspond to those in X

    If given as a Series, we assume that it contains a MultiIndex mapping to each entity-time pair

  • indices (numpy array, optional) – array containing the panel indices. Should consist of two columns:

    • Column 1: entity id (i)

    • Column 2: time index (t)

    The panel may be unbalanced. The number of unique entities is n_samples, the number of unique dates is T, and the number of characteristics used as instruments is L.

    If None we use the values associated with the current model

  • mean_factor (boolean) – If true, the estimated factors are averaged in the time-series before prediction.

  • data_type (str) – label for data-type used for prediction, one of the following:

    1. panel

    Uses the untransformed X and y for the estimation.

    2. portfolio

    Uses a matrix of characteristic weighted portfolios (Q) as well as a matrix of weights (W) and count of non-missing observations for each time period (val_obs) for the estimation.

    See _build_portfolio for details on how these variables are formed from the initial X and y.

  • label_ind (bool) – whether to apply the indices to fitted values and return pandas Series

Returns

r2 – summary of model performance

Return type

scalar
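
The exact formula behind this R^2 is not stated here. A common choice for IPCA is the total R^2 with an uncentered denominator, 1 - sum((y - yhat)^2) / sum(y^2), and the sketch below assumes that definition. The helper total_r2 is hypothetical and drops rows where either the observed or the predicted value is missing.

```python
import numpy as np

def total_r2(y, yhat):
    """1 - sum((y - yhat)^2) / sum(y^2), using rows where both are observed."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    ok = ~(np.isnan(y) | np.isnan(yhat))
    return 1.0 - np.sum((y[ok] - yhat[ok]) ** 2) / np.sum(y[ok] ** 2)

y = np.array([1.0, -2.0, 0.5, np.nan])
r2_perfect = total_r2(y, y)                  # predictions equal to y
r2_zero = total_r2(y, np.zeros_like(y))      # predictions of zero
```

Under this definition the benchmark is a prediction of zero rather than the sample mean, so a model that always predicts zero scores exactly zero.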
