IPCA Package Documentation
This package provides a Python (3.6+) implementation of the Instrumented Principal Components Analysis framework by Kelly, Pruitt, Su (2017) [1, 2].
-
class ipca.InstrumentedPCA(n_factors=1, intercept=False, max_iter=10000, iter_tol=1e-05, alpha=0.0, l1_ratio=1.0, n_jobs=1, backend='loky')
Bases: sklearn.base.BaseEstimator
This class implements the IPCA algorithm by Kelly, Pruitt, Su (2017).
- Parameters
n_factors (int, default=1) – The total number of factors to estimate. Note that the number of estimated factors is automatically reduced by the number of pre-specified factors. For example, if n_factors = 2 and one pre-specified factor is passed, InstrumentedPCA estimates one factor in addition to the pre-specified factor.
intercept (boolean, default=False) – Determines whether the model is estimated with or without an intercept
max_iter (int, default=10000) – Maximum number of alternating least squares updates before the estimation is stopped
iter_tol (float, default=1e-05) – Tolerance threshold for stopping the alternating least squares procedure
alpha (scalar, default=0.0) – Regularizing constant for Gamma estimation. If set to zero, the estimation is not regularized.
l1_ratio (scalar, default=1.0) – Ratio of l1 and l2 penalties for elastic net Gamma fit.
n_jobs (scalar, default=1) – Number of jobs for F step estimation in ALS. If set to one, no parallelization is done.
backend (str, default='loky') – Label for the Joblib backend used for the F step in ALS
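The factor accounting described for n_factors can be sketched as follows. The keyword arguments mirror the signature documented above; the helper n_latent_factors is hypothetical and is not part of the package. The import is guarded so the snippet also runs where ipca is not installed.

```python
# Keyword arguments copied from the documented constructor signature.
params = dict(n_factors=2, intercept=False, max_iter=10000, iter_tol=1e-05,
              alpha=0.0, l1_ratio=1.0, n_jobs=1, backend="loky")


def n_latent_factors(n_factors, n_psf):
    """Hypothetical helper: number of factors left to estimate once
    n_psf pre-specified factors are counted against n_factors."""
    if n_psf > n_factors:
        raise ValueError("more pre-specified factors than n_factors")
    return n_factors - n_psf


try:
    from ipca import InstrumentedPCA
    model = InstrumentedPCA(**params)
except ImportError:
    model = None  # ipca is not installed; the accounting below still holds

# With n_factors=2 and one pre-specified factor, one factor is estimated.
print(n_latent_factors(params["n_factors"], n_psf=1))
```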
-
BS_Walpha(ndraws=1000, n_jobs=1, backend='loky')
Bootstrap inference on the hypothesis Gamma_alpha = 0.
- Parameters
ndraws (integer, default=1000) – Number of bootstrap draws and re-estimations to be performed
backend (optional) – Value is either ‘loky’ or ‘multiprocessing’
n_jobs (integer) – Number of workers to be used. If -1, all available workers are used.
- Returns
pval – P-value from the hypothesis test H0: Gamma_alpha=0
- Return type
float
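As a rough illustration of how a bootstrap p-value of this kind is typically formed, the sketch below compares an observed test statistic with statistics from re-estimated bootstrap samples. It assumes the p-value is the share of bootstrap draws that exceed the observed statistic; the function name and the chi-square draws are illustrative stand-ins, not the package's internals.

```python
import numpy as np


def bootstrap_pvalue(stat_observed, stats_bootstrap):
    """Share of bootstrap statistics strictly above the observed one
    (assumed convention, not taken from the package source)."""
    stats_bootstrap = np.asarray(stats_bootstrap, dtype=float)
    return float(np.mean(stats_bootstrap > stat_observed))


# Stand-in for ndraws=1000 statistics obtained under the null hypothesis.
rng = np.random.default_rng(0)
draws = rng.chisquare(df=3, size=1000)
pval = bootstrap_pvalue(20.0, draws)
```

A small p-value means few bootstrap samples generated under H0 produce a statistic as large as the observed one.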
-
BS_Wbeta(l, ndraws=1000, n_jobs=1, backend='loky')
Test of instrument significance. Bootstrap inference on the hypothesis l-th column of Gamma_beta = 0.
- Parameters
l (integer) – Position of the characteristics for which the bootstrap is to be carried out. For example, if there are 10 characteristics, l is in the range 0 to 9 (left-/right-inclusive).
ndraws (integer, default=1000) – Number of bootstrap draws and re-estimations to be performed
n_jobs (integer) – Number of workers to be used for multiprocessing. If -1, all available workers are used.
backend (optional) – Value is either ‘loky’ or ‘multiprocessing’
- Returns
pval – P-value from the hypothesis test H0: l-th column of Gamma_beta = 0
- Return type
float
-
BS_Wdelta(ndraws=1000, n_jobs=1, backend='loky')
Test of PSF significance. Bootstrap inference on the hypothesis Gamma_delta = 0. Assumes that only one pre-specified factor (PSF) is used and no intercept is in use.
- Parameters
ndraws (integer, default=1000) – Number of bootstrap draws and re-estimations to be performed
n_jobs (integer) – Number of workers to be used for multiprocessing. If -1, all available workers are used.
backend (optional) – Value is either ‘loky’ or ‘multiprocessing’
- Returns
pval – P-value from the hypothesis test H0: Gamma_delta=0
- Return type
float
-
fit(X, y, indices=None, PSF=None, Gamma=None, Factors=None, data_type='portfolio', label_ind=False, **kwargs)
Fits the regressor to the data using an alternating least squares scheme.
- Parameters
X (numpy array or pandas DataFrame) – matrix of characteristics where each row corresponds to an entity-time pair in indices. The number of characteristics (columns here) used as instruments is L.
If given as a DataFrame, we assume that it contains a MultiIndex mapping to each entity-time pair
y (numpy array or pandas Series) – dependent variable where indices correspond to those in X
If given as a Series, we assume that it contains a MultiIndex mapping to each entity-time pair
indices (numpy array, optional) – array containing the panel indices. Should consist of two columns:
Column 1: entity id (i)
Column 2: time index (t)
The panel may be unbalanced. The number of unique entities is n_samples, the number of unique dates is T, and the number of characteristics used as instruments is L.
PSF (numpy array, optional) – Set of pre-specified factors as matrix of dimension (M, T)
Gamma (numpy array, optional) – If provided, starting values for Gamma (see Notes)
Factors (numpy array, optional) – If provided, starting values for Factors (see Notes)
data_type (str) – label for data-type used for ALS estimation, one of the following:
panel
ALS uses the untransformed X and y for the estimation.
This is currently marginally slower than the portfolio estimation but is necessary when performing regularized estimation (alpha > 0).
portfolio
ALS uses a matrix of characteristic weighted portfolios (Q) as well as a matrix of weights (W) and count of non-missing observations for each time period (val_obs) for the estimation.
See _build_portfolio for details on how these variables are formed from the initial X and y.
Currently, the bootstrap procedure is only implemented in terms of the portfolio data_type.
- Returns
- Return type
self
Notes
Updates the InstrumentedPCA instance to include the parameter estimates:
- Gamma (numpy array)
Array with dimensions (L, n_factors) containing the mapping between characteristics and factor loadings. If there are M pre-specified factors in the model, the matrix returned is of dimension (L, (n_factors+M)). If an intercept is included in the model, its loadings are returned in the last column of Gamma.
- Factors (numpy array)
Array with dimensions (n_factors, T) containing the estimated factors. If pre-specified factors were passed, the returned array is of dimension ((n_factors - M), T), corresponding to the n_factors - M factors estimated on top of the pre-specified ones.
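To make the alternating least squares idea concrete, here is a minimal sketch on the untransformed panel. It assumes the model y_{i,t} = x_{i,t}' Gamma f_t plus noise, with no intercept, no pre-specified factors and no regularization. The function als_sketch, the synthetic data and the omission of any normalization of Gamma and Factors are all simplifications for illustration; this is not the package's implementation.

```python
import numpy as np


def als_sketch(X, y, t_pos, n_factors, n_iter=25, seed=0):
    """Alternate between least-squares updates of Factors and Gamma."""
    rng = np.random.default_rng(seed)
    n_obs, L = X.shape
    T = int(t_pos.max()) + 1
    Gamma = rng.standard_normal((L, n_factors))  # random starting values
    F = np.zeros((n_factors, T))
    losses = []
    for _ in range(n_iter):
        # F step: per date, regress y_t on the instrumented loadings X_t @ Gamma.
        for t in range(T):
            rows = t_pos == t
            F[:, t] = np.linalg.lstsq(X[rows] @ Gamma, y[rows], rcond=None)[0]
        # Gamma step: regress y on the products x_{i,t,l} * f_{t,k};
        # the coefficient vector is Gamma flattened row by row.
        F_rows = F[:, t_pos].T
        Z = np.einsum("nl,nk->nlk", X, F_rows).reshape(n_obs, L * n_factors)
        Gamma = np.linalg.lstsq(Z, y, rcond=None)[0].reshape(L, n_factors)
        resid = y - np.einsum("nl,lk,nk->n", X, Gamma, F_rows)
        losses.append(float(resid @ resid))
    return Gamma, F, losses


# Synthetic balanced panel: N entities, T dates, L characteristics, K factors.
rng = np.random.default_rng(1)
N, T, L, K = 30, 12, 4, 2
t_pos = np.repeat(np.arange(T), N)  # time position of every row
X = rng.standard_normal((N * T, L))
Gamma_true = rng.standard_normal((L, K))
F_true = rng.standard_normal((K, T))
y = np.einsum("nl,lk,nk->n", X, Gamma_true, F_true[:, t_pos].T)
y = y + 0.1 * rng.standard_normal(N * T)

Gamma_hat, F_hat, losses = als_sketch(X, y, t_pos, n_factors=K)
```

Because each step solves an exact least-squares problem with the other block held fixed, the sum of squared residuals recorded in losses cannot increase from one iteration to the next. Gamma and Factors are only identified up to a rotation, which is why the sketch is checked through its fit rather than by comparing Gamma_hat with Gamma_true.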
-
fit_path(X, y, indices=None, PSF=None, alpha_l=None, n_splits=10, split_method=<class 'sklearn.model_selection._split.GroupKFold'>, n_jobs=1, backend='loky', **kwargs)
Fits a path of elastic net fits for various regularizing constants.
- Parameters
X (numpy array or pandas DataFrame) – matrix of characteristics where each row corresponds to an entity-time pair in indices. The number of characteristics (columns here) used as instruments is L.
If given as a DataFrame, we assume that it contains a MultiIndex mapping to each entity-time pair
y (numpy array or pandas Series) – dependent variable where indices correspond to those in X
If given as a Series, we assume that it contains a MultiIndex mapping to each entity-time pair
indices (numpy array, optional) – array containing the panel indices. Should consist of two columns:
Column 1: entity id (i)
Column 2: time index (t)
The panel may be unbalanced. The number of unique entities is n_samples, the number of unique dates is T, and the number of characteristics used as instruments is L.
PSF (numpy array, optional) – Set of pre-specified factors as matrix of dimension (M, T)
alpha_l (iterable, optional) – list of regularizing constants to use for path
n_splits (scalar) – number of CV partitions
split_method (sklearn cross-validation generator factory) – method to generate CV partitions
n_jobs (scalar) – number of jobs for parallel CV estimation
backend (str) – label for joblib backend
- Returns
cvmse – array of dim (P x (C + 1)) where P is the number of reg constants and C is the number of CV partitions
- Return type
numpy matrix
-
get_factors(label_ind=False)
Returns a tuple containing Gamma and Factors.
- Parameters
label_ind (bool) – If True, Gamma and Factors are returned as pandas DataFrames with index information applied
- Returns
Tuple containing Gamma and Factors
- Return type
tuple
-
predict(X=None, indices=None, W=None, mean_factor=False, data_type='panel', label_ind=False)
Wrapper around the data-type-specific predict methods.
- Parameters
X (numpy array or pandas DataFrame, optional) – matrix of characteristics where each row corresponds to an entity-time pair in indices. The number of characteristics (columns here) used as instruments is L.
If given as a DataFrame, we assume that it contains a MultiIndex mapping to each entity-time pair
If None we use the values associated with the current model
indices (numpy array, optional) – array containing the panel indices. Should consist of two columns:
Column 1: entity id (i)
Column 2: time index (t)
The panel may be unbalanced. The number of unique entities is n_samples, the number of unique dates is T, and the number of characteristics used as instruments is L.
If None we use the values associated with the current model
W (numpy array, optional) – portfolio weight matrix of dimension (L, L, T)
If None, we use the values associated with the current model
mean_factor (boolean) – If true, the estimated factors are averaged in the time-series before prediction.
data_type (str) – label for data-type used for prediction, one of the following:
panel
Uses the untransformed X and y for the estimation.
portfolio
Uses a matrix of characteristic weighted portfolios (Q) as well as a matrix of weights (W) and count of non-missing observations for each time period (val_obs) for the estimation.
See _build_portfolio for details on how these variables are formed from the initial X and y.
label_ind (bool) – whether to apply the indices to fitted values and return pandas Series
- Returns
The exact value returned depends on two things:
- The data_type
a. If panel data_type is specified, this will be a series of values for the panel ys
b. If portfolio data_type is specified, this will be a matrix of predicted characteristic-formed portfolios (Q)
- label_ind
If label_ind is True, we return pandas variants of the predicted values. If not, we return the underlying numpy arrays.
- Return type
numpy array or pandas DataFrame/Series
-
predictOOS(X=None, y=None, indices=None, mean_factor=False)
Predicts time t+1 observation using an out-of-sample design.
- Parameters
X (numpy array) – X of stacked data. Each row corresponds to an observation (i,t) where i denotes the entity index and t denotes the time index. All data must correspond to time t, i.e. all observations occur on the same date. If an observation contains missing data NaN will be returned. Note that the number of characteristics (L) passed, has to match the number of characteristics used when fitting the regressor. The columns of the panel are organized in the following order:
Column 1: entity id (i)
Column 2: time index (t)
Column 3 to column 2+L: characteristics.
y (numpy array) – dependent variable where indices correspond to those in X
indices (numpy array) – array containing the panel indices. Should consist of two columns:
Column 1: entity id (i)
Column 2: time index (t)
The panel may be unbalanced. The number of unique entities is n_samples, the number of unique dates is T, and the number of characteristics used as instruments is L.
mean_factor (boolean) – If true, the estimated factors are averaged in the time-series before prediction.
- Returns
ypred – The length of the returned array matches the length of data. A nan will be returned if characteristic information is missing.
- Return type
numpy array
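One plausible reading of an out-of-sample step is sketched below: with Gamma held fixed from an earlier fit, the factor realization for a single date is recovered from that date's cross-section and then used to form fitted values. This is an assumption about the design, not a transcription of predictOOS, and oos_sketch is a hypothetical name.

```python
import numpy as np


def oos_sketch(X_t, y_t, Gamma):
    """Estimate one date's factor realization with Gamma fixed, then predict."""
    n_obs = X_t.shape[0]
    W_t = X_t.T @ X_t / n_obs  # (L, L) second moment of the characteristics
    Q_t = X_t.T @ y_t / n_obs  # (L,) characteristic-weighted portfolio values
    f_t = np.linalg.solve(Gamma.T @ W_t @ Gamma, Gamma.T @ Q_t)
    return f_t, X_t @ Gamma @ f_t


# Noise-free cross-section for one date, so the factor is recovered exactly.
rng = np.random.default_rng(0)
X_t = rng.standard_normal((50, 3))
Gamma = rng.standard_normal((3, 2))
f_true = np.array([0.5, -1.0])
y_t = X_t @ Gamma @ f_true

f_hat, y_pred = oos_sketch(X_t, y_t, Gamma)
```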
-
predict_panel(X, indices, T, mean_factor=False)
Predicts fitted values for a previously fitted regressor + panel data
- Parameters
X (numpy array) – matrix of characteristics where each row corresponds to an entity-time pair in indices. The number of characteristics (columns here) used as instruments is L.
indices (numpy array) – array containing the panel indices. Should consist of two columns:
Column 1: entity id (i)
Column 2: time index (t)
The panel may be unbalanced. The number of unique entities is n_samples, the number of unique dates is T, and the number of characteristics used as instruments is L.
T (scalar) – number of time periods in X
mean_factor (boolean) – If true, the estimated factors are averaged in the time-series before prediction.
- Returns
ypred – The length of the returned array matches the length of data. A nan will be returned if characteristic information is missing.
- Return type
numpy array
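The panel prediction can be sketched as the product x_{i,t}' Gamma f_t for every row, with mean_factor replacing f_t by the time-series average of the factors. The function below is an illustrative stand-in rather than the package's method; time_pos, an integer time position per row, is an assumed input.

```python
import numpy as np


def predict_panel_sketch(X, time_pos, Gamma, Factors, mean_factor=False):
    """Fitted value x_{i,t}' Gamma f_t for every row of X."""
    if mean_factor:
        # Use the same time-averaged factor vector for every observation.
        F_rows = np.tile(Factors.mean(axis=1), (X.shape[0], 1))
    else:
        # Pick the factor column that matches each row's date.
        F_rows = Factors[:, time_pos].T
    return np.einsum("nl,lk,nk->n", X, Gamma, F_rows)


X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
time_pos = np.array([0, 1, 1])
Gamma = np.eye(2)
Factors = np.array([[1.0, 3.0], [2.0, 4.0]])  # shape (n_factors, T)

y_fit = predict_panel_sketch(X, time_pos, Gamma, Factors)
y_fit_mean = predict_panel_sketch(X, time_pos, Gamma, Factors, mean_factor=True)
```

A NaN in any characteristic of a row propagates through the products, so that row's fitted value is NaN, in line with the return description above.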
-
predict_portfolio(W, L, T, mean_factor=False)
Predicts fitted values for a previously fitted regressor + portfolios
- Parameters
W (numpy array) – portfolio weight matrix of dimension (L, L, T)
L (scalar) – number of characteristics
T (scalar) – number of time periods
mean_factor (boolean) – If true, the estimated factors are averaged in the time-series before prediction.
- Returns
Qpred – Same dimensions as the characteristic-formed portfolios (Q)
- Return type
numpy array
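The portfolio quantities named in this documentation (Q, W and val_obs) and the portfolio prediction can be sketched as follows. The construction assumes Q_t = X_t' y_t / N_t and W_t = X_t' X_t / N_t, as in the characteristic-managed portfolios of Kelly, Pruitt and Su; _build_portfolio may differ in detail, and both function names here are hypothetical.

```python
import numpy as np


def build_portfolio_sketch(X, y, time_pos, T):
    """Characteristic-weighted portfolios Q, weights W and counts val_obs."""
    L = X.shape[1]
    Q = np.zeros((L, T))
    W = np.zeros((L, L, T))
    val_obs = np.zeros(T)
    for t in range(T):
        rows = time_pos == t
        val_obs[t] = rows.sum()  # non-missing observations at date t
        Q[:, t] = X[rows].T @ y[rows] / val_obs[t]
        W[:, :, t] = X[rows].T @ X[rows] / val_obs[t]
    return Q, W, val_obs


def predict_portfolio_sketch(W, Gamma, Factors, mean_factor=False):
    """Predicted portfolios W_t Gamma f_t for every date, shape (L, T)."""
    T = W.shape[2]
    if mean_factor:
        F = np.tile(Factors.mean(axis=1, keepdims=True), (1, T))
    else:
        F = Factors
    return np.einsum("ijt,jk,kt->it", W, Gamma, F)


# Noise-free panel, so the predicted portfolios reproduce Q exactly.
rng = np.random.default_rng(0)
N, T, L, K = 20, 5, 3, 2
time_pos = np.repeat(np.arange(T), N)
X = rng.standard_normal((N * T, L))
Gamma = rng.standard_normal((L, K))
Factors = rng.standard_normal((K, T))
y = np.einsum("nl,lk,nk->n", X, Gamma, Factors[:, time_pos].T)

Q, W, val_obs = build_portfolio_sketch(X, y, time_pos, T)
Q_pred = predict_portfolio_sketch(W, Gamma, Factors)
```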
-
score(X, y=None, indices=None, mean_factor=False, data_type='panel')
Generates the R^2 of the fitted model.
- Parameters
X (numpy array or pandas DataFrame) – matrix of characteristics where each row corresponds to an entity-time pair in indices. The number of characteristics (columns here) used as instruments is L.
If given as a DataFrame, we assume that it contains a MultiIndex mapping to each entity-time pair
If None we use the values associated with the current model
y (numpy array or pandas Series, optional) – dependent variable where indices correspond to those in X
If given as a Series, we assume that it contains a MultiIndex mapping to each entity-time pair
indices (numpy array, optional) – array containing the panel indices. Should consist of two columns:
Column 1: entity id (i)
Column 2: time index (t)
The panel may be unbalanced. The number of unique entities is n_samples, the number of unique dates is T, and the number of characteristics used as instruments is L.
If None we use the values associated with the current model
mean_factor (boolean) – If true, the estimated factors are averaged in the time-series before prediction.
data_type (str) – label for data-type used for prediction, one of the following:
panel
Uses the untransformed X and y for the estimation.
portfolio
Uses a matrix of characteristic weighted portfolios (Q) as well as a matrix of weights (W) and count of non-missing observations for each time period (val_obs) for the estimation.
See _build_portfolio for details on how these variables are formed from the initial X and y.
label_ind (bool) – whether to apply the indices to fitted values and return pandas Series
- Returns
r2 – summary of model performance
- Return type
scalar
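For orientation, the sketch below computes an R^2 in the style of the total R^2 used by Kelly, Pruitt and Su, where the denominator is the uncentered sum of squared outcomes. Whether score uses exactly this definition is an assumption, and total_r2 is a hypothetical helper.

```python
import numpy as np


def total_r2(y, y_hat):
    """1 - SSR / sum(y^2) over rows where both values are available."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    valid = ~np.isnan(y) & ~np.isnan(y_hat)  # skip rows with missing predictions
    resid = y[valid] - y_hat[valid]
    return 1.0 - float(resid @ resid) / float(y[valid] @ y[valid])


r2 = total_r2([1.0, 2.0], [1.0, 0.0])
```

With this definition a prediction of zero for every observation scores 0, and a perfect fit scores 1.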