# Comprehensive Finance and Corporate KPIs Knowledge Base

## Table of Contents
1. [Foundational Finance Terms](#foundational-finance-terms)
2. [Core Corporate Finance KPIs and Metrics](#core-corporate-finance-kpis-and-metrics)
3. [Time-Based Financial Calculations](#time-based-financial-calculations)
4. [Financial Data Structure Best Practices](#financial-data-structure-best-practices)
5. [Advanced Financial Analytics Implementation](#advanced-financial-analytics-implementation)
6. [Implementation Guidelines and Best Practices](#implementation-guidelines-and-best-practices)

## 1. Foundational Finance Terms
<!-- metadata: category=finance_term, complexity=basic, implementation_type=both -->

### Investment Instruments

#### Stock (Equity)
**Definition**: A security representing ownership in a corporation, entitling the owner to a portion of the company's assets and earnings.

**SQL Implementation**:
```sql
-- Stock portfolio tracking
SELECT 
    ticker,
    shares_owned,
    purchase_price,
    current_price,
    shares_owned * current_price AS market_value,
    (current_price - purchase_price) * shares_owned AS unrealized_gain,
    ROUND(100.0 * (current_price - purchase_price) / purchase_price, 2) AS return_percentage
FROM stock_holdings;
```

**Python Implementation**:
```python
def calculate_stock_metrics(df):
    """Calculate key stock performance metrics"""
    df['market_value'] = df['shares'] * df['current_price']
    df['cost_basis'] = df['shares'] * df['purchase_price']
    df['unrealized_gain'] = df['market_value'] - df['cost_basis']
    df['return_pct'] = ((df['current_price'] - df['purchase_price']) / df['purchase_price']) * 100
    df['portfolio_weight'] = df['market_value'] / df['market_value'].sum()
    return df
```

#### Bond (Fixed Income)
**Definition**: A fixed-income instrument representing a loan made by an investor to a borrower. The borrower pays periodic interest (coupons) and repays the principal at maturity.

**SQL Implementation**:
```sql
-- Bond yield and duration calculations
SELECT 
    bond_id,
    face_value,
    coupon_rate,
    current_price,
    maturity_date,
    (coupon_rate * face_value) AS annual_coupon,
    ROUND((coupon_rate * face_value) / current_price * 100, 2) AS current_yield,
    DATEDIFF(maturity_date, CURRENT_DATE) / 365.0 AS years_to_maturity
FROM bond_holdings;
```

**Python Implementation**:
```python
def calculate_bond_metrics(face_value, coupon_rate, price, years_to_maturity):
    """Calculate bond yield and duration metrics"""
    annual_coupon = face_value * coupon_rate
    current_yield = (annual_coupon / price) * 100
    
    # Yield to maturity approximation
    ytm = (annual_coupon + (face_value - price) / years_to_maturity) / ((face_value + price) / 2)
    
    # Macaulay duration (assumes annual coupons and a whole number of years >= 1)
    n_periods = int(years_to_maturity)
    cash_flows = [annual_coupon] * (n_periods - 1) + [face_value + annual_coupon]
    periods = list(range(1, n_periods + 1))
    pv_cash_flows = [cf / (1 + ytm) ** t for cf, t in zip(cash_flows, periods)]
    duration = sum(t * pv for t, pv in zip(periods, pv_cash_flows)) / sum(pv_cash_flows)
    
    return {
        'current_yield': current_yield,
        'ytm': ytm * 100,
        'duration': duration
    }
```
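
The YTM in `calculate_bond_metrics` is a closed-form approximation. As a cross-check, the exact yield can be found numerically as the discount rate at which the present value of the cash flows equals the price. The sketch below assumes annual coupons and a whole number of years; `bond_price` and `solve_ytm` are illustrative names, and the bracketing bounds are arbitrary defaults.

```python
def bond_price(face_value, coupon_rate, years, annual_yield):
    """Present value of an annual-coupon bond at the given yield."""
    coupon = face_value * coupon_rate
    pv_coupons = sum(coupon / (1 + annual_yield) ** t for t in range(1, years + 1))
    pv_face = face_value / (1 + annual_yield) ** years
    return pv_coupons + pv_face


def solve_ytm(face_value, coupon_rate, price, years, lo=-0.5, hi=1.0, tol=1e-8):
    """Find the yield that reprices the bond, using bisection.

    Price falls as yield rises, so the root is bracketed whenever
    bond_price(hi) <= price <= bond_price(lo).
    """
    price_at_hi = bond_price(face_value, coupon_rate, years, hi)
    price_at_lo = bond_price(face_value, coupon_rate, years, lo)
    if not price_at_hi <= price <= price_at_lo:
        raise ValueError("price is outside the bracketed yield range")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bond_price(face_value, coupon_rate, years, mid) > price:
            lo = mid  # model price too high, so the yield must be higher
        else:
            hi = mid
    return (lo + hi) / 2


# A bond priced at par should yield its coupon rate
par_ytm = solve_ytm(face_value=1000, coupon_rate=0.05, price=1000, years=10)
```

Bisection is slower than Newton's method but needs no derivative and cannot diverge once the root is bracketed, which makes it a reasonable default for a reference implementation.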

#### ETF (Exchange-Traded Fund)
**Definition**: A security that tracks an index, sector, commodity, or other asset, tradeable on exchanges like stocks.

**SQL Implementation**:
```sql
-- ETF performance tracking with expense ratio impact
SELECT 
    etf_symbol,
    nav,
    market_price,
    expense_ratio,
    -- Signed: positive = premium to NAV, negative = discount
    (market_price - nav) / NULLIF(nav, 0) * 100 AS premium_discount,
    total_return - expense_ratio AS net_return
FROM etf_data;
```
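
A possible Python counterpart to the ETF query is sketched below. It computes a signed premium/discount so that premiums and discounts can be told apart. The column names follow the SQL example; the function name `etf_premium_discount` and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd


def etf_premium_discount(df):
    """Add a signed premium/discount column, in percent of NAV."""
    out = df.copy()
    nav = out['nav'].replace(0, np.nan)  # avoid division by zero
    # Positive = trading at a premium to NAV, negative = at a discount
    out['premium_discount_pct'] = (out['market_price'] - nav) / nav * 100
    return out


# Made-up sample data
etfs = pd.DataFrame({
    'etf_symbol': ['AAA', 'BBB'],
    'nav': [100.0, 50.0],
    'market_price': [101.0, 49.5],
})
etf_result = etf_premium_discount(etfs)
```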

### Market Metrics

#### P/E Ratio (Price-to-Earnings Ratio)
**Definition**: Valuation ratio comparing a company's share price to its per-share earnings.

**Formula**: `P/E Ratio = Market Price per Share ÷ Earnings per Share`

**SQL Implementation**:
```sql
-- P/E ratio calculation with sector comparison
WITH pe_calculations AS (
    SELECT 
        ticker,
        sector,
        current_price,
        eps_ttm,
        ROUND(current_price / NULLIF(eps_ttm, 0), 2) AS pe_ratio,
        AVG(current_price / NULLIF(eps_ttm, 0)) OVER (PARTITION BY sector) AS sector_avg_pe
    FROM stock_fundamentals
    WHERE eps_ttm > 0
)
SELECT 
    *,
    CASE 
        WHEN pe_ratio < sector_avg_pe * 0.8 THEN 'Undervalued'
        WHEN pe_ratio > sector_avg_pe * 1.2 THEN 'Overvalued'
        ELSE 'Fair Value'
    END AS valuation_assessment
FROM pe_calculations;
```

**Python Implementation**:
```python
import numpy as np

def analyze_pe_ratios(df):
    """Analyze P/E ratios with sector comparisons"""
    # Calculate P/E ratio
    df['pe_ratio'] = df['price'] / df['eps_ttm'].replace(0, np.nan)
    
    # Calculate sector averages
    df['sector_avg_pe'] = df.groupby('sector')['pe_ratio'].transform('mean')
    df['pe_vs_sector'] = df['pe_ratio'] / df['sector_avg_pe']
    
    # PEG ratio (P/E to growth)
    df['peg_ratio'] = df['pe_ratio'] / df['earnings_growth_rate']
    
    # Forward P/E
    df['forward_pe'] = df['price'] / df['eps_forecast']
    
    return df
```

#### Market Capitalization
**Definition**: Total dollar market value of a company's outstanding shares.

**Formula**: `Market Cap = Share Price × Total Outstanding Shares`

**SQL Implementation**:
```sql
-- Market cap calculation with size categorization
SELECT 
    ticker,
    current_price * shares_outstanding AS market_cap,
    CASE 
        WHEN current_price * shares_outstanding >= 200000000000 THEN 'Mega Cap'
        WHEN current_price * shares_outstanding >= 10000000000 THEN 'Large Cap'
        WHEN current_price * shares_outstanding >= 2000000000 THEN 'Mid Cap'
        WHEN current_price * shares_outstanding >= 300000000 THEN 'Small Cap'
        ELSE 'Micro Cap'
    END AS market_cap_category
FROM stock_data;
```
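
The same size buckets can be reproduced in pandas. This is a minimal sketch that assumes a DataFrame with `current_price` and `shares_outstanding` columns, as in the SQL query; `categorize_market_cap` is an illustrative name and the sample rows are invented.

```python
import numpy as np
import pandas as pd


def categorize_market_cap(df):
    """Add market_cap and market_cap_category columns."""
    out = df.copy()
    out['market_cap'] = out['current_price'] * out['shares_outstanding']
    # Bin edges mirror the SQL CASE thresholds; right=False makes each
    # interval closed on the left, matching the >= comparisons
    bins = [-np.inf, 300e6, 2e9, 10e9, 200e9, np.inf]
    labels = ['Micro Cap', 'Small Cap', 'Mid Cap', 'Large Cap', 'Mega Cap']
    out['market_cap_category'] = pd.cut(
        out['market_cap'], bins=bins, labels=labels, right=False
    )
    return out


# Made-up sample data
stocks = pd.DataFrame({
    'ticker': ['TINY', 'BIG', 'HUGE'],
    'current_price': [1.0, 10.0, 300.0],
    'shares_outstanding': [100e6, 1e9, 1e9],
})
cap_result = categorize_market_cap(stocks)
```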

### Market Conditions

#### Bull Market / Bear Market
**Definitions**: 
- Bull Market: Rising prices with optimism (typically 20%+ gain from recent low)
- Bear Market: Falling prices with pessimism (typically 20%+ decline from recent high)

**SQL Implementation**:
```sql
-- Market regime detection
WITH market_levels AS (
    SELECT 
        index_date,
        index_value,
        -- 252-row window (about one trading year), matching the Python version
        MAX(index_value) OVER (ORDER BY index_date ROWS BETWEEN 251 PRECEDING AND CURRENT ROW) AS rolling_high,
        MIN(index_value) OVER (ORDER BY index_date ROWS BETWEEN 251 PRECEDING AND CURRENT ROW) AS rolling_low
    FROM market_index
)
SELECT 
    index_date,
    index_value,
    CASE 
        WHEN index_value >= rolling_low * 1.20 THEN 'Bull Market'
        WHEN index_value <= rolling_high * 0.80 THEN 'Bear Market'
        ELSE 'Neutral'
    END AS market_regime,
    ROUND(100.0 * (index_value - rolling_high) / rolling_high, 2) AS drawdown_pct
FROM market_levels;
```

**Python Implementation**:
```python
import numpy as np

def identify_market_regimes(df):
    """Identify bull/bear market regimes"""
    # Calculate rolling highs and lows
    df['rolling_high'] = df['index_value'].rolling(window=252).max()
    df['rolling_low'] = df['index_value'].rolling(window=252).min()
    
    # Calculate drawdown
    df['drawdown'] = (df['index_value'] - df['rolling_high']) / df['rolling_high'] * 100
    
    # Identify regime
    conditions = [
        df['index_value'] >= df['rolling_low'] * 1.20,
        df['index_value'] <= df['rolling_high'] * 0.80
    ]
    choices = ['Bull Market', 'Bear Market']
    df['market_regime'] = np.select(conditions, choices, default='Neutral')
    
    # Duration of current regime
    df['regime_change'] = df['market_regime'] != df['market_regime'].shift()
    df['regime_duration'] = df.groupby(df['regime_change'].cumsum()).cumcount() + 1
    
    return df
```

### Trading Concepts

#### Liquidity
**Definition**: The degree to which an asset can be quickly bought or sold at a price reflecting its intrinsic value.

**SQL Implementation**:
```sql
-- Liquidity metrics calculation
SELECT 
    ticker,
    AVG(daily_volume) AS avg_daily_volume,
    AVG(daily_volume * close_price) AS avg_daily_dollar_volume,
    AVG(bid_ask_spread) AS avg_spread,
    COUNT(DISTINCT trading_date) AS days_traded,
    STDDEV(daily_return) AS volatility,
    -- Liquidity score (higher is better)
    LOG10(AVG(daily_volume * close_price)) - (AVG(bid_ask_spread) * 100) AS liquidity_score
FROM market_data
WHERE trading_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
GROUP BY ticker;
```

**Python Implementation**:
```python
import numpy as np

def calculate_liquidity_metrics(df):
    """Calculate comprehensive liquidity metrics"""
    liquidity_metrics = df.groupby('ticker').agg({
        'volume': ['mean', 'std'],
        'dollar_volume': 'mean',
        'bid_ask_spread': 'mean',
        'trading_days': 'count'
    })
    
    # Amihud illiquidity measure
    df['price_impact'] = abs(df['return']) / df['dollar_volume']
    amihud_illiquidity = df.groupby('ticker')['price_impact'].mean()
    
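    # Roll's implicit spread estimator (defined only when serial covariance is negative)
    df['return_lag'] = df.groupby('ticker')['return'].shift(1)

    def _roll_spread(x):
        pairs = x[['return', 'return_lag']].dropna()  # first row per ticker has no lag
        if len(pairs) < 2:
            return np.nan
        serial_cov = np.cov(pairs['return'], pairs['return_lag'])[0, 1]
        return 2 * np.sqrt(-serial_cov) if serial_cov < 0 else np.nan

    roll_spread = df.groupby('ticker').apply(_roll_spread)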
    
    return liquidity_metrics, amihud_illiquidity, roll_spread
```

## 2. Core Corporate Finance KPIs and Metrics

### EBITDA (Earnings Before Interest, Taxes, Depreciation, and Amortization)
<!-- metadata: category=kpi, subcategory=profitability, complexity=intermediate, implementation_type=both, calculation_type=aggregate -->

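**Definition**: EBITDA measures a company's operating profitability before interest, taxes, and the non-cash charges of depreciation and amortization. It is commonly used as a rough proxy for pre-tax operating cash flow from core business activities, although it ignores changes in working capital and capital expenditure.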

**Formulas**:
- Primary: `EBITDA = Net Income + Interest + Taxes + Depreciation + Amortization`
- Alternative: `EBITDA = Operating Income + Depreciation + Amortization`

**SQL Implementation**:
```sql
SELECT 
    company_id,
    reporting_period,
    net_income + interest_expense + tax_expense + depreciation + amortization AS ebitda,
    ROUND(100.0 * (net_income + interest_expense + tax_expense + depreciation + amortization) / NULLIF(revenue, 0), 2) AS ebitda_margin
FROM income_statement
WHERE reporting_period > '2023-01-01';
```

**Python Implementation**:
```python
def calculate_ebitda(df):
    """Calculate EBITDA and EBITDA margin"""
    df['ebitda'] = df['net_income'] + df['interest_expense'] + df['tax_expense'] + df['depreciation'] + df['amortization']
    df['ebitda_margin'] = (df['ebitda'] / df['revenue']) * 100
    return df

# Usage example (financial_data is assumed to be a DataFrame with the columns used above)
financial_data = calculate_ebitda(financial_data)
```

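**Industry Benchmarks**: As a rough rule of thumb, EBITDA margins below 10% are a warning sign, 10-20% are solid, and above 20% are strong. These thresholds vary significantly by industry (capital-intensive vs. service-based), so compare against sector peers rather than absolute cut-offs.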

### Return on Equity (ROE)
<!-- metadata: category=kpi, subcategory=profitability, complexity=intermediate, implementation_type=both, calculation_type=ratio -->

**Definition**: ROE measures profitability relative to shareholders' equity, indicating how efficiently management uses shareholders' capital to generate profits.

**Formula**: `ROE = Net Income ÷ Average Shareholders' Equity × 100%`

**DuPont Analysis**: `ROE = (Net Income ÷ Sales) × (Sales ÷ Assets) × (Assets ÷ Equity)`
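
A short worked example may help show why the DuPont identity holds: sales and assets cancel out, leaving net income over equity. The figures below are invented, and `dupont_decomposition` is an illustrative helper. Note that the identity uses a single equity figure, whereas the main ROE formula uses average equity, so the two results can differ slightly in practice.

```python
def dupont_decomposition(net_income, sales, assets, equity):
    """Split ROE into margin, turnover, and leverage components."""
    profit_margin = net_income / sales
    asset_turnover = sales / assets
    equity_multiplier = assets / equity
    return {
        'profit_margin': profit_margin,
        'asset_turnover': asset_turnover,
        'equity_multiplier': equity_multiplier,
        'roe': profit_margin * asset_turnover * equity_multiplier,
    }


# Made-up figures for illustration
parts = dupont_decomposition(net_income=50.0, sales=500.0, assets=1000.0, equity=250.0)
direct_roe = 50.0 / 250.0  # net income / equity, computed directly
```

Here the margin is 10%, asset turnover is 0.5, and the equity multiplier is 4, which multiply to the same 20% obtained directly.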

**SQL Implementation**:
```sql
WITH quarterly_roe AS (
    SELECT 
        company_id,
        reporting_quarter,
        net_income,
        shareholders_equity,
        LAG(shareholders_equity) OVER (PARTITION BY company_id ORDER BY reporting_quarter) AS prev_equity,
        ROUND(100.0 * net_income / NULLIF((shareholders_equity + LAG(shareholders_equity) OVER (PARTITION BY company_id ORDER BY reporting_quarter)) / 2, 0), 2) AS roe
    FROM financial_statements
)
SELECT * FROM quarterly_roe WHERE roe IS NOT NULL;
```

**Python Implementation**:
```python
def calculate_roe(df):
    """Calculate Return on Equity with DuPont decomposition"""
    # Sort by company and date for proper lag calculation
    df = df.sort_values(['company', 'date'])
    
    # Calculate average equity
    df['prev_equity'] = df.groupby('company')['shareholders_equity'].shift(1)
    df['avg_equity'] = (df['shareholders_equity'] + df['prev_equity']) / 2
    
    # Calculate ROE
    df['roe'] = (df['net_income'] / df['avg_equity']) * 100
    
    # DuPont analysis (uses period-end balances, so roe_dupont can differ
    # slightly from the average-equity roe above)
    df['profit_margin'] = df['net_income'] / df['revenue']
    df['asset_turnover'] = df['revenue'] / df['total_assets'] 
    df['equity_multiplier'] = df['total_assets'] / df['shareholders_equity']
    df['roe_dupont'] = df['profit_margin'] * df['asset_turnover'] * df['equity_multiplier'] * 100
    
    return df
```

**Benchmarks**: The S&P 500 average is roughly 12-15%; an ROE of 15-20% is generally considered strong, and above 20% is exceptional. Technology companies often reach 15-25% or more.

### Current Ratio and Liquidity Metrics

**Definition**: The current ratio measures a company's ability to pay short-term obligations using current assets. The quick ratio excludes inventory for a more conservative liquidity assessment.

**Formulas**:
- Current Ratio: `Current Assets ÷ Current Liabilities`
- Quick Ratio: `(Current Assets - Inventory) ÷ Current Liabilities`
- Cash Ratio: `(Cash + Short-term Investments) ÷ Current Liabilities`

**SQL Implementation**:
```sql
SELECT 
    company_id,
    reporting_date,
    ROUND(current_assets / NULLIF(current_liabilities, 0), 2) AS current_ratio,
    ROUND((current_assets - inventory) / NULLIF(current_liabilities, 0), 2) AS quick_ratio,
    ROUND((cash + short_term_investments) / NULLIF(current_liabilities, 0), 2) AS cash_ratio,
    CASE 
        -- A NULL ratio (missing data or zero liabilities) must not fall through to 'Excellent'
        WHEN current_assets / NULLIF(current_liabilities, 0) IS NULL THEN 'Unknown'
        WHEN current_assets / NULLIF(current_liabilities, 0) < 1.0 THEN 'Poor Liquidity'
        WHEN current_assets / NULLIF(current_liabilities, 0) <= 1.5 THEN 'Adequate'
        WHEN current_assets / NULLIF(current_liabilities, 0) <= 3.0 THEN 'Good'
        ELSE 'Excellent'
    END AS liquidity_assessment
FROM balance_sheet;
```

**Python Implementation**:
```python
import numpy as np

def calculate_liquidity_ratios(df):
    """Calculate comprehensive liquidity ratios"""
    # Avoid division by zero
    df['current_liabilities'] = df['current_liabilities'].replace(0, np.nan)
    
    # Calculate ratios
    df['current_ratio'] = df['current_assets'] / df['current_liabilities']
    df['quick_ratio'] = (df['current_assets'] - df['inventory']) / df['current_liabilities']
    df['cash_ratio'] = (df['cash'] + df['short_term_investments']) / df['current_liabilities']
    
    # Liquidity assessment
    conditions = [
        df['current_ratio'] < 1.0,
        df['current_ratio'].between(1.0, 1.5),
        df['current_ratio'].between(1.5, 3.0),
        df['current_ratio'] >= 3.0
    ]
    choices = ['Poor Liquidity', 'Adequate', 'Good', 'Excellent']
    df['liquidity_assessment'] = np.select(conditions, choices, default='Unknown')
    
    return df
```

**Benchmarks**: Ideal current ratio 1.5-3.0, quick ratio should be >1.0, cash ratio varies by industry but >0.2 is generally healthy.

## 3. Time-Based Financial Calculations

### Year-to-Date (YTD) Calculations
<!-- metadata: category=calculation, subcategory=time_series, complexity=intermediate, implementation_type=both, data_frequency=daily -->

**Definition**: YTD represents cumulative performance from the beginning of the fiscal year through the current date, enabling progress tracking toward annual targets.

**Formula**: `YTD Value = Sum of all values from start of fiscal year to current date`

**SQL Implementation**:
```sql
-- YTD calculations with fiscal year considerations
WITH fiscal_ytd AS (
    SELECT 
        company_id,
        transaction_date,
        amount,
        -- Fiscal year calculation (April start)
        CASE 
            WHEN MONTH(transaction_date) >= 4 
            THEN YEAR(transaction_date)
            ELSE YEAR(transaction_date) - 1
        END AS fiscal_year,
        -- YTD calculation
        SUM(amount) OVER (
            PARTITION BY company_id, 
                        CASE WHEN MONTH(transaction_date) >= 4 
                             THEN YEAR(transaction_date)
                             ELSE YEAR(transaction_date) - 1 END
            ORDER BY transaction_date 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS ytd_amount
    FROM financial_transactions
)
SELECT 
    company_id,
    fiscal_year,
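    -- SUM(amount) rather than MAX(ytd_amount): the running total is not
    -- monotonic when amounts can be negative (refunds, reversals)
    SUM(amount) AS fiscal_ytd_total,
    ROUND(100.0 * (SUM(amount) - LAG(SUM(amount)) OVER (PARTITION BY company_id ORDER BY fiscal_year)) / 
          NULLIF(LAG(SUM(amount)) OVER (PARTITION BY company_id ORDER BY fiscal_year), 0), 2) AS ytd_growth_rate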
FROM fiscal_ytd
GROUP BY company_id, fiscal_year;
```

**Python Implementation**:
```python
import pandas as pd

def calculate_ytd_metrics(df, fiscal_year_start_month=1):
    """Calculate YTD metrics with flexible fiscal year"""
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'])
    
    # Calculate fiscal year
    df['fiscal_year'] = df['date'].dt.year
    mask = df['date'].dt.month < fiscal_year_start_month
    df.loc[mask, 'fiscal_year'] = df.loc[mask, 'fiscal_year'] - 1
    
    # Sort for proper cumulative calculation
    df = df.sort_values(['company', 'date'])
    
    # Calculate YTD
    df['ytd_revenue'] = df.groupby(['company', 'fiscal_year'])['revenue'].cumsum()
    df['ytd_expenses'] = df.groupby(['company', 'fiscal_year'])['expenses'].cumsum()
    df['ytd_profit'] = df['ytd_revenue'] - df['ytd_expenses']
    
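    # YTD growth rate vs. the same calendar date in the previous available year.
    # Matching on month-day avoids the leap-year drift of dayofyear; this assumes
    # each date appears in consecutive years (Feb 29 will not).
    same_day = df['date'].dt.strftime('%m-%d')
    df['prev_year_ytd'] = df.groupby(['company', same_day])['ytd_revenue'].shift(1)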
    df['ytd_growth_rate'] = ((df['ytd_revenue'] - df['prev_year_ytd']) / df['prev_year_ytd']) * 100
    
    return df

# Usage example
ytd_data = calculate_ytd_metrics(financial_data, fiscal_year_start_month=4)
```

### Year-over-Year (YoY) Growth Analysis
<!-- metadata: category=calculation, subcategory=time_series, complexity=intermediate, implementation_type=both, calculation_type=growth, data_frequency=quarterly -->

**Definition**: YoY growth measures the percentage change between the current period and the same period of the previous year, which removes most seasonal effects.

**Formula**: `YoY Growth = ((Current Period - Prior Year Period) / Prior Year Period) × 100`

**SQL Implementation**:
```sql
-- Comprehensive YoY analysis with multiple periods
WITH period_comparisons AS (
    SELECT 
        company_id,
        reporting_period,
        revenue,
        net_income,
        -- YoY comparisons
        LAG(revenue, 4) OVER (PARTITION BY company_id ORDER BY reporting_period) AS revenue_4q_ago,
        LAG(net_income, 4) OVER (PARTITION BY company_id ORDER BY reporting_period) AS income_4q_ago,
        -- QoQ comparisons
        LAG(revenue, 1) OVER (PARTITION BY company_id ORDER BY reporting_period) AS revenue_1q_ago,
        -- Calculate growth rates
        ROUND(100.0 * (revenue - LAG(revenue, 4) OVER (PARTITION BY company_id ORDER BY reporting_period)) / 
              NULLIF(LAG(revenue, 4) OVER (PARTITION BY company_id ORDER BY reporting_period), 0), 2) AS revenue_yoy_growth,
        ROUND(100.0 * (revenue - LAG(revenue, 1) OVER (PARTITION BY company_id ORDER BY reporting_period)) / 
              NULLIF(LAG(revenue, 1) OVER (PARTITION BY company_id ORDER BY reporting_period), 0), 2) AS revenue_qoq_growth
    FROM quarterly_financials
)
SELECT * FROM period_comparisons WHERE revenue_4q_ago IS NOT NULL;
```

**Python Implementation**:
```python
def calculate_growth_rates(df):
    """Calculate comprehensive growth rates (YoY, QoQ, etc.)"""
    df = df.sort_values(['company', 'date'])
    
    # YoY growth (12 months/4 quarters)
    df['revenue_yoy'] = df.groupby('company')['revenue'].pct_change(periods=4) * 100
    df['profit_yoy'] = df.groupby('company')['net_income'].pct_change(periods=4) * 100
    
    # QoQ growth (1 quarter)
    df['revenue_qoq'] = df.groupby('company')['revenue'].pct_change(periods=1) * 100
    
    # Rolling 12-month growth
    df['revenue_ttm'] = df.groupby('company')['revenue'].rolling(window=4).sum().reset_index(0, drop=True)
    df['revenue_ttm_growth'] = df.groupby('company')['revenue_ttm'].pct_change(periods=4) * 100
    
    # Growth acceleration/deceleration
    df['growth_acceleration'] = df.groupby('company')['revenue_yoy'].diff()
    
    return df

# Advanced growth analysis with seasonal adjustment
def seasonal_adjusted_growth(df, seasonal_periods=4):
    """Calculate seasonally adjusted YoY growth"""
    from statsmodels.tsa.seasonal import seasonal_decompose
    
    results = {}
    for company in df['company'].unique():
        company_data = df[df['company'] == company].set_index('date')['revenue']
        
        # Seasonal decomposition
        decomposition = seasonal_decompose(company_data, model='additive', period=seasonal_periods)
        
        # Seasonally adjusted data
        seasonally_adjusted = company_data - decomposition.seasonal
        
        # Calculate YoY growth on adjusted data
        adjusted_growth = seasonally_adjusted.pct_change(periods=seasonal_periods) * 100
        
        results[company] = adjusted_growth
    
    return results
```

### Trailing Twelve Months (TTM) Calculations

**Definition**: TTM represents performance over the most recent 12-month period regardless of fiscal year boundaries, providing a current, annualized view of performance.

**Formula**: `TTM = Latest Fiscal Year + Current YTD - Prior Year YTD`
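
The formula is useful when only annual and year-to-date figures are published. The worked example below uses invented quarterly revenue to show that it gives the same result as summing the last four quarters directly; all variable names are illustrative.

```python
# Made-up quarterly revenue: four quarters of the last full fiscal year,
# followed by the first two quarters of the current fiscal year
prior_fy_quarters = [100.0, 110.0, 120.0, 130.0]
current_fy_quarters = [115.0, 125.0]

latest_fiscal_year = sum(prior_fy_quarters)
current_ytd = sum(current_fy_quarters)
# Same two quarters of the prior year, so the periods line up
prior_year_ytd = sum(prior_fy_quarters[:len(current_fy_quarters)])

# TTM = Latest Fiscal Year + Current YTD - Prior Year YTD
ttm_formula = latest_fiscal_year + current_ytd - prior_year_ytd

# Cross-check: sum of the four most recent quarters
ttm_rolling = sum((prior_fy_quarters + current_fy_quarters)[-4:])
```

Both approaches give 490 here: the formula adds the two new quarters and removes the two oldest ones, which is exactly what a four-quarter rolling window does.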

**SQL Implementation**:
```sql
-- Comprehensive TTM calculations
WITH ttm_calculations AS (
    SELECT 
        company_id,
        reporting_date,
        revenue,
        net_income,
        -- TTM using window function
        SUM(revenue) OVER (
            PARTITION BY company_id 
            ORDER BY reporting_date 
            ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
        ) AS ttm_revenue,
        SUM(net_income) OVER (
            PARTITION BY company_id 
            ORDER BY reporting_date 
            ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
        ) AS ttm_net_income,
        -- TTM ratios
        AVG(roa) OVER (
            PARTITION BY company_id 
            ORDER BY reporting_date 
            ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
        ) AS ttm_avg_roa
    FROM monthly_financials
),
ttm_with_growth AS (
    SELECT 
        *,
        LAG(ttm_revenue, 12) OVER (PARTITION BY company_id ORDER BY reporting_date) AS ttm_revenue_12m_ago,
        ROUND(100.0 * (ttm_revenue - LAG(ttm_revenue, 12) OVER (PARTITION BY company_id ORDER BY reporting_date)) / 
              NULLIF(LAG(ttm_revenue, 12) OVER (PARTITION BY company_id ORDER BY reporting_date), 0), 2) AS ttm_revenue_growth
    FROM ttm_calculations
)
SELECT * FROM ttm_with_growth WHERE ttm_revenue_12m_ago IS NOT NULL;
```

**Python Implementation**:
```python
import pandas as pd

def calculate_ttm_metrics(df):
    """Calculate trailing twelve months metrics"""
    df = df.sort_values(['company', 'date'])
    
    # TTM calculations using rolling windows
    df['ttm_revenue'] = df.groupby('company')['revenue'].rolling(window=12).sum().reset_index(0, drop=True)
    df['ttm_net_income'] = df.groupby('company')['net_income'].rolling(window=12).sum().reset_index(0, drop=True)
    df['ttm_ebitda'] = df.groupby('company')['ebitda'].rolling(window=12).sum().reset_index(0, drop=True)
    
    # TTM growth rates
    df['ttm_revenue_growth'] = df.groupby('company')['ttm_revenue'].pct_change(periods=12) * 100
    
    # TTM ratios
    df['ttm_profit_margin'] = (df['ttm_net_income'] / df['ttm_revenue']) * 100
    df['ttm_roe'] = (df['ttm_net_income'] / df['shareholders_equity']) * 100
    
    # TTM per share metrics
    df['ttm_eps'] = df['ttm_net_income'] / df['shares_outstanding']
    df['ttm_pe_ratio'] = df['stock_price'] / df['ttm_eps']
    
    return df

# Alternative TTM calculation method
def calculate_ttm_alternative(df, current_date):
    """Alternative TTM calculation using specific date ranges"""
    end_date = pd.to_datetime(current_date)
    start_date = end_date - pd.DateOffset(months=12)
    
    ttm_data = df[(df['date'] > start_date) & (df['date'] <= end_date)]
    
    ttm_summary = ttm_data.groupby('company').agg({
        'revenue': 'sum',
        'net_income': 'sum',
        'ebitda': 'sum',
        'free_cash_flow': 'sum'
    }).add_prefix('ttm_')
    
    return ttm_summary
```

### Compound Annual Growth Rate (CAGR)

**Definition**: CAGR represents the annualized rate of growth assuming steady compounding, smoothing year-to-year volatility.

**Formula**: `CAGR = (Ending Value / Beginning Value)^(1/Number of Years) - 1`
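
As a quick sanity check of the formula, the scalar sketch below computes the CAGR for a value that doubles over five years. The function name `cagr` and the input figures are illustrative only.

```python
def cagr(beginning_value, ending_value, years):
    """Compound annual growth rate as a decimal (0.10 means 10% a year)."""
    if beginning_value <= 0 or ending_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (ending_value / beginning_value) ** (1 / years) - 1


# A value that doubles over five years grows at roughly 14.9% a year
growth = cagr(beginning_value=100.0, ending_value=200.0, years=5)

# Compounding the result forward should recover the ending value
recovered = 100.0 * (1 + growth) ** 5
```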

**SQL Implementation**:
```sql
-- CAGR calculation using POWER function
WITH cagr_analysis AS (
    SELECT 
        company_id,
        MIN(reporting_date) AS start_date,
        MAX(reporting_date) AS end_date,
        MIN(CASE WHEN reporting_date = (SELECT MIN(reporting_date) FROM financials f2 WHERE f2.company_id = f1.company_id) 
            THEN revenue END) AS starting_revenue,
        MAX(CASE WHEN reporting_date = (SELECT MAX(reporting_date) FROM financials f2 WHERE f2.company_id = f1.company_id) 
            THEN revenue END) AS ending_revenue,
        -- MySQL syntax, consistent with the other queries in this document
        TIMESTAMPDIFF(YEAR, MIN(reporting_date), MAX(reporting_date)) AS years_span
    FROM financials f1
    GROUP BY company_id
    HAVING COUNT(*) > 1
),
cagr_results AS (
    SELECT 
        company_id,
        starting_revenue,
        ending_revenue,
        years_span,
        ROUND(100.0 * (POWER(ending_revenue / NULLIF(starting_revenue, 0), 1.0 / NULLIF(years_span, 0)) - 1), 2) AS revenue_cagr,
        -- Future value projection
        ROUND(ending_revenue * POWER(1 + (POWER(ending_revenue / NULLIF(starting_revenue, 0), 1.0 / NULLIF(years_span, 0)) - 1), 3), 0) AS projected_revenue_3yr
    FROM cagr_analysis
    WHERE starting_revenue > 0 AND ending_revenue > 0 AND years_span > 0
)
SELECT * FROM cagr_results ORDER BY revenue_cagr DESC;
```

**Python Implementation**:
```python
import numpy as np
import pandas as pd

def calculate_cagr(df, value_column='revenue', date_column='date', periods_per_year=4):
    """Calculate CAGR for specified metrics"""
    results = {}
    
    for company in df['company'].unique():
        company_data = df[df['company'] == company].sort_values(date_column)
        
        if len(company_data) < 2:
            continue
            
        # Get start and end values
        start_value = company_data[value_column].iloc[0]
        end_value = company_data[value_column].iloc[-1]
        
        # Calculate time span in years
        start_date = pd.to_datetime(company_data[date_column].iloc[0])
        end_date = pd.to_datetime(company_data[date_column].iloc[-1])
        years = (end_date - start_date).days / 365.25
        
        # CAGR is undefined for non-positive start or end values
        if start_value > 0 and end_value > 0 and years > 0:
            cagr = (end_value / start_value) ** (1/years) - 1
            results[company] = {
                'cagr': cagr * 100,
                'start_value': start_value,
                'end_value': end_value,
                'years': years
            }
    
    return pd.DataFrame(results).T

# Rolling CAGR calculation
def rolling_cagr(df, window_years=5, periods_per_year=4):
    """Calculate rolling CAGR over specified window"""
    df = df.sort_values(['company', 'date'])
    # +1 so the first and last observations are exactly window_years apart
    window_periods = window_years * periods_per_year + 1
    
    def calc_period_cagr(series):
        if len(series) < 2 or series.iloc[0] <= 0:
            return np.nan
        return ((series.iloc[-1] / series.iloc[0]) ** (1/window_years) - 1) * 100
    
    df['rolling_cagr'] = df.groupby('company')['revenue'].rolling(
        window=window_periods
    ).apply(calc_period_cagr).reset_index(0, drop=True)
    
    return df
```

## 4. Financial Data Structure Best Practices

### Recommended Database Schema

**Core Financial Tables**:
```sql
-- Double-entry accounting foundation
CREATE TABLE transactions (
    id BIGINT PRIMARY KEY,
    document_id BIGINT,
    transaction_date DATE NOT NULL,
    description VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_trans_date (transaction_date),
    INDEX idx_document (document_id)
);

CREATE TABLE ledger_entries (
    id BIGINT PRIMARY KEY,
    transaction_id BIGINT NOT NULL,
    account_id BIGINT NOT NULL,
    entry_type CHAR(1) CHECK (entry_type IN ('D', 'C')),
    amount DECIMAL(15,2) NOT NULL,
    person_id BIGINT,
    cost_center_id BIGINT,
    FOREIGN KEY (transaction_id) REFERENCES transactions(id),
    -- transaction_date lives on transactions, so index the join key here instead
    INDEX idx_account_txn (account_id, transaction_id)
);

CREATE TABLE chart_of_accounts (
    id BIGINT PRIMARY KEY,
    account_code VARCHAR(20) UNIQUE,
    account_name VARCHAR(100) NOT NULL,
    account_type ENUM('Asset', 'Liability', 'Equity', 'Revenue', 'Expense'),
    parent_account_id BIGINT,
    is_active BOOLEAN DEFAULT TRUE,
    INDEX idx_account_type (account_type),
    INDEX idx_parent (parent_account_id)
);
```
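
In a double-entry ledger, the debits and credits of every transaction should net to zero. One way to check this on data loaded from `ledger_entries` is sketched below; it assumes the rows have been read into a pandas DataFrame with the schema's column names, and `find_unbalanced_transactions` is a hypothetical helper.

```python
import pandas as pd


def find_unbalanced_transactions(ledger_entries):
    """Return the transaction_ids whose debits and credits do not net to zero."""
    # Debits count as positive, credits as negative
    signed = ledger_entries['amount'].where(
        ledger_entries['entry_type'] == 'D', -ledger_entries['amount']
    )
    net = signed.groupby(ledger_entries['transaction_id']).sum()
    # Amounts are DECIMAL(15,2) in the schema; round to cents to absorb float noise
    return net[net.round(2) != 0].index.tolist()


# Made-up entries: transaction 1 balances, transaction 2 is off by 5.00
entries = pd.DataFrame({
    'transaction_id': [1, 1, 2, 2],
    'entry_type': ['D', 'C', 'D', 'C'],
    'amount': [100.00, 100.00, 50.00, 45.00],
})
unbalanced = find_unbalanced_transactions(entries)
```

Running a check like this after each load is cheaper than discovering an imbalance in a trial balance later.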

**Time Series Financial Data**:
```sql
-- Optimized for market and operational data
CREATE TABLE market_data (
    symbol VARCHAR(10),
    timestamp DATETIME(6),
    price DECIMAL(12,4),
    volume BIGINT,
    bid DECIMAL(12,4),
    ask DECIMAL(12,4),
    -- The primary key already covers (symbol, timestamp) lookups, so no separate index is needed
    PRIMARY KEY (symbol, timestamp)
) PARTITION BY RANGE (TO_DAYS(timestamp)) (
    PARTITION p202301 VALUES LESS THAN (TO_DAYS('2023-02-01')),
    PARTITION p202302 VALUES LESS THAN (TO_DAYS('2023-03-01'))
);

-- Financial statements with version control
CREATE TABLE financial_statements (
    company_id BIGINT,
    statement_type ENUM('BS', 'IS', 'CF'),
    reporting_period DATE,
    version_number INT DEFAULT 1,
    filed_date DATE,
    line_item VARCHAR(50),
    amount DECIMAL(15,2),
    currency_code CHAR(3) DEFAULT 'USD',
    PRIMARY KEY (company_id, statement_type, reporting_period, version_number, line_item),
    INDEX idx_period_type (reporting_period, statement_type)
);
```
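
Because restated filings are stored as new versions rather than overwriting earlier rows, most reports need only the latest version of each statement. A minimal sketch of that selection in pandas follows, assuming the table has been loaded into a DataFrame with the same column names; `latest_statement_versions` is an illustrative name and the rows are invented.

```python
import pandas as pd


def latest_statement_versions(statements):
    """Keep only rows belonging to the highest version of each statement."""
    stmt_key = ['company_id', 'statement_type', 'reporting_period']
    # Highest version per statement, broadcast back onto every row
    latest = statements.groupby(stmt_key)['version_number'].transform('max')
    return statements[statements['version_number'] == latest].reset_index(drop=True)


# Made-up rows: the 2023 income statement was restated once
statements = pd.DataFrame({
    'company_id': [1, 1, 1],
    'statement_type': ['IS', 'IS', 'IS'],
    'reporting_period': ['2023-12-31', '2023-12-31', '2022-12-31'],
    'version_number': [1, 2, 1],
    'line_item': ['revenue', 'revenue', 'revenue'],
    'amount': [100.0, 105.0, 90.0],
})
current = latest_statement_versions(statements)
```

Selecting the latest version per statement, rather than per line item, avoids mixing line items from different filings when a restatement drops or renames an item.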

### Data Validation and Quality Controls

**Python Data Validation Framework**:
```python
import numpy as np
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Check

# Financial data schema validation
financial_schema = DataFrameSchema({
    "company_id": Column(str, checks=Check.str_length(1, 10)),
    "reporting_date": Column(pd.Timestamp),
    "revenue": Column(float, checks=Check.greater_than(0)),
    "net_income": Column(float, nullable=True),
    "total_assets": Column(float, checks=Check.greater_than(0)),
    "current_ratio": Column(float, checks=Check.in_range(0, 10), nullable=True),
    "debt_to_equity": Column(float, checks=Check.greater_than_or_equal_to(0), nullable=True)
})

@pa.check_input(financial_schema)
def process_financial_data(df):
    """Process financial data with automatic validation"""
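    # Business logic validations (heuristic; net_income is nullable, so skip missing values)
    known_income = df['net_income'].notna()
    assert (df.loc[known_income, 'net_income'] <= df.loc[known_income, 'revenue']).all(), "Net income cannot exceed revenue"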
    assert not df[df['revenue'] > 0]['total_assets'].isna().any(), "Assets required when revenue > 0"
    
    # Calculate derived metrics
    df['profit_margin'] = df['net_income'] / df['revenue']
    df['roa'] = df['net_income'] / df['total_assets']
    
    return df

# Data quality monitoring
def assess_data_quality(df):
    """Comprehensive data quality assessment"""
    quality_report = {
        'completeness': (1 - df.isnull().sum() / len(df)) * 100,
        'duplicates': df.duplicated().sum(),
        'outliers': {},
        'consistency': {}
    }
    
    # Outlier detection using IQR
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    for col in numeric_columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = df[(df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)]
        quality_report['outliers'][col] = len(outliers)
    
    # Consistency checks
    if 'current_ratio' in df.columns:
        quality_report['consistency']['current_ratio_range'] = (
            df['current_ratio'].between(0, 10).sum() / len(df) * 100
        )
    
    return quality_report
```

### ETL Pipeline Architecture

**Data Transformation Pipeline**:
```python
class FinancialETLPipeline:
    """Production-ready ETL pipeline for financial data"""
    
    def __init__(self, config):
        self.config = config
        self.logger = self._setup_logging()
        
    def extract(self, source_config):
        """Extract data from various sources"""
        extractors = {
            'database': self._extract_from_db,
            'api': self._extract_from_api,
            'file': self._extract_from_file
        }
        
        data_frames = []
        for source in source_config:
            extractor = extractors[source['type']]
            df = extractor(source)
            data_frames.append(df)
            
        return pd.concat(data_frames, ignore_index=True)
    
    def transform(self, df):
        """Apply financial data transformations"""
        # Data cleaning
        df = self._clean_data(df)
        
        # Currency conversion
        df = self._convert_currencies(df)
        
        # Calculate financial metrics
        df = self._calculate_ratios(df)
        
        # Apply business rules
        df = self._apply_business_rules(df)
        
        # Data validation
        df = self._validate_data(df)
        
        return df
    
    def load(self, df, destination_config):
        """Load data to target systems"""
        if destination_config['type'] == 'database':
            self._load_to_database(df, destination_config)
        elif destination_config['type'] == 'data_warehouse':
            self._load_to_warehouse(df, destination_config)
        elif destination_config['type'] == 'api':
            self._load_to_api(df, destination_config)
    
    def _calculate_ratios(self, df):
        """Calculate comprehensive financial ratios"""
        # Profitability ratios
        df['gross_margin'] = (df['revenue'] - df['cogs']) / df['revenue'] * 100
        df['operating_margin'] = df['operating_income'] / df['revenue'] * 100
        df['net_margin'] = df['net_income'] / df['revenue'] * 100
        
        # Liquidity ratios
        df['current_ratio'] = df['current_assets'] / df['current_liabilities']
        df['quick_ratio'] = (df['current_assets'] - df['inventory']) / df['current_liabilities']
        
        # Leverage ratios
        df['debt_to_equity'] = df['total_debt'] / df['shareholders_equity']
        df['interest_coverage'] = df['ebit'] / df['interest_expense']
        
        # Efficiency ratios
        df['asset_turnover'] = df['revenue'] / df['total_assets']
        df['inventory_turnover'] = df['cogs'] / df['inventory']
        
        return df
```
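
The ratios in `_calculate_ratios` divide by columns such as `revenue`, `inventory`, and `interest_expense`, any of which can be zero in real filings. A minimal sketch of one way to guard against that, assuming a hypothetical helper named `safe_divide` (it is not part of the pipeline above) that yields NaN rather than `inf` for a zero denominator:

```python
import numpy as np
import pandas as pd


def safe_divide(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two Series, yielding NaN wherever the denominator is zero."""
    # Replacing 0 with NaN makes the division propagate NaN instead of inf
    return numerator / denominator.replace(0, np.nan)


# Hypothetical figures, for illustration only
sample = pd.DataFrame({
    "ebit": [500.0, 120.0, 80.0],
    "interest_expense": [50.0, 0.0, 20.0],
})
sample["interest_coverage"] = safe_divide(sample["ebit"], sample["interest_expense"])
```

Returning NaN keeps undefined ratios out of later averages and rankings, whereas `inf` would distort them.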

## 5. Advanced Financial Analytics Implementation

### Portfolio Risk Analytics
<!-- metadata: category=implementation, subcategory=risk, complexity=advanced, implementation_type=both, calculation_type=aggregate -->

**Value at Risk (VaR) Implementation**:
```python
def calculate_portfolio_risk_metrics(returns_df, confidence_levels=(0.95, 0.99)):
    """Calculate comprehensive portfolio risk metrics"""
    risk_metrics = {}
    
    for column in returns_df.columns:
        returns = returns_df[column].dropna()
        
        # Basic statistics, annualized assuming daily returns and 252 trading days per year
        risk_metrics[f'{column}_mean'] = returns.mean() * 252
        risk_metrics[f'{column}_volatility'] = returns.std() * np.sqrt(252)
        
        # Value at Risk
        for conf_level in confidence_levels:
            var_level = 1 - conf_level
            var_value = returns.quantile(var_level)
            risk_metrics[f'{column}_var_{int(conf_level*100)}'] = var_value
            
            # Expected Shortfall (Conditional VaR)
            es_value = returns[returns <= var_value].mean()
            risk_metrics[f'{column}_es_{int(conf_level*100)}'] = es_value
        
        # Maximum Drawdown
        cumulative_returns = (1 + returns).cumprod()
        running_max = cumulative_returns.expanding().max()
        drawdown = (cumulative_returns - running_max) / running_max
        risk_metrics[f'{column}_max_drawdown'] = drawdown.min()
        
        # Sharpe Ratio (assuming 0% risk-free rate)
        risk_metrics[f'{column}_sharpe'] = (
            risk_metrics[f'{column}_mean'] / risk_metrics[f'{column}_volatility']
        )
        
        # Sortino Ratio
        downside_returns = returns[returns < 0]
        downside_deviation = downside_returns.std() * np.sqrt(252)
        risk_metrics[f'{column}_sortino'] = risk_metrics[f'{column}_mean'] / downside_deviation
    
    return pd.DataFrame([risk_metrics])

```

**SQL Implementation for Portfolio Risk**:
```sql
-- Portfolio risk calculation in SQL
WITH daily_returns AS (
    SELECT 
        symbol,
        price_date,
        (price - LAG(price) OVER (PARTITION BY symbol ORDER BY price_date)) / 
        LAG(price) OVER (PARTITION BY symbol ORDER BY price_date) AS daily_return
    FROM stock_prices
),
risk_calculations AS (
    SELECT 
        symbol,
        AVG(daily_return) * 252 AS annualized_return,
        STDDEV(daily_return) * SQRT(252) AS annualized_volatility,
        PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY daily_return) AS var_95,
        PERCENTILE_CONT(0.01) WITHIN GROUP (ORDER BY daily_return) AS var_99
    FROM daily_returns
    WHERE daily_return IS NOT NULL
    GROUP BY symbol
)
SELECT 
    symbol,
    annualized_return,
    annualized_volatility,
    annualized_return / annualized_volatility AS sharpe_ratio,
    var_95,
    var_99
FROM risk_calculations;
```
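
To make the historical VaR, Expected Shortfall, and maximum drawdown definitions concrete, here is a small worked sketch on ten made-up daily returns. The numbers and the 90% confidence level are assumptions chosen so the arithmetic is easy to follow by hand; they are not market data.

```python
import pandas as pd

# Hypothetical daily returns, for illustration only
returns = pd.Series([0.01, -0.02, 0.03, -0.05, 0.00, 0.02, -0.03, 0.01, 0.04, -0.01])

confidence = 0.90
# Historical VaR: the (1 - confidence) quantile of the return distribution
var_90 = returns.quantile(1 - confidence)
# Expected Shortfall: average of the returns at or below the VaR threshold
es_90 = returns[returns <= var_90].mean()

# Maximum drawdown: largest peak-to-trough decline of cumulative wealth
wealth = (1 + returns).cumprod()
max_drawdown = (wealth / wealth.cummax() - 1).min()

print(f"VaR 90%: {var_90:.4f}, ES 90%: {es_90:.4f}, max drawdown: {max_drawdown:.4f}")
```

With so few observations the quantile is interpolated between the two worst returns, which is why production estimates typically rely on much longer return histories.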

### Advanced Financial Statement Analysis

**Financial Statement Ratio Analysis**:
```python
def comprehensive_financial_analysis(df):
    """Perform comprehensive financial statement analysis"""
    
    # Profitability Analysis
    profitability_metrics = {
        'gross_profit_margin': (df['revenue'] - df['cogs']) / df['revenue'] * 100,
        'operating_margin': df['operating_income'] / df['revenue'] * 100,
        'net_profit_margin': df['net_income'] / df['revenue'] * 100,
        'return_on_assets': df['net_income'] / df['total_assets'] * 100,
        'return_on_equity': df['net_income'] / df['shareholders_equity'] * 100,
        'return_on_invested_capital': df['nopat'] / df['invested_capital'] * 100
    }
    
    # Liquidity Analysis
    liquidity_metrics = {
        'current_ratio': df['current_assets'] / df['current_liabilities'],
        'quick_ratio': (df['current_assets'] - df['inventory']) / df['current_liabilities'],
        'cash_ratio': df['cash'] / df['current_liabilities'],
        'operating_cash_flow_ratio': df['operating_cash_flow'] / df['current_liabilities']
    }
    
    # Leverage Analysis
    leverage_metrics = {
        'debt_to_equity': df['total_debt'] / df['shareholders_equity'],
        'debt_to_assets': df['total_debt'] / df['total_assets'],
        'interest_coverage': df['ebit'] / df['interest_expense'],
        'debt_service_coverage': df['operating_cash_flow'] / df['debt_service']
    }
    
    # Efficiency Analysis
    efficiency_metrics = {
        'asset_turnover': df['revenue'] / df['total_assets'],
        'inventory_turnover': df['cogs'] / df['inventory'],
        'receivables_turnover': df['revenue'] / df['accounts_receivable'],
        'days_sales_outstanding': 365 / (df['revenue'] / df['accounts_receivable']),
        'days_inventory_outstanding': 365 / (df['cogs'] / df['inventory'])
    }
    
    # Combine all metrics
    all_metrics = {**profitability_metrics, **liquidity_metrics, 
                   **leverage_metrics, **efficiency_metrics}
    
    for metric_name, metric_value in all_metrics.items():
        df[metric_name] = metric_value
    
    # Calculate DuPont analysis (result is a fraction; multiply by 100 to compare with return_on_equity)
    df['dupont_roe'] = (df['net_profit_margin'] / 100) * df['asset_turnover'] * (df['total_assets'] / df['shareholders_equity'])
    
    return df
```
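
The DuPont line at the end of the function relies on the identity ROE = net margin × asset turnover × equity multiplier. A short sketch with made-up figures (all four inputs are assumptions for illustration) shows that the three-factor product matches ROE computed directly:

```python
# Hypothetical single-company figures, for illustration only
net_income = 120.0
revenue = 1_000.0
total_assets = 2_000.0
shareholders_equity = 800.0

net_margin = net_income / revenue                        # profitability
asset_turnover = revenue / total_assets                  # efficiency
equity_multiplier = total_assets / shareholders_equity   # leverage

dupont_roe = net_margin * asset_turnover * equity_multiplier
direct_roe = net_income / shareholders_equity

print(f"DuPont ROE: {dupont_roe:.2%}, direct ROE: {direct_roe:.2%}")
```

The decomposition is useful because two companies with the same ROE can reach it through very different mixes of margin, turnover, and leverage.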

### Multi-Period Comparative Analysis

**Trend Analysis and Forecasting**:
```python
def perform_trend_analysis(df, forecast_periods=4):
    """Perform trend analysis and basic forecasting"""
    from sklearn.linear_model import LinearRegression
    import numpy as np
    
    results = {}
    
    for company in df['company'].unique():
        # Copy the slice so adding the 'period' column does not trigger SettingWithCopyWarning
        company_data = df[df['company'] == company].sort_values('date').copy()
        
        if len(company_data) < 8:  # Need sufficient history
            continue
            
        # Prepare data for trend analysis
        company_data['period'] = range(len(company_data))
        
        metrics_to_analyze = ['revenue', 'net_income', 'total_assets', 'shareholders_equity']
        company_results = {}
        
        for metric in metrics_to_analyze:
            if metric not in company_data.columns:
                continue
                
            # Linear trend analysis
            X = company_data['period'].values.reshape(-1, 1)
            y = company_data[metric].values
            
            model = LinearRegression()
            model.fit(X, y)
            
            # Calculate trend metrics
            trend_slope = model.coef_[0]
            r_squared = model.score(X, y)
            
            # Forecast future periods
            future_periods = np.arange(len(company_data), len(company_data) + forecast_periods).reshape(-1, 1)
            forecasted_values = model.predict(future_periods)
            
            company_results[metric] = {
                'trend_slope': trend_slope,
                'r_squared': r_squared,
                'forecast': forecasted_values.tolist(),
                'trend_direction': 'Increasing' if trend_slope > 0 else 'Decreasing'
            }
        
        results[company] = company_results
    
    return results

```

**SQL Implementation for Trend Analysis**:
```sql
-- Multi-period trend analysis
WITH quarterly_metrics AS (
    SELECT 
        company_id,
        reporting_quarter,
        revenue,
        net_income,
        ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY reporting_quarter) AS period_number,
        -- Calculate period-over-period changes
        LAG(revenue, 1) OVER (PARTITION BY company_id ORDER BY reporting_quarter) AS prev_revenue,
        LAG(revenue, 4) OVER (PARTITION BY company_id ORDER BY reporting_quarter) AS yoy_revenue,
        -- Moving averages for trend smoothing
        AVG(revenue) OVER (
            PARTITION BY company_id 
            ORDER BY reporting_quarter 
            ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
        ) AS revenue_4q_avg
    FROM financial_statements
),
trend_analysis AS (
    SELECT 
        company_id,
        reporting_quarter,
        revenue,
        -- Growth rates
        ROUND(100.0 * (revenue - prev_revenue) / NULLIF(prev_revenue, 0), 2) AS qoq_growth,
        ROUND(100.0 * (revenue - yoy_revenue) / NULLIF(yoy_revenue, 0), 2) AS yoy_growth,
        -- Trend indicators
        CASE 
            WHEN revenue > revenue_4q_avg THEN 'Above Trend'
            WHEN revenue < revenue_4q_avg THEN 'Below Trend'
            ELSE 'On Trend'
        END AS trend_position,
        -- Growth consistency (rolling standard deviation of QoQ growth over 8 quarters)
        STDDEV(100.0 * (revenue - prev_revenue) / NULLIF(prev_revenue, 0)) OVER (
            PARTITION BY company_id 
            ORDER BY reporting_quarter 
            ROWS BETWEEN 7 PRECEDING AND CURRENT ROW
        ) AS growth_volatility
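    FROM quarterly_metrics
    -- Drop each company's first quarter, which has no prior period to compare against
    -- (prev_revenue is not exposed by this CTE, so the filter has to live here)
    WHERE prev_revenue IS NOT NULL
)
SELECT * FROM trend_analysis;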
```
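
For readers working in pandas rather than SQL, the same period-over-period logic can be sketched with `groupby` and `pct_change`. The company name and revenue figures below are made up for illustration, and the column names simply mirror the SQL above.

```python
import pandas as pd

# Hypothetical quarterly revenue for one company, for illustration only
quarterly = pd.DataFrame({
    "company_id": ["ACME"] * 6,
    "reporting_quarter": ["2023Q1", "2023Q2", "2023Q3", "2023Q4", "2024Q1", "2024Q2"],
    "revenue": [100.0, 110.0, 121.0, 115.0, 120.0, 132.0],
}).sort_values(["company_id", "reporting_quarter"])

by_company = quarterly.groupby("company_id")["revenue"]

# pct_change(n) compares each row with the row n periods earlier, like LAG(revenue, n)
quarterly["qoq_growth"] = (by_company.pct_change(1) * 100).round(2)
quarterly["yoy_growth"] = (by_company.pct_change(4) * 100).round(2)

# Trailing four-quarter average, matching ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
quarterly["revenue_4q_avg"] = by_company.transform(
    lambda s: s.rolling(window=4, min_periods=1).mean()
)
```

Grouping by `company_id` plays the role of `PARTITION BY`, so growth rates never compare one company's quarter with another's.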

## 6. Implementation Guidelines and Best Practices

### Performance Optimization

**Database Optimization**:
```sql
-- Covering index for financial queries (INCLUDE is PostgreSQL 11+ / SQL Server syntax)
CREATE INDEX idx_financial_composite ON financial_data 
(company_id, reporting_date, statement_type) 
INCLUDE (amount, currency);

-- Partitioned tables for large datasets (MySQL range-partitioning syntax)
CREATE TABLE market_data_partitioned (
    symbol VARCHAR(10),
    trade_date DATE,
    price DECIMAL(10,4),
    volume BIGINT
) PARTITION BY RANGE (YEAR(trade_date)) (
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025)
);

-- Materialized views for complex calculations (PostgreSQL/Oracle; not available in MySQL)
CREATE MATERIALIZED VIEW quarterly_ratios AS
SELECT 
    company_id,
    reporting_quarter,
    AVG(current_ratio) as avg_current_ratio,
    AVG(debt_to_equity) as avg_debt_to_equity,
    SUM(revenue) as total_revenue
FROM financial_statements
GROUP BY company_id, reporting_quarter;
```

**Python Performance Optimization**:
```python
# Efficient data processing for large datasets
def optimize_financial_processing(df):
    """Optimize DataFrame for memory and performance"""
    
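    # Downcast floats. float32 keeps only about 7 significant digits, so skip
    # this step for monetary amounts that need full precision.
    float_cols = df.select_dtypes(include=['float64']).columns
    df[float_cols] = df[float_cols].astype('float32')
    
    # Downcast integers to the smallest type that holds every value
    # (a blind cast to int32 would overflow above roughly 2.1 billion)
    int_cols = df.select_dtypes(include=['int64']).columns
    for col in int_cols:
        df[col] = pd.to_numeric(df[col], downcast='integer')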
    
    # Convert strings to categories for repeated values
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')
    
    return df

# Vectorized financial calculations
def vectorized_ratio_calculations(df):
    """Use NumPy for faster calculations"""
    # Convert to float NumPy arrays (a float dtype keeps np.divide working for integer columns)
    revenue = df['revenue'].to_numpy(dtype=float)
    net_income = df['net_income'].to_numpy(dtype=float)
    total_assets = df['total_assets'].to_numpy(dtype=float)
    
    # Vectorized ratio calculations; rows with a zero denominator are set to 0
    profit_margin = np.divide(net_income, revenue, out=np.zeros_like(net_income), where=revenue != 0)
    roa = np.divide(net_income, total_assets, out=np.zeros_like(net_income), where=total_assets != 0)
    
    df['profit_margin'] = profit_margin * 100
    df['roa'] = roa * 100
    
    return df
```

### Error Handling and Validation

**Comprehensive Validation Framework**:
```python
class FinancialDataValidator:
    """Comprehensive financial data validation"""
    
    def __init__(self):
        self.validation_rules = self._define_validation_rules()
        
    def validate_dataset(self, df):
        """Run all validation checks"""
        validation_results = {
            'passed': True,
            'errors': [],
            'warnings': [],
            'stats': {}
        }
        
        for rule_name, rule_func in self.validation_rules.items():
            try:
                result = rule_func(df)
                # Warnings and stats are recorded even when a rule passes
                validation_results['warnings'].extend(result.get('warnings', []))
                validation_results['stats'][rule_name] = result.get('stats', {})
                if not result['passed']:
                    validation_results['passed'] = False
                    validation_results['errors'].extend(result.get('errors', []))
            except Exception as e:
                validation_results['passed'] = False
                validation_results['errors'].append(f"Validation rule '{rule_name}' failed: {str(e)}")
        
        return validation_results
    
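    def _define_validation_rules(self):
        """Define comprehensive validation rules"""
        # Only _validate_balance_sheet is shown below; the other validators must be
        # implemented with the same return shape before the class can be instantiated
        return {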
            'balance_sheet_balance': self._validate_balance_sheet,
            'income_statement_logic': self._validate_income_statement,
            'ratio_reasonableness': self._validate_ratios,
            'data_completeness': self._validate_completeness,
            'temporal_consistency': self._validate_temporal_data
        }
    
    def _validate_balance_sheet(self, df):
        """Validate balance sheet equation"""
        required = ['total_assets', 'total_liabilities', 'shareholders_equity']
        if any(col not in df.columns for col in required):
            return {'passed': True, 'warnings': ['Balance sheet columns not found']}
        
        balance_check = abs(df['total_assets'] - (df['total_liabilities'] + df['shareholders_equity']))
        tolerance = df['total_assets'].abs() * 0.001  # 0.1% tolerance
        
        failed_records = balance_check > tolerance
        
        return {
            'passed': not failed_records.any(),
            'errors': [f"Balance sheet doesn't balance for {failed_records.sum()} records"] if failed_records.any() else [],
            'stats': {'failed_records': failed_records.sum()}
        }
```
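
The rule table in `_define_validation_rules` names validators such as `_validate_completeness` whose bodies are not shown. As a rough sketch of what such a rule could look like, the standalone function below returns the same `passed` / `errors` / `warnings` / `stats` shape. The function name, the `required_columns` argument, and the 5% `max_missing_pct` threshold are all assumptions made for illustration.

```python
import pandas as pd


def validate_completeness(df, required_columns, max_missing_pct=5.0):
    """Check that required columns exist and are mostly populated."""
    errors, warnings = [], []

    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        errors.append(f"Missing required columns: {missing_columns}")

    present = [col for col in required_columns if col in df.columns]
    # Percentage of null values per required column that is present
    missing_pct = df[present].isna().mean() * 100
    for col, pct in missing_pct.items():
        if pct > max_missing_pct:
            warnings.append(f"{col} is {pct:.1f}% missing")

    return {
        "passed": not errors,
        "errors": errors,
        "warnings": warnings,
        "stats": {"missing_pct": missing_pct.to_dict()},
    }


# Hypothetical data, for illustration only
sample = pd.DataFrame({
    "revenue": [100.0, None, 300.0, 400.0],
    "net_income": [10.0, 20.0, 30.0, 40.0],
})
result = validate_completeness(sample, ["revenue", "net_income", "total_assets"])
```

Here an absent column is treated as an error while a sparsely populated one only raises a warning; where to draw that line is a design choice that depends on the downstream calculations.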

This knowledge base covers corporate finance KPIs, time-based calculations, and their implementations in both SQL and Python, ranging from basic ratio calculations to advanced portfolio analytics. Each section is self-contained and pairs a definition with a practical example, which suits chunk-based ingestion into a vector database and retrieval from it.

## 7. Vector Database Optimization for DataRobot

### Document Structure for DataRobot Ingestion

**Metadata Schema for Enhanced Retrieval**:
```python
# Recommended metadata structure for each knowledge chunk
metadata_schema = {
    "category": "finance_term|kpi|calculation|implementation|best_practice",
    "subcategory": "profitability|liquidity|leverage|efficiency|valuation|time_series",
    "complexity": "basic|intermediate|advanced",
    "implementation_type": "sql|python|both",
    "data_frequency": "daily|monthly|quarterly|annual",
    "industry_applicability": "all|finance|tech|retail|manufacturing",
    "calculation_type": "ratio|growth|aggregate|trend|forecast",
    "dependencies": "list_of_required_fields",
    "performance_tier": "fast|medium|slow",
    "version": "1.0"
}
```

### Chunking Strategy for Optimal Retrieval

**Chunk Size Recommendations**:
- For standard embeddings (512 token limit): Keep chunks to ~400 tokens to allow for query expansion
- For long-context embeddings (8192 tokens): Use ~6000 token chunks for comprehensive context
- Each chunk should be self-contained with concept + implementation
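
The size limits above can be enforced with a simple sliding window. The sketch below is a rough illustration only: it counts whitespace-separated words as a stand-in for tokens (real tokenizers usually produce more tokens than words), and the `chunk_text` name and the 50-word overlap are assumptions rather than DataRobot requirements.

```python
def chunk_text(text, max_tokens=400, overlap=50):
    """Split text into overlapping chunks of at most max_tokens words."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")

    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        # Stop once this chunk reaches the end of the text
        if start + max_tokens >= len(words):
            break
    return chunks


# Hypothetical 1000-word document, for illustration only
document = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_text(document, max_tokens=400, overlap=50)
```

A fixed window like this can cut a definition away from its code, so splitting on section headings first and using the window only for oversized sections fits the self-contained guideline better.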

**Example Chunk Structure**:
````markdown
## [CHUNK_START: EBITDA_CALCULATION]
### Metadata
- category: kpi
- subcategory: profitability
- complexity: intermediate
- implementation_type: both
- data_frequency: quarterly
- calculation_type: aggregate

### EBITDA (Earnings Before Interest, Taxes, Depreciation, and Amortization)

**Definition**: EBITDA measures operating profitability before interest, taxes, and the non-cash charges of depreciation and amortization.

**Formula**: `EBITDA = Net Income + Interest + Taxes + Depreciation + Amortization`

**SQL Implementation**:
```sql
SELECT 
    company_id,
    net_income + interest_expense + tax_expense + depreciation + amortization AS ebitda
FROM income_statement;
```

**Python Implementation**:
```python
def calculate_ebitda(df):
    df['ebitda'] = df[['net_income', 'interest_expense', 'tax_expense', 
                       'depreciation', 'amortization']].sum(axis=1)
    return df
```

**Industry Benchmarks**: As a rough rule of thumb, an EBITDA margin of 10-20% is often considered good and above 20% excellent, though typical levels vary widely by industry.
## [CHUNK_END: EBITDA_CALCULATION]
````

### DataRobot-Specific CSV Format

**Required CSV Structure for Upload**:
```python
import pandas as pd
import hashlib

def prepare_datarobot_csv(knowledge_base_chunks):
    """Prepare knowledge base for DataRobot vector database"""
    
    datarobot_data = []
    
    for chunk in knowledge_base_chunks:
        # Generate unique document ID
        doc_id = hashlib.md5(chunk['content'].encode()).hexdigest()
        
        # Create row with required columns
        row = {
            'document': chunk['content'],  # Required: Main text content
            'document_file_path': f"finance_kb/{chunk['category']}/{doc_id}",  # Required: Reference ID
            
            # Metadata columns (up to 50 allowed)
            'category': chunk['category'],
            'subcategory': chunk['subcategory'],
            'complexity': chunk['complexity'],
            'implementation_type': chunk['implementation_type'],
            'keywords': chunk.get('keywords', ''),
            'dependencies': chunk.get('dependencies', ''),
            'last_updated': chunk.get('last_updated', '2024-01-01')
        }
        
        datarobot_data.append(row)
    
    # Convert to DataFrame
    df = pd.DataFrame(datarobot_data)
    
    # Validate required columns
    assert 'document' in df.columns, "Missing required 'document' column"
    assert 'document_file_path' in df.columns, "Missing required 'document_file_path' column"
    
    return df

# Example usage
chunks = [
    {
        'content': 'EBITDA calculation content...',
        'category': 'kpi',
        'subcategory': 'profitability',
        'complexity': 'intermediate',
        'implementation_type': 'both',
        'keywords': 'ebitda,profitability,operating income'
    }
]

datarobot_df = prepare_datarobot_csv(chunks)
datarobot_df.to_csv('finance_knowledge_base.csv', index=False)
```

### Embedding Model Selection Guide

**For Finance Knowledge Base**:
```python
def select_embedding_model(content_characteristics):
    """Select optimal embedding model based on content"""
    
    if content_characteristics['language'] == 'english':
        if content_characteristics['avg_chunk_size'] > 2000:
            # For long financial documents/reports
            return 'jinaai/jina-embedding-s-en-v2'  # accepts inputs up to 8192 tokens
        elif content_characteristics['query_latency_requirement'] == 'low':
            # For real-time financial calculations
            return 'jinaai/jina-embedding-t-en-v1'  # 14M params, fastest
        else:
            # Balanced performance for most use cases
            return 'intfloat/e5-base-v2'  # 768 output dimension
    
    elif content_characteristics['language'] == 'multilingual':
        if content_characteristics['query_latency_requirement'] == 'low':
            return 'huggingface.co/intfloat/multilingual-e5-small'
        else:
            return 'huggingface.co/intfloat/multilingual-e5-base'
    
    # Default fallback
    return 'jinaai/jina-embedding-t-en-v1'
```

### Query Optimization Strategies

**Metadata Filtering for Efficient Retrieval**:
```python
def construct_filtered_query(user_query, filters):
    """Construct query with metadata filters for DataRobot"""
    
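    # Candidate metadata filters; keys the caller did not supply are dropped
    # so that None is never passed through as a filter value
    candidate_filters = {
        'source': filters.get('document_file_path'),  # Note: displayed as 'source' in DataRobot
        'category': filters.get('category'),
        'subcategory': filters.get('subcategory'),
        'complexity': filters.get('complexity'),
        'implementation_type': filters.get('implementation_type'),
        'calculation_type': filters.get('calculation_type')
    }
    
    query_config = {
        'query': user_query,
        'filters': {key: value for key, value in candidate_filters.items() if value is not None},
        'top_k': filters.get('top_k', 5)
    }
    
    return query_config


# Example filters for different use cases, e.g.
# construct_filtered_query("revenue trend", use_case_filters['time_series_calculations'])
use_case_filters = {
    'basic_sql_only': {
        'complexity': 'basic',
        'implementation_type': 'sql'
    },
    'advanced_python_analytics': {
        'complexity': 'advanced',
        'implementation_type': 'python',
        'category': 'implementation'
    },
    'time_series_calculations': {
        'subcategory': 'time_series',
        'calculation_type': 'trend'
    }
}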
```

### Content Preprocessing for Optimal Embeddings

**Text Preparation Pipeline**:
```python
def preprocess_finance_content(text):
    """Preprocess financial content for embedding"""
    import re
    
    # Set code blocks aside first so the substitutions below never alter code
    code_blocks = []
    
    def _stash_code_block(match):
        code_blocks.append(match.group(0))
        return f'\x00CODE_BLOCK_{len(code_blocks) - 1}\x00'
    
    text = re.sub(r'```[\s\S]*?```', _stash_code_block, text)
    
    # Preserve important financial notation
    text = re.sub(r'\$(\d+(?:,\d{3})*(?:\.\d{2})?)', r'CURRENCY_\1', text)
    text = re.sub(r'(\d+(?:\.\d+)?)%', r'PERCENT_\1', text)
    
    # Standardize financial terms (matched case-insensitively)
    replacements = {
        'year-over-year': 'YoY',
        'year-to-date': 'YTD',
        'quarter-over-quarter': 'QoQ',
        'return on equity': 'ROE',
        'return on assets': 'ROA',
        'price to earnings': 'P/E'
    }
    
    for old, new in replacements.items():
        text = re.sub(re.escape(old), new, text, flags=re.IGNORECASE)
    
    # Clean whitespace while preserving structure
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r' {2,}', ' ', text)
    
    # Restore code blocks (delimited placeholders keep CODE_BLOCK_1 distinct from CODE_BLOCK_10)
    for i, block in enumerate(code_blocks):
        text = text.replace(f'\x00CODE_BLOCK_{i}\x00', block)
    
    return text.strip()
```

### Performance Considerations

**Batch Processing for Large Knowledge Bases**:
```python
def batch_process_knowledge_base(documents, batch_size=100):
    """Process large knowledge bases in batches for DataRobot"""
    
    total_docs = len(documents)
    batches = []
    
    for i in range(0, total_docs, batch_size):
        batch = documents[i:i + batch_size]
        
        # Process batch
        processed_batch = []
        for doc in batch:
            processed_doc = {
                'document': preprocess_finance_content(doc['content']),
                'document_file_path': doc['path'],
                **doc['metadata']  # Spread metadata columns
            }
            processed_batch.append(processed_doc)
        
        batches.append(pd.DataFrame(processed_batch))
        
        # Log progress
        print(f"Processed {min(i + batch_size, total_docs)}/{total_docs} documents")
    
    # Combine all batches
    final_df = pd.concat(batches, ignore_index=True)
    
    # Validate final structure
    assert len(final_df.columns) <= 52, "Exceeded 50 metadata columns limit"
    
    return final_df
```