Commit c51e926 by CraigRoberts15 (0 parents)

Initial commit: Business Intelligence Dashboard with Git LFS
.gitattributes ADDED
@@ -0,0 +1,5 @@
# Auto detect text files and perform LF normalization
* text=auto
data/*.csv filter=lfs diff=lfs merge=lfs -text
data/*.xlsx filter=lfs diff=lfs merge=lfs -text
data/*.xls filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,232 @@
# 📊 Business Intelligence Dashboard

A professional, interactive Business Intelligence dashboard built with Gradio that enables non-technical stakeholders to explore and analyze business data. The application allows users to upload datasets, apply filters, generate visualizations, and extract actionable insights, all through an intuitive web interface.

## 🌟 Features

### 📂 Data Management
- **Pre-loaded Datasets**: Online Retail and Airbnb datasets included
- **Custom Upload**: Support for CSV, Excel (.xlsx, .xls), JSON, and Parquet files (max 50MB)
- **Automatic Data Cleaning**: Handles missing values, type conversions, and duplicate removal
- **Data Validation**: Comprehensive error handling and user-friendly error messages

### 📈 Statistics & Profiling
- **Automated Data Profiling**: Get instant insights into your dataset
- **Numerical Summary**: Mean, median, standard deviation, quartiles, min/max
- **Categorical Analysis**: Unique values, value counts, mode
- **Missing Values Report**: Identify data quality issues
- **Correlation Matrix**: Visual correlation heatmap for numerical features

### 🔍 Interactive Filtering
- **Dynamic Filters**: Filter by numerical ranges, categorical values, or date ranges
- **Real-time Updates**: See row counts update as you apply filters
- **Multiple Filters**: Combine multiple filters for precise data exploration
- **Filter Management**: Easy to add, view, and clear filters

### 📉 Smart Visualizations
- **AI-Powered Recommendations**: Get intelligent visualization suggestions based on your data
- **One-Click Creation**: Create recommended visualizations with a single click
- **5 Visualization Types**:
  - Time Series Plots (with aggregation: sum, mean, count, median)
  - Distribution Plots (histogram, box plot)
  - Category Analysis (bar chart, pie chart)
  - Scatter Plots (with color coding and trend lines)
  - Correlation Heatmap
- **Dual Backend**: Supports both Matplotlib and Plotly
- **Customization**: Full control over columns, aggregations, and visual parameters

### 💡 Automated Insights
- **Top/Bottom Performers**: Identify highest and lowest values
- **Trend Analysis**: Detect patterns over time with growth rate and volatility
- **Anomaly Detection**: Find outliers using Z-score or IQR methods
- **Distribution Analysis**: Understand data distributions with skewness and kurtosis
- **Correlation Insights**: Discover strong relationships between variables

### 💾 Export Capabilities
- **Data Export**: Export filtered data as CSV or Excel
- **Visualization Export**: Save charts as PNG images

## 🏗️ Architecture & Design

### SOLID Principles Implementation
- **Single Responsibility**: Each class has one clear purpose
- **Open/Closed**: Extensible through the Strategy Pattern without modifying existing code
- **Liskov Substitution**: All strategies are interchangeable
- **Interface Segregation**: Specific interfaces for different operations
- **Dependency Inversion**: Depends on abstractions, not concrete implementations

### Design Patterns
- **Strategy Pattern**: Used for data loading, visualizations, and insights
- **Facade Pattern**: DataProcessor provides a simple interface to complex operations
- **Factory Pattern**: Dynamic strategy selection based on file type (see the sketch below)

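To make the Strategy and Factory bullets concrete, here is a minimal, illustrative sketch of how file-type-driven loader selection can look. The names `LoadStrategy`, `CSVLoadStrategy`, `ExcelLoadStrategy`, and `get_load_strategy` are hypothetical; the real implementation lives in `data_processor.py` and may differ in detail.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import pandas as pd

class LoadStrategy(ABC):
    """Strategy interface: one loader per file format."""
    @abstractmethod
    def load(self, path: str) -> pd.DataFrame: ...

class CSVLoadStrategy(LoadStrategy):
    def load(self, path: str) -> pd.DataFrame:
        return pd.read_csv(path)

class ExcelLoadStrategy(LoadStrategy):
    def load(self, path: str) -> pd.DataFrame:
        return pd.read_excel(path)

def get_load_strategy(path: str) -> LoadStrategy:
    """Factory: choose a strategy from the file extension."""
    strategies = {
        ".csv": CSVLoadStrategy(),
        ".xlsx": ExcelLoadStrategy(),
        ".xls": ExcelLoadStrategy(),
    }
    suffix = Path(path).suffix.lower()
    if suffix not in strategies:
        raise ValueError(f"Unsupported file type: {suffix}")
    return strategies[suffix]

# Usage: df = get_load_strategy("data/Airbnb.csv").load("data/Airbnb.csv")
```

Because every loader exposes the same `load()` method, a new format (e.g. Parquet) can be added without touching existing code, which is the Open/Closed point made above.
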
### Project Structure
```
Business-Intelligence-Dashboard/
├── app.py                   # Main Gradio application with 6 tabs
├── data_processor.py        # Data loading, cleaning, filtering (Strategy Pattern)
├── visualizations.py        # Chart creation with multiple strategies
├── insights.py              # Automated insight generation
├── utils.py                 # Utility functions and validators
├── requirements.txt         # Python dependencies
├── README.md                # This file
├── data/                    # Sample datasets
│   ├── Online_Retail.xlsx
│   └── Airbnb.csv
└── tests/                   # Comprehensive test suite
    ├── __init__.py
    ├── conftest.py
    ├── test_utils.py
    ├── test_data_processor.py
    ├── test_visualizations.py
    └── test_insights.py
```

## 🚀 Getting Started

### Prerequisites
- Python 3.8 or higher
- pip package manager

### Installation

1. **Clone the repository**
   ```bash
   git clone https://github.com/YOUR_USERNAME/Business-Intelligence-Dashboard.git
   cd Business-Intelligence-Dashboard
   ```

2. **Create a virtual environment**
   ```bash
   # On macOS/Linux
   python3 -m venv venv
   source venv/bin/activate

   # On Windows
   python -m venv venv
   venv\Scripts\activate
   ```

3. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```

4. **Run the application**
   ```bash
   python app.py
   ```

The dashboard will launch and open in your default browser at `http://localhost:7860`.
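If you need to serve the app differently (another port, binding to all interfaces, or a temporary public link), the `gr.Blocks` object returned by `create_dashboard()` in `app.py` can be launched with explicit options. This is a minimal sketch; it assumes the launch call at the bottom of `app.py` is guarded by `if __name__ == "__main__":`, and the flags shown are standard Gradio `launch()` parameters rather than anything this project documents.

```python
# run_dashboard.py - illustrative entry point (the shipped entry point is `python app.py`)
from app import create_dashboard

demo = create_dashboard()
# server_name / server_port / share are standard gradio Blocks.launch() options
demo.launch(server_name="0.0.0.0", server_port=7860, share=False)
```
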

## 📖 Usage Guide

### 1. Loading Data
- **Option A**: Select "Online Retail" or "Airbnb" from the dropdown
- **Option B**: Upload your own dataset (CSV, Excel, JSON, or Parquet); the same loading pipeline can also be called from code, as sketched below

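The loading path the dropdown uses can be exercised directly from Python. A minimal sketch using the `DataProcessor` facade exactly as `app.py` calls it (see `data_processor.py` for the full signature):

```python
from data_processor import DataProcessor

processor = DataProcessor()
# load_and_prepare_data() is the call app.py makes for both bundled and uploaded files.
# app.py later calls apply_filters() on the same processor without re-passing the frame,
# so the loaded DataFrame also appears to be kept inside the processor.
df = processor.load_and_prepare_data("data/Airbnb.csv")
print(df.shape)
```
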
### 2. Exploring Statistics
- Navigate to the "Statistics & Profiling" tab
- Click "Generate Data Profile" to see comprehensive statistics
- View missing values, numerical summaries, and the correlation matrix

### 3. Filtering Data
- Go to the "Filter & Explore" tab
- Select the filter type (Numerical, Categorical, or Date)
- Choose a column and set the filter criteria
- Click "Add Filter" and see real-time updates (filters use the dictionary format sketched below)

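Each filter added in the UI is stored as a plain dictionary and handed to `DataProcessor.apply_filters()`; the keys below mirror the ones built in the `add_filter` handler in `app.py`. A sketch of the same thing done programmatically (the column names are illustrative Airbnb-style names and are not guaranteed to match the bundled file):

```python
from data_processor import DataProcessor

processor = DataProcessor()
processor.load_and_prepare_data("data/Airbnb.csv")

# One dict per filter; 'type' is 'numerical', 'categorical', or 'date'
filters = [
    {"column": "price", "type": "numerical", "min_val": 50, "max_val": 300},
    {"column": "room_type", "type": "categorical", "values": ["Entire home/apt"]},
    # {"column": "last_review", "type": "date", "start_date": "2024-01-01", "end_date": "2024-06-30"},
]
filtered_df = processor.apply_filters(filters)
print(f"{len(filtered_df)} rows after filtering")
```
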
### 4. Creating Visualizations
- Navigate to the "Visualizations" tab
- **Smart Recommendations**: Click "Get Visualization Recommendations" for AI-powered suggestions
- **Custom Visualizations**: Select the visualization type and configure parameters
- Supported charts: Time Series, Distribution, Category, Scatter, Correlation (a programmatic sketch follows below)

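The charts are produced by `VisualizationManager.create_visualization()`, called with the same keyword arguments `app.py` uses. A self-contained sketch with synthetic data; swap in your own DataFrame and column names:

```python
import pandas as pd
from visualizations import VisualizationManager

# Synthetic stand-in data; any DataFrame with a date column and a numeric column works
df = pd.DataFrame({
    "InvoiceDate": pd.date_range("2024-01-01", periods=90, freq="D"),
    "UnitPrice": range(90),
})

viz = VisualizationManager()
fig = viz.create_visualization(
    "time_series", df,
    date_column="InvoiceDate",
    value_column="UnitPrice",
    aggregation="sum",
    backend="matplotlib",   # Plotly is also supported per the dual-backend feature above
)
```

The other chart types (`'distribution'`, `'category'`, `'scatter'`, `'correlation'`) take the column and plot-type arguments shown in `app.py`.
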
### 5. Generating Insights
- Go to the "Insights" tab
- Click "Generate All Insights" for automated analysis
- Or select a specific insight type for targeted analysis (see the sketch below)

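Insights come from `InsightManager`; the calls below are the ones the Insights tab wires to its buttons in `app.py` (the column name is taken from the Online Retail schema and is only an example):

```python
from data_processor import DataProcessor
from insights import InsightManager

df = DataProcessor().load_and_prepare_data("data/Online_Retail.xlsx")
manager = InsightManager()

# Full automated report, as produced by the "Generate All Insights" button
insights = manager.generate_all_insights(df)
print(manager.format_insight_report(insights))

# A single targeted insight, e.g. anomaly detection on one numeric column
anomalies = manager.generate_insight("anomaly", df, column="UnitPrice")
print(anomalies.get("summary", "No summary available"))
```
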
### 6. Exporting Results
- Navigate to the "Export" tab
- Choose a format (CSV or Excel)
- Click "Export Data" to download the filtered dataset (see the exporter sketch below)

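The Export tab uses the `CSVExporter` and `ExcelExporter` classes from `utils.py`; `export(df, path)` is the call made in `app.py`. A small stand-alone sketch with stand-in data:

```python
import pandas as pd
from utils import CSVExporter, ExcelExporter

filtered_df = pd.DataFrame({"Country": ["UK", "FR"], "Quantity": [10, 4]})  # stand-in data

CSVExporter().export(filtered_df, "filtered_data.csv")
ExcelExporter().export(filtered_df, "filtered_data.xlsx")
```
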
## 🧪 Testing

Run the comprehensive test suite:
```bash
# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_utils.py -v

# Run with coverage
pytest tests/ --cov=. --cov-report=html
```

Test coverage includes:
- **180+ test cases** across all modules
- Unit tests for all functions and classes (an example is sketched below)
- Strategy Pattern implementation tests
- Edge case and error handling tests

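For orientation, a test in this suite might look roughly like the following. It is an illustrative sketch rather than one of the shipped tests; `get_column_types` and its three keys are taken from how `app.py` uses the helper, and the exact bucketing of each column is an assumption.

```python
# Illustrative sketch in the style of tests/test_utils.py
import pandas as pd
from utils import get_column_types

def test_get_column_types_buckets_columns():
    df = pd.DataFrame({
        "price": [1.0, 2.5, 3.0],
        "country": ["UK", "FR", "DE"],
        "when": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    })
    result = get_column_types(df)
    assert "price" in result["numerical"]      # assumed bucketing
    assert "country" in result["categorical"]  # assumed bucketing
    assert "when" in result["datetime"]        # assumed bucketing
```
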
## 🛠️ Technologies Used

- **Gradio**: Web interface and interactive components
- **Pandas**: Data manipulation and analysis
- **NumPy**: Numerical computations
- **Matplotlib/Seaborn**: Static visualizations
- **Plotly**: Interactive visualizations
- **Python 3.8+**: Core programming language

## 📊 Sample Datasets

### Online Retail Dataset
- **8 columns**: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country
- **Use case**: E-commerce sales analysis, product trends, customer analysis

### Airbnb Dataset
- **26 columns**: Including price, location, room type, reviews, availability
- **Use case**: Pricing analysis, location trends, booking patterns (a quick-look sketch follows below)

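Both files ship in `data/` (tracked with Git LFS per the `.gitattributes` above), so a quick look with pandas is enough to orient yourself before opening the dashboard:

```python
import pandas as pd

retail = pd.read_excel("data/Online_Retail.xlsx")   # requires openpyxl
airbnb = pd.read_csv("data/Airbnb.csv")

print(retail.columns.tolist())   # InvoiceNo, StockCode, ..., Country
print(airbnb.shape)              # rows x 26 listing attributes
```
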
## 🤝 Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

### Development Guidelines
- Follow PEP 8 style guidelines
- Add docstrings to all functions
- Include unit tests for new features
- Update README.md for significant changes

## 👨‍💻 Author

**Craig Roberts**

## 🙏 Acknowledgments

- Northeastern University - CS5130 Course (Prof. Lino)
- Dataset sources: UCI ML Repository, Kaggle

## ⚡ Performance Notes

- Handles datasets up to 50MB efficiently
- Optimized for 1,000-10,000 rows
- Tested with datasets containing 100+ columns
- Real-time filtering with sub-second response times

## 🐛 Known Issues

- Large datasets (>100MB) may cause memory issues
- Some complex visualizations may take time to render
- Browser storage not available (by design, for security)

---
app.py ADDED
@@ -0,0 +1,1379 @@
1
+ """
2
+ Business Intelligence Dashboard - Main Gradio Application
3
+
4
+ This application provides an interactive BI dashboard with automated insights,
5
+ visualizations, and data exploration capabilities.
6
+
7
+ Author: Craig
8
+ Date: December 2024
9
+ """
10
+
11
+ import gradio as gr
12
+ import pandas as pd
13
+ import numpy as np
14
+ import matplotlib.pyplot as plt
15
+ from pathlib import Path
16
+ from typing import Optional, Dict, Any, List, Tuple
17
+ import logging
18
+
19
+ from data_processor import DataProcessor, DataProfiler, DataFilter
20
+ from visualizations import VisualizationManager, save_visualization
21
+ from insights import InsightManager
22
+ from utils import (
23
+ get_column_types, format_number, format_percentage,
24
+ Config, CSVExporter, ExcelExporter
25
+ )
26
+
27
+ # Configure logging
28
+ logging.basicConfig(level=logging.INFO)
29
+ logger = logging.getLogger(__name__)
30
+
31
+ # Global state management
32
+ class AppState:
33
+ """
34
+ Manages application state across tabs.
35
+ Follows Single Responsibility Principle - only manages state.
36
+ """
37
+
38
+ def __init__(self):
39
+ self.processor = DataProcessor()
40
+ self.viz_manager = VisualizationManager()
41
+ self.insight_manager = InsightManager()
42
+
43
+ # Available datasets
44
+ self.datasets = {
45
+ 'Online Retail': 'data/Online_Retail.xlsx',
46
+ 'Airbnb': 'data/Airbnb.csv'
47
+ }
48
+
49
+ # Current session data
50
+ self.current_dataset_name = None
51
+ self.current_df = None
52
+ self.filtered_df = None
53
+ self.active_filters = []
54
+ self.current_recommendations = None
55
+
56
+ def load_dataset(self, dataset_name: str, file_path: Optional[str] = None) -> Tuple[pd.DataFrame, str]:
57
+ """
58
+ Load dataset by name or from uploaded file.
59
+
60
+ Args:
61
+ dataset_name: Name of dataset to load
62
+ file_path: Optional path to uploaded file
63
+
64
+ Returns:
65
+ Tuple of (DataFrame, status_message)
66
+ """
67
+ try:
68
+ if file_path:
69
+ # Load uploaded file
70
+ df = self.processor.load_and_prepare_data(file_path)
71
+ self.current_dataset_name = f"Uploaded: {Path(file_path).name}"
72
+ else:
73
+ # Load predefined dataset
74
+ if dataset_name not in self.datasets:
75
+ return None, f"❌ Dataset '{dataset_name}' not found"
76
+
77
+ file_path = self.datasets[dataset_name]
78
+ df = self.processor.load_and_prepare_data(file_path)
79
+ self.current_dataset_name = dataset_name
80
+
81
+ self.current_df = df
82
+ self.filtered_df = df.copy()
83
+ self.active_filters = []
84
+ self.current_recommendations = None
85
+
86
+ message = f"✅ Successfully loaded '{self.current_dataset_name}' - {len(df)} rows, {len(df.columns)} columns"
87
+ logger.info(message)
88
+ return df, message
89
+
90
+ except Exception as e:
91
+ error_msg = f"❌ Error loading dataset: {str(e)}"
92
+ logger.error(error_msg)
93
+ return None, error_msg
94
+
95
+ def get_column_info(self) -> Dict[str, List[str]]:
96
+ """Get categorized column information."""
97
+ if self.current_df is None:
98
+ return {'numerical': [], 'categorical': [], 'datetime': []}
99
+ return get_column_types(self.current_df)
100
+
101
+ def apply_filters(self, filters: List[Dict]) -> pd.DataFrame:
102
+ """Apply filters to current dataset."""
103
+ if self.current_df is None:
104
+ return None
105
+
106
+ self.active_filters = filters
107
+ self.filtered_df = self.processor.apply_filters(filters)
108
+ return self.filtered_df
109
+
110
+ def reset_filters(self) -> pd.DataFrame:
111
+ """Reset all filters."""
112
+ if self.current_df is None:
113
+ return None
114
+
115
+ self.filtered_df = self.current_df.copy()
116
+ self.active_filters = []
117
+ return self.filtered_df
118
+
119
+
120
+ # Initialize global state
121
+ app_state = AppState()
122
+
123
+
124
+ # ============================================================================
125
+ # SMART VISUALIZATION RECOMMENDATIONS
126
+ # ============================================================================
127
+
128
+ class SmartVisualizationRecommender:
129
+ """
130
+ Recommends best visualizations based on data characteristics.
131
+ Follows Single Responsibility Principle - only handles recommendations.
132
+ """
133
+
134
+ @staticmethod
135
+ def analyze_dataset(df: pd.DataFrame) -> Dict[str, Any]:
136
+ """
137
+ Analyze dataset and recommend visualizations.
138
+
139
+ Args:
140
+ df: DataFrame to analyze
141
+
142
+ Returns:
143
+ Dict with recommendations
144
+ """
145
+ column_types = get_column_types(df)
146
+ recommendations = []
147
+
148
+ # Time Series Recommendations
149
+ if len(column_types['datetime']) > 0 and len(column_types['numerical']) > 0:
150
+ recommendations.append({
151
+ 'type': 'time_series',
152
+ 'priority': 'high',
153
+ 'reason': 'Detected date and numerical columns - perfect for trend analysis',
154
+ 'suggested_params': {
155
+ 'date_column': column_types['datetime'][0],
156
+ 'value_column': column_types['numerical'][0],
157
+ 'aggregation': 'sum'
158
+ }
159
+ })
160
+
161
+ # Correlation Heatmap Recommendations
162
+ if len(column_types['numerical']) >= 3:
163
+ recommendations.append({
164
+ 'type': 'correlation',
165
+ 'priority': 'high',
166
+ 'reason': f'Found {len(column_types["numerical"])} numerical columns - great for correlation analysis',
167
+ 'suggested_params': {}
168
+ })
169
+
170
+ # Category Analysis Recommendations
171
+ if len(column_types['categorical']) > 0:
172
+ cat_col = column_types['categorical'][0]
173
+ unique_count = df[cat_col].nunique()
174
+
175
+ if unique_count <= 10:
176
+ recommendations.append({
177
+ 'type': 'category',
178
+ 'priority': 'high',
179
+ 'reason': f'Found categorical column "{cat_col}" with {unique_count} categories',
180
+ 'suggested_params': {
181
+ 'column': cat_col,
182
+ 'plot_type': 'bar'
183
+ }
184
+ })
185
+
186
+ # Distribution Recommendations
187
+ if len(column_types['numerical']) > 0:
188
+ recommendations.append({
189
+ 'type': 'distribution',
190
+ 'priority': 'medium',
191
+ 'reason': 'Numerical data available - useful for understanding value distribution',
192
+ 'suggested_params': {
193
+ 'column': column_types['numerical'][0],
194
+ 'plot_type': 'histogram'
195
+ }
196
+ })
197
+
198
+ # Scatter Plot Recommendations
199
+ if len(column_types['numerical']) >= 2:
200
+ recommendations.append({
201
+ 'type': 'scatter',
202
+ 'priority': 'medium',
203
+ 'reason': 'Multiple numerical columns - explore relationships between variables',
204
+ 'suggested_params': {
205
+ 'x_column': column_types['numerical'][0],
206
+ 'y_column': column_types['numerical'][1]
207
+ }
208
+ })
209
+
210
+ # Sort by priority
211
+ priority_order = {'high': 0, 'medium': 1, 'low': 2}
212
+ recommendations.sort(key=lambda x: priority_order[x['priority']])
213
+
214
+ return {
215
+ 'column_types': column_types,
216
+ 'recommendations': recommendations,
217
+ 'summary': SmartVisualizationRecommender._generate_summary(recommendations)
218
+ }
219
+
220
+ @staticmethod
221
+ def _generate_summary(recommendations: List[Dict]) -> str:
222
+ """Generate human-readable summary of recommendations."""
223
+ if not recommendations:
224
+ return "No specific visualization recommendations available."
225
+
226
+ high_priority = [r for r in recommendations if r['priority'] == 'high']
227
+
228
+ if high_priority:
229
+ summary = f"🎯 **Top Recommendation**: {high_priority[0]['type'].replace('_', ' ').title()}\n"
230
+ summary += f"💡 {high_priority[0]['reason']}\n\n"
231
+
232
+ if len(high_priority) > 1:
233
+ summary += f"Also recommended: {', '.join([r['type'].replace('_', ' ').title() for r in high_priority[1:]])}"
234
+ else:
235
+ summary = f"📊 Recommended: {recommendations[0]['type'].replace('_', ' ').title()}"
236
+
237
+ return summary
238
+
239
+
240
+ # ============================================================================
241
+ # TAB 1: DATASET SELECTION
242
+ # ============================================================================
243
+
244
+ def create_dataset_tab():
245
+ """Create dataset selection and preview tab."""
246
+
247
+ with gr.Tab("📊 Dataset Selection"):
248
+ gr.Markdown("## Select or Upload Dataset")
249
+ gr.Markdown("Choose from pre-loaded datasets or upload your own (CSV, Excel, JSON, Parquet)")
250
+
251
+ with gr.Row():
252
+ with gr.Column(scale=1):
253
+ dataset_dropdown = gr.Dropdown(
254
+ choices=list(app_state.datasets.keys()),
255
+ label="Pre-loaded Datasets",
256
+ value=None
257
+ )
258
+
259
+ load_btn = gr.Button("📂 Load Selected Dataset", variant="primary")
260
+
261
+ gr.Markdown("### OR Upload Your Own Dataset")
262
+ file_upload = gr.File(
263
+ label="Upload Dataset (Max 50MB)",
264
+ file_types=[".csv", ".xlsx", ".xls", ".json", ".parquet"]
265
+ )
266
+
267
+ upload_btn = gr.Button("📤 Upload & Process", variant="secondary")
268
+
269
+ with gr.Column(scale=1):
270
+ status_box = gr.Textbox(
271
+ label="Status",
272
+ value="No dataset loaded",
273
+ interactive=False,
274
+ lines=3
275
+ )
276
+
277
+ dataset_info = gr.Textbox(
278
+ label="Dataset Information",
279
+ value="",
280
+ interactive=False,
281
+ lines=8
282
+ )
283
+
284
+ gr.Markdown("### Data Preview")
285
+ data_preview = gr.Dataframe(
286
+ label="First 100 rows",
287
+ interactive=False,
288
+ wrap=True
289
+ )
290
+
291
+ # Event handlers
292
+ def load_predefined_dataset(dataset_name):
293
+ if not dataset_name:
294
+ return None, "⚠️ Please select a dataset", "", None
295
+
296
+ df, status = app_state.load_dataset(dataset_name)
297
+
298
+ if df is not None:
299
+ info = f"📊 **Dataset**: {dataset_name}\n"
300
+ info += f"📏 **Shape**: {df.shape[0]} rows × {df.shape[1]} columns\n"
301
+ info += f"💾 **Memory**: {df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB\n\n"
302
+ info += f"**Column Types**:\n"
303
+
304
+ col_types = get_column_types(df)
305
+ info += f"- Numerical: {len(col_types['numerical'])}\n"
306
+ info += f"- Categorical: {len(col_types['categorical'])}\n"
307
+ info += f"- DateTime: {len(col_types['datetime'])}\n"
308
+
309
+ preview = df.head(100)
310
+ return dataset_name, status, info, preview
311
+
312
+ return None, status, "", None
313
+
314
+ def upload_custom_dataset(file):
315
+ if file is None:
316
+ return "⚠️ Please upload a file", "", None
317
+
318
+ # Check file size (50MB limit)
319
+ file_size_mb = Path(file.name).stat().st_size / (1024 * 1024)
320
+ if file_size_mb > 50:
321
+ return f"❌ File too large ({file_size_mb:.1f}MB). Maximum size: 50MB", "", None
322
+
323
+ df, status = app_state.load_dataset("uploaded", file.name)
324
+
325
+ if df is not None:
326
+ info = f"📊 **Dataset**: {Path(file.name).name}\n"
327
+ info += f"📏 **Shape**: {df.shape[0]} rows × {df.shape[1]} columns\n"
328
+ info += f"💾 **Memory**: {df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB\n\n"
329
+ info += f"**Column Types**:\n"
330
+
331
+ col_types = get_column_types(df)
332
+ info += f"- Numerical: {len(col_types['numerical'])}\n"
333
+ info += f"- Categorical: {len(col_types['categorical'])}\n"
334
+ info += f"- DateTime: {len(col_types['datetime'])}\n"
335
+
336
+ preview = df.head(100)
337
+ return status, info, preview
338
+
339
+ return status, "", None
340
+
341
+ load_btn.click(
342
+ fn=load_predefined_dataset,
343
+ inputs=[dataset_dropdown],
344
+ outputs=[dataset_dropdown, status_box, dataset_info, data_preview]
345
+ )
346
+
347
+ upload_btn.click(
348
+ fn=upload_custom_dataset,
349
+ inputs=[file_upload],
350
+ outputs=[status_box, dataset_info, data_preview]
351
+ )
352
+
353
+ return dataset_dropdown, status_box, dataset_info, data_preview, load_btn, upload_btn
354
+
355
+
356
+ # ============================================================================
357
+ # TAB 2: STATISTICS & PROFILING
358
+ # ============================================================================
359
+
360
+ def create_statistics_tab():
361
+ """Create statistics and data profiling tab."""
362
+
363
+ with gr.Tab("📈 Statistics & Profiling"):
364
+ gr.Markdown("## Data Profiling & Summary Statistics")
365
+
366
+ profile_btn = gr.Button("🔍 Generate Data Profile", variant="primary")
367
+
368
+ with gr.Row():
369
+ with gr.Column():
370
+ gr.Markdown("### Missing Values Report")
371
+ missing_values = gr.Dataframe(label="Missing Values")
372
+
373
+ with gr.Column():
374
+ gr.Markdown("### Numerical Summary")
375
+ numerical_summary = gr.Dataframe(label="Descriptive Statistics")
376
+
377
+ gr.Markdown("### Categorical Summary")
378
+ categorical_summary = gr.Textbox(
379
+ label="Categorical Variables",
380
+ lines=10,
381
+ interactive=False
382
+ )
383
+
384
+ gr.Markdown("### Correlation Matrix")
385
+ correlation_plot = gr.Plot(label="Correlation Heatmap")
386
+
387
+ def generate_profile():
388
+ if app_state.current_df is None:
389
+ return (
390
+ None, None, "⚠️ No dataset loaded. Please load a dataset first.", None
391
+ )
392
+
393
+ try:
394
+ profile = app_state.processor.get_data_profile()
395
+
396
+ # Missing values
397
+ missing_df = profile['missing_values']
398
+
399
+ # Numerical summary
400
+ num_summary = profile['numerical_summary']
401
+
402
+ # Categorical summary - FIXED
403
+ cat_summary = profile['categorical_summary']
404
+ cat_text = ""
405
+ for col, stats in cat_summary.items():
406
+ cat_text += f"\n**{col}**:\n"
407
+ cat_text += f" - Unique values: {stats['unique_count']}\n"
408
+
409
+ # Safe handling of top_value
410
+ top_val = stats.get('top_value', 'N/A')
411
+ if pd.isna(top_val):
412
+ top_val = 'N/A'
413
+ cat_text += f" - Most common: {top_val} ({stats['top_value_frequency']} occurrences)\n"
414
+
415
+ # Safe handling of value_counts
416
+ if stats.get('value_counts'):
417
+ top_values = list(stats['value_counts'].keys())[:5]
418
+ cat_text += f" - Top values: {', '.join(str(v) for v in top_values)}\n"
419
+
420
+ if not cat_text:
421
+ cat_text = "No categorical columns found."
422
+
423
+ # Correlation matrix
424
+ corr_matrix = profile['correlation_matrix']
425
+
426
+ if not corr_matrix.empty and len(corr_matrix.columns) >= 2:
427
+ fig = app_state.viz_manager.create_visualization(
428
+ 'correlation',
429
+ app_state.current_df,
430
+ backend='matplotlib'
431
+ )
432
+ else:
433
+ fig = None
434
+
435
+ return missing_df, num_summary, cat_text, fig
436
+
437
+ except Exception as e:
438
+ logger.error(f"Error generating profile: {e}")
439
+ import traceback
440
+ traceback.print_exc()
441
+ return None, None, f"❌ Error: {str(e)}", None
442
+
443
+ profile_btn.click(
444
+ fn=generate_profile,
445
+ outputs=[missing_values, numerical_summary, categorical_summary, correlation_plot]
446
+ )
447
+
448
+ return profile_btn, missing_values, numerical_summary, categorical_summary, correlation_plot
449
+
450
+
451
+ # ============================================================================
452
+ # TAB 3: FILTER & EXPLORE
453
+ # ============================================================================
454
+
455
+ def create_filter_tab():
456
+ """Create interactive filtering tab."""
457
+
458
+ with gr.Tab("🔍 Filter & Explore"):
459
+ gr.Markdown("## Interactive Data Filtering")
460
+ gr.Markdown("Apply filters to narrow down your data for analysis")
461
+
462
+ with gr.Row():
463
+ with gr.Column(scale=1):
464
+ gr.Markdown("### Filter Controls")
465
+
466
+ filter_type = gr.Radio(
467
+ choices=["Numerical Range", "Categorical Values", "Date Range"],
468
+ label="Filter Type",
469
+ value="Numerical Range"
470
+ )
471
+
472
+ column_select = gr.Dropdown(
473
+ label="Select Column",
474
+ choices=[],
475
+ interactive=True
476
+ )
477
+
478
+ # Numerical filters
479
+ with gr.Group(visible=True) as numerical_group:
480
+ min_value = gr.Number(label="Minimum Value")
481
+ max_value = gr.Number(label="Maximum Value")
482
+
483
+ # Categorical filters
484
+ with gr.Group(visible=False) as categorical_group:
485
+ category_select = gr.CheckboxGroup(
486
+ label="Select Values",
487
+ choices=[]
488
+ )
489
+
490
+ # Date filters
491
+ with gr.Group(visible=False) as date_group:
492
+ start_date = gr.Textbox(label="Start Date (YYYY-MM-DD)")
493
+ end_date = gr.Textbox(label="End Date (YYYY-MM-DD)")
494
+
495
+ add_filter_btn = gr.Button("➕ Add Filter", variant="primary")
496
+ clear_filters_btn = gr.Button("🗑️ Clear All Filters", variant="secondary")
497
+
498
+ with gr.Column(scale=2):
499
+ filter_status = gr.Textbox(
500
+ label="Active Filters",
501
+ value="No filters applied",
502
+ lines=5,
503
+ interactive=False
504
+ )
505
+
506
+ row_count = gr.Textbox(
507
+ label="Filtered Row Count",
508
+ value="0 rows",
509
+ interactive=False
510
+ )
511
+
512
+ filtered_preview = gr.Dataframe(
513
+ label="Filtered Data Preview",
514
+ interactive=False
515
+ )
516
+
517
+ def update_column_choices(filter_type_value):
518
+ if app_state.current_df is None:
519
+ return gr.Dropdown(choices=[]), gr.Group(visible=False), gr.Group(visible=False), gr.Group(visible=False)
520
+
521
+ col_types = get_column_types(app_state.current_df)
522
+
523
+ if filter_type_value == "Numerical Range":
524
+ choices = col_types['numerical']
525
+ return (
526
+ gr.Dropdown(choices=choices),
527
+ gr.Group(visible=True),
528
+ gr.Group(visible=False),
529
+ gr.Group(visible=False)
530
+ )
531
+ elif filter_type_value == "Categorical Values":
532
+ choices = col_types['categorical']
533
+ return (
534
+ gr.Dropdown(choices=choices),
535
+ gr.Group(visible=False),
536
+ gr.Group(visible=True),
537
+ gr.Group(visible=False)
538
+ )
539
+ else: # Date Range
540
+ choices = col_types['datetime']
541
+ return (
542
+ gr.Dropdown(choices=choices),
543
+ gr.Group(visible=False),
544
+ gr.Group(visible=False),
545
+ gr.Group(visible=True)
546
+ )
547
+
548
+ def update_category_choices(column):
549
+ if app_state.current_df is None or not column:
550
+ return gr.CheckboxGroup(choices=[])
551
+
552
+ unique_values = app_state.current_df[column].dropna().unique().tolist()
553
+ return gr.CheckboxGroup(choices=unique_values[:50]) # Limit to 50 for performance
554
+
555
+ def add_filter(filter_type_value, column, min_val, max_val, categories, start, end):
556
+ if app_state.current_df is None:
557
+ return "⚠️ No dataset loaded", "0 rows", None
558
+
559
+ if not column:
560
+ return "⚠️ Please select a column", f"{len(app_state.filtered_df)} rows", app_state.filtered_df.head(100)
561
+
562
+ # Create filter configuration
563
+ filter_config = {'column': column}
564
+
565
+ if filter_type_value == "Numerical Range":
566
+ filter_config['type'] = 'numerical'
567
+ filter_config['min_val'] = min_val
568
+ filter_config['max_val'] = max_val
569
+ elif filter_type_value == "Categorical Values":
570
+ filter_config['type'] = 'categorical'
571
+ filter_config['values'] = categories if categories else []
572
+ else: # Date Range
573
+ filter_config['type'] = 'date'
574
+ filter_config['start_date'] = start if start else None
575
+ filter_config['end_date'] = end if end else None
576
+
577
+ # Add to active filters
578
+ app_state.active_filters.append(filter_config)
579
+
580
+ # Apply all filters
581
+ filtered_df = app_state.apply_filters(app_state.active_filters)
582
+
583
+ # Generate status message
584
+ status = "**Active Filters:**\n"
585
+ for i, f in enumerate(app_state.active_filters, 1):
586
+ status += f"{i}. {f['column']} ({f['type']})\n"
587
+
588
+ row_info = f"{len(filtered_df)} rows (filtered from {len(app_state.current_df)})"
589
+
590
+ return status, row_info, filtered_df.head(100)
591
+
592
+ def clear_all_filters():
593
+ if app_state.current_df is None:
594
+ return "No filters applied", "0 rows", None
595
+
596
+ app_state.reset_filters()
597
+ row_info = f"{len(app_state.current_df)} rows"
598
+
599
+ return "No filters applied", row_info, app_state.current_df.head(100)
600
+
601
+ # Event handlers
602
+ filter_type.change(
603
+ fn=update_column_choices,
604
+ inputs=[filter_type],
605
+ outputs=[column_select, numerical_group, categorical_group, date_group]
606
+ )
607
+
608
+ column_select.change(
609
+ fn=update_category_choices,
610
+ inputs=[column_select],
611
+ outputs=[category_select]
612
+ )
613
+
614
+ add_filter_btn.click(
615
+ fn=add_filter,
616
+ inputs=[filter_type, column_select, min_value, max_value, category_select, start_date, end_date],
617
+ outputs=[filter_status, row_count, filtered_preview]
618
+ )
619
+
620
+ clear_filters_btn.click(
621
+ fn=clear_all_filters,
622
+ outputs=[filter_status, row_count, filtered_preview]
623
+ )
624
+
625
+ return (filter_type, column_select, filter_status, row_count, filtered_preview,
626
+ add_filter_btn, clear_filters_btn)
627
+
628
+
629
+ # ============================================================================
630
+ # TAB 4: VISUALIZATIONS
631
+ # ============================================================================
632
+
633
+ def create_visualization_tab():
634
+ """Create visualization tab with smart recommendations."""
635
+
636
+ with gr.Tab("📉 Visualizations"):
637
+ gr.Markdown("## Create Visualizations")
638
+
639
+ # Smart Recommendations Section
640
+ with gr.Accordion("🎯 Smart Recommendations", open=True):
641
+ recommend_btn = gr.Button("💡 Get Visualization Recommendations", variant="primary")
642
+ recommendations_output = gr.Markdown(value="Click the button to get recommendations")
643
+
644
+ # Dynamic recommendation buttons
645
+ with gr.Row(visible=False) as rec_buttons_row:
646
+ rec_btn_1 = gr.Button("", visible=False, variant="secondary", scale=1)
647
+ rec_btn_2 = gr.Button("", visible=False, variant="secondary", scale=1)
648
+ rec_btn_3 = gr.Button("", visible=False, variant="secondary", scale=1)
649
+
650
+ rec_viz_output = gr.Plot(label="Recommended Visualization", visible=False)
651
+ rec_status = gr.Textbox(label="Status", visible=False, interactive=False)
652
+
653
+ def get_recommendations():
654
+ if app_state.filtered_df is None or app_state.filtered_df.empty:
655
+ return "⚠️ No data available. Please load a dataset first.", gr.Row(visible=False), "", "", "", gr.Plot(visible=False), gr.Textbox(visible=False)
656
+
657
+ recommender = SmartVisualizationRecommender()
658
+ analysis = recommender.analyze_dataset(app_state.filtered_df)
659
+ app_state.current_recommendations = analysis['recommendations']
660
+
661
+ output = "## 🎯 Recommended Visualizations\n\n"
662
+ output += analysis['summary'] + "\n\n"
663
+
664
+ output += "### Click below to create recommended visualizations:\n\n"
665
+
666
+ # Prepare button labels
667
+ btn_labels = ["", "", ""]
668
+ for i, rec in enumerate(analysis['recommendations'][:3]):
669
+ priority_emoji = "🔴" if rec['priority'] == 'high' else "🟡"
670
+ btn_labels[i] = f"{priority_emoji} Create {rec['type'].replace('_', ' ').title()}"
671
+
672
+ return (
673
+ output,
674
+ gr.Row(visible=True),
675
+ gr.Button(value=btn_labels[0], visible=True) if btn_labels[0] else gr.Button(visible=False),
676
+ gr.Button(value=btn_labels[1], visible=True) if btn_labels[1] else gr.Button(visible=False),
677
+ gr.Button(value=btn_labels[2], visible=True) if btn_labels[2] else gr.Button(visible=False),
678
+ gr.Plot(visible=False),
679
+ gr.Textbox(visible=False)
680
+ )
681
+
682
+ def create_recommended_viz(rec_index):
683
+ if app_state.current_recommendations is None or rec_index >= len(app_state.current_recommendations):
684
+ return None, "⚠️ No recommendation available"
685
+
686
+ rec = app_state.current_recommendations[rec_index]
687
+
688
+ try:
689
+ if rec['type'] == 'time_series':
690
+ params = rec['suggested_params']
691
+ fig = app_state.viz_manager.create_visualization(
692
+ 'time_series',
693
+ app_state.filtered_df,
694
+ date_column=params['date_column'],
695
+ value_column=params['value_column'],
696
+ aggregation=params['aggregation'],
697
+ backend='matplotlib'
698
+ )
699
+ status = f"✅ Created recommended time series plot"
700
+
701
+ elif rec['type'] == 'correlation':
702
+ fig = app_state.viz_manager.create_visualization(
703
+ 'correlation',
704
+ app_state.filtered_df,
705
+ backend='matplotlib'
706
+ )
707
+ status = "✅ Created recommended correlation heatmap"
708
+
709
+ elif rec['type'] == 'category':
710
+ params = rec['suggested_params']
711
+ fig = app_state.viz_manager.create_visualization(
712
+ 'category',
713
+ app_state.filtered_df,
714
+ column=params['column'],
715
+ plot_type=params['plot_type'],
716
+ backend='matplotlib'
717
+ )
718
+ status = f"✅ Created recommended category plot"
719
+
720
+ elif rec['type'] == 'distribution':
721
+ params = rec['suggested_params']
722
+ fig = app_state.viz_manager.create_visualization(
723
+ 'distribution',
724
+ app_state.filtered_df,
725
+ column=params['column'],
726
+ plot_type=params['plot_type'],
727
+ backend='matplotlib'
728
+ )
729
+ status = "✅ Created recommended distribution plot"
730
+
731
+ elif rec['type'] == 'scatter':
732
+ params = rec['suggested_params']
733
+ fig = app_state.viz_manager.create_visualization(
734
+ 'scatter',
735
+ app_state.filtered_df,
736
+ x_column=params['x_column'],
737
+ y_column=params['y_column'],
738
+ backend='matplotlib'
739
+ )
740
+ status = "✅ Created recommended scatter plot"
741
+ else:
742
+ return None, "❌ Unknown recommendation type"
743
+
744
+ return gr.Plot(value=fig, visible=True), gr.Textbox(value=status, visible=True)
745
+
746
+ except Exception as e:
747
+ logger.error(f"Error creating recommended visualization: {e}")
748
+ return None, gr.Textbox(value=f"❌ Error: {str(e)}", visible=True)
749
+
750
+ recommend_btn.click(
751
+ fn=get_recommendations,
752
+ outputs=[recommendations_output, rec_buttons_row, rec_btn_1, rec_btn_2, rec_btn_3, rec_viz_output, rec_status]
753
+ )
754
+
755
+ rec_btn_1.click(
756
+ fn=lambda: create_recommended_viz(0),
757
+ outputs=[rec_viz_output, rec_status]
758
+ )
759
+
760
+ rec_btn_2.click(
761
+ fn=lambda: create_recommended_viz(1),
762
+ outputs=[rec_viz_output, rec_status]
763
+ )
764
+
765
+ rec_btn_3.click(
766
+ fn=lambda: create_recommended_viz(2),
767
+ outputs=[rec_viz_output, rec_status]
768
+ )
769
+
770
+ gr.Markdown("---")
771
+ gr.Markdown("### Create Custom Visualization")
772
+
773
+ with gr.Row():
774
+ with gr.Column(scale=1):
775
+ viz_type = gr.Dropdown(
776
+ label="Visualization Type",
777
+ choices=[
778
+ "Time Series",
779
+ "Distribution (Histogram)",
780
+ "Distribution (Box Plot)",
781
+ "Category (Bar Chart)",
782
+ "Category (Pie Chart)",
783
+ "Scatter Plot",
784
+ "Correlation Heatmap"
785
+ ],
786
+ value="Time Series"
787
+ )
788
+
789
+ # Dynamic parameter inputs
790
+ with gr.Group() as time_series_group:
791
+ ts_date_col = gr.Dropdown(label="Date Column", choices=[])
792
+ ts_value_col = gr.Dropdown(label="Value Column", choices=[])
793
+ ts_agg = gr.Dropdown(
794
+ label="Aggregation",
795
+ choices=["sum", "mean", "count", "median"],
796
+ value="sum"
797
+ )
798
+
799
+ with gr.Group(visible=False) as distribution_group:
800
+ dist_col = gr.Dropdown(label="Column", choices=[])
801
+ dist_bins = gr.Slider(label="Number of Bins", minimum=10, maximum=100, value=30, step=5)
802
+
803
+ with gr.Group(visible=False) as category_group:
804
+ cat_col = gr.Dropdown(label="Category Column", choices=[])
805
+ cat_value_col = gr.Dropdown(label="Value Column (optional)", choices=[])
806
+ cat_agg = gr.Dropdown(
807
+ label="Aggregation",
808
+ choices=["count", "sum", "mean", "median"],
809
+ value="count"
810
+ )
811
+ cat_top_n = gr.Slider(label="Top N Categories", minimum=5, maximum=20, value=10, step=1)
812
+
813
+ with gr.Group(visible=False) as scatter_group:
814
+ scatter_x = gr.Dropdown(label="X Column", choices=[])
815
+ scatter_y = gr.Dropdown(label="Y Column", choices=[])
816
+ scatter_color = gr.Dropdown(label="Color by (optional)", choices=[])
817
+ scatter_trend = gr.Checkbox(label="Show Trend Line", value=False)
818
+
819
+ with gr.Group(visible=False) as correlation_group:
820
+ corr_method = gr.Dropdown(
821
+ label="Correlation Method",
822
+ choices=["pearson", "spearman", "kendall"],
823
+ value="pearson"
824
+ )
825
+
826
+ create_viz_btn = gr.Button("📊 Create Visualization", variant="primary")
827
+
828
+ with gr.Column(scale=2):
829
+ viz_output = gr.Plot(label="Visualization")
830
+ viz_status = gr.Textbox(label="Status", lines=2, interactive=False)
831
+
832
+ def update_viz_controls(viz_type_value):
833
+ if app_state.filtered_df is None:
834
+ return [gr.Group(visible=False)] * 5 + [gr.Dropdown(choices=[])] * 8
835
+
836
+ col_types = get_column_types(app_state.filtered_df)
837
+
838
+ # FIXED: Return format with value=None to force refresh
839
+ # [5 Groups] + [8 Dropdowns]
840
+ # Groups: time_series_group, distribution_group, category_group, scatter_group, correlation_group
841
+ # Dropdowns: ts_date_col, ts_value_col, dist_col, cat_col, cat_value_col, scatter_x, scatter_y, scatter_color
842
+
843
+ if viz_type_value == "Time Series":
844
+ return (
845
+ gr.Group(visible=True), # time_series_group
846
+ gr.Group(visible=False), # distribution_group
847
+ gr.Group(visible=False), # category_group
848
+ gr.Group(visible=False), # scatter_group
849
+ gr.Group(visible=False), # correlation_group
850
+ gr.Dropdown(choices=col_types['datetime'], value=None), # ts_date_col
851
+ gr.Dropdown(choices=col_types['numerical'], value=None), # ts_value_col
852
+ gr.Dropdown(choices=col_types['numerical'], value=None), # dist_col
853
+ gr.Dropdown(choices=col_types['categorical'], value=None), # cat_col
854
+ gr.Dropdown(choices=col_types['numerical'], value=None), # cat_value_col
855
+ gr.Dropdown(choices=col_types['numerical'], value=None), # scatter_x
856
+ gr.Dropdown(choices=col_types['numerical'], value=None), # scatter_y
857
+ gr.Dropdown(choices=col_types['categorical'] + col_types['numerical'], value=None) # scatter_color
858
+ )
859
+
860
+ elif "Distribution" in viz_type_value:
861
+ return (
862
+ gr.Group(visible=False),
863
+ gr.Group(visible=True),
864
+ gr.Group(visible=False),
865
+ gr.Group(visible=False),
866
+ gr.Group(visible=False),
867
+ gr.Dropdown(choices=col_types['datetime'], value=None),
868
+ gr.Dropdown(choices=col_types['numerical'], value=None),
869
+ gr.Dropdown(choices=col_types['numerical'], value=None), # dist_col - visible
870
+ gr.Dropdown(choices=col_types['categorical'], value=None),
871
+ gr.Dropdown(choices=col_types['numerical'], value=None),
872
+ gr.Dropdown(choices=col_types['numerical'], value=None),
873
+ gr.Dropdown(choices=col_types['numerical'], value=None),
874
+ gr.Dropdown(choices=col_types['categorical'] + col_types['numerical'], value=None)
875
+ )
876
+
877
+ elif "Category" in viz_type_value:
878
+ return (
879
+ gr.Group(visible=False),
880
+ gr.Group(visible=False),
881
+ gr.Group(visible=True),
882
+ gr.Group(visible=False),
883
+ gr.Group(visible=False),
884
+ gr.Dropdown(choices=col_types['datetime'], value=None),
885
+ gr.Dropdown(choices=col_types['numerical'], value=None),
886
+ gr.Dropdown(choices=col_types['numerical'], value=None),
887
+ gr.Dropdown(choices=col_types['categorical'], value=None), # cat_col - visible
888
+ gr.Dropdown(choices=col_types['numerical'], value=None), # cat_value_col - visible
889
+ gr.Dropdown(choices=col_types['numerical'], value=None),
890
+ gr.Dropdown(choices=col_types['numerical'], value=None),
891
+ gr.Dropdown(choices=col_types['categorical'] + col_types['numerical'], value=None)
892
+ )
893
+
894
+ elif viz_type_value == "Scatter Plot":
895
+ return (
896
+ gr.Group(visible=False),
897
+ gr.Group(visible=False),
898
+ gr.Group(visible=False),
899
+ gr.Group(visible=True),
900
+ gr.Group(visible=False),
901
+ gr.Dropdown(choices=col_types['datetime'], value=None),
902
+ gr.Dropdown(choices=col_types['numerical'], value=None),
903
+ gr.Dropdown(choices=col_types['numerical'], value=None),
904
+ gr.Dropdown(choices=col_types['categorical'], value=None),
905
+ gr.Dropdown(choices=col_types['numerical'], value=None),
906
+ gr.Dropdown(choices=col_types['numerical'], value=None), # scatter_x - visible
907
+ gr.Dropdown(choices=col_types['numerical'], value=None), # scatter_y - visible
908
+ gr.Dropdown(choices=col_types['categorical'] + col_types['numerical'], value=None) # scatter_color - visible
909
+ )
910
+
911
+ else: # Correlation Heatmap
912
+ return (
913
+ gr.Group(visible=False),
914
+ gr.Group(visible=False),
915
+ gr.Group(visible=False),
916
+ gr.Group(visible=False),
917
+ gr.Group(visible=True),
918
+ gr.Dropdown(choices=col_types['datetime'], value=None),
919
+ gr.Dropdown(choices=col_types['numerical'], value=None),
920
+ gr.Dropdown(choices=col_types['numerical'], value=None),
921
+ gr.Dropdown(choices=col_types['categorical'], value=None),
922
+ gr.Dropdown(choices=col_types['numerical'], value=None),
923
+ gr.Dropdown(choices=col_types['numerical'], value=None),
924
+ gr.Dropdown(choices=col_types['numerical'], value=None),
925
+ gr.Dropdown(choices=col_types['categorical'] + col_types['numerical'], value=None)
926
+ )
927
+
928
+ def create_visualization(viz_type_value, date_col, value_col, agg,
929
+ dist_column, bins, cat_column, cat_value, cat_aggregation, top_n,
930
+ x_col, y_col, color_col, trend, corr_method_value):
931
+ if app_state.filtered_df is None or app_state.filtered_df.empty:
932
+ return None, "⚠️ No data available"
933
+
934
+ try:
935
+ if viz_type_value == "Time Series":
936
+ if not date_col or not value_col:
937
+ return None, "⚠️ Please select date and value columns"
938
+
939
+ fig = app_state.viz_manager.create_visualization(
940
+ 'time_series',
941
+ app_state.filtered_df,
942
+ date_column=date_col,
943
+ value_column=value_col,
944
+ aggregation=agg,
945
+ backend='matplotlib'
946
+ )
947
+ status = f"✅ Created time series plot: {value_col} over {date_col}"
948
+
949
+ elif "Distribution" in viz_type_value:
950
+ if not dist_column:
951
+ return None, "⚠️ Please select a column"
952
+
953
+ plot_type = 'histogram' if 'Histogram' in viz_type_value else 'box'
954
+
955
+ fig = app_state.viz_manager.create_visualization(
956
+ 'distribution',
957
+ app_state.filtered_df,
958
+ column=dist_column,
959
+ plot_type=plot_type,
960
+ bins=int(bins),
961
+ backend='matplotlib'
962
+ )
963
+ status = f"✅ Created {plot_type} plot for {dist_column}"
964
+
965
+ elif "Category" in viz_type_value:
966
+ if not cat_column:
967
+ return None, "⚠️ Please select a category column"
968
+
969
+ plot_type = 'bar' if 'Bar' in viz_type_value else 'pie'
970
+
971
+ fig = app_state.viz_manager.create_visualization(
972
+ 'category',
973
+ app_state.filtered_df,
974
+ column=cat_column,
975
+ value_column=cat_value if cat_value else None,
976
+ plot_type=plot_type,
977
+ aggregation=cat_aggregation,
978
+ top_n=int(top_n),
979
+ backend='matplotlib'
980
+ )
981
+ status = f"✅ Created {plot_type} chart for {cat_column}"
982
+
983
+ elif viz_type_value == "Scatter Plot":
984
+ if not x_col or not y_col:
985
+ return None, "⚠️ Please select X and Y columns"
986
+
987
+ fig = app_state.viz_manager.create_visualization(
988
+ 'scatter',
989
+ app_state.filtered_df,
990
+ x_column=x_col,
991
+ y_column=y_col,
992
+ color_column=color_col if color_col else None,
993
+ show_trend=trend,
994
+ backend='matplotlib'
995
+ )
996
+ status = f"✅ Created scatter plot: {y_col} vs {x_col}"
997
+
998
+ else: # Correlation Heatmap
999
+ fig = app_state.viz_manager.create_visualization(
1000
+ 'correlation',
1001
+ app_state.filtered_df,
1002
+ method=corr_method_value,
1003
+ backend='matplotlib'
1004
+ )
1005
+ status = "✅ Created correlation heatmap"
1006
+
1007
+ return fig, status
1008
+
1009
+ except Exception as e:
1010
+ logger.error(f"Error creating visualization: {e}")
1011
+ import traceback
1012
+ traceback.print_exc()
1013
+ return None, f"❌ Error: {str(e)}"
1014
+
1015
+ viz_type.change(
1016
+ fn=update_viz_controls,
1017
+ inputs=[viz_type],
1018
+ outputs=[
1019
+ time_series_group, distribution_group, category_group,
1020
+ scatter_group, correlation_group,
1021
+ ts_date_col, ts_value_col, dist_col, cat_col, cat_value_col,
1022
+ scatter_x, scatter_y, scatter_color
1023
+ ]
1024
+ )
1025
+
1026
+ create_viz_btn.click(
1027
+ fn=create_visualization,
1028
+ inputs=[
1029
+ viz_type, ts_date_col, ts_value_col, ts_agg,
1030
+ dist_col, dist_bins, cat_col, cat_value_col, cat_agg, cat_top_n,
1031
+ scatter_x, scatter_y, scatter_color, scatter_trend, corr_method
1032
+ ],
1033
+ outputs=[viz_output, viz_status]
1034
+ )
1035
+
1036
+ return (viz_type, recommend_btn, recommendations_output, rec_buttons_row,
1037
+ rec_btn_1, rec_btn_2, rec_btn_3, rec_viz_output, rec_status,
1038
+ viz_output, viz_status, create_viz_btn)
1039
+
1040
+
1041
+ # ============================================================================
1042
+ # TAB 5: INSIGHTS
1043
+ # ============================================================================
1044
+
1045
+ def create_insights_tab():
1046
+ """Create automated insights tab."""
1047
+
1048
+ with gr.Tab("💡 Insights"):
1049
+ gr.Markdown("## Automated Insights")
1050
+ gr.Markdown("Generate intelligent insights from your data automatically")
1051
+
1052
+ with gr.Row():
1053
+ generate_all_btn = gr.Button("🚀 Generate All Insights", variant="primary", scale=2)
1054
+ generate_custom_btn = gr.Button("⚙️ Generate Custom Insight", variant="secondary", scale=1)
1055
+
1056
+ with gr.Row():
1057
+ with gr.Column(scale=1):
1058
+ gr.Markdown("### Custom Insight Options")
1059
+
1060
+ insight_type = gr.Dropdown(
1061
+ label="Insight Type",
1062
+ choices=[
1063
+ "Top/Bottom Performers",
1064
+ "Trend Analysis",
1065
+ "Anomaly Detection",
1066
+ "Distribution Analysis",
1067
+ "Correlation Analysis"
1068
+ ],
1069
+ value="Top/Bottom Performers"
1070
+ )
1071
+
1072
+ insight_column = gr.Dropdown(label="Select Column", choices=[])
1073
+ insight_column2 = gr.Dropdown(label="Second Column (for trends)", choices=[], visible=False)
1074
+
1075
+ with gr.Column(scale=2):
1076
+ insights_output = gr.Textbox(
1077
+ label="Insights Report",
1078
+ lines=20,
1079
+ interactive=False
1080
+ )
1081
+
1082
+ def update_insight_columns(insight_type_value):
1083
+ if app_state.filtered_df is None:
1084
+ return gr.Dropdown(choices=[]), gr.Dropdown(choices=[], visible=False)
1085
+
1086
+ col_types = get_column_types(app_state.filtered_df)
1087
+
1088
+ if insight_type_value == "Trend Analysis":
1089
+ return (
1090
+ gr.Dropdown(choices=col_types['datetime']),
1091
+ gr.Dropdown(choices=col_types['numerical'], visible=True)
1092
+ )
1093
+ else:
1094
+ all_cols = col_types['numerical'] + col_types['categorical']
1095
+ return (
1096
+ gr.Dropdown(choices=all_cols),
1097
+ gr.Dropdown(choices=[], visible=False)
1098
+ )
1099
+
1100
+ def generate_all_insights():
1101
+ if app_state.filtered_df is None or app_state.filtered_df.empty:
1102
+ return "⚠️ No data available. Please load a dataset first."
1103
+
1104
+ try:
1105
+ insights = app_state.insight_manager.generate_all_insights(app_state.filtered_df)
1106
+ report = app_state.insight_manager.format_insight_report(insights)
1107
+ return report
1108
+ except Exception as e:
1109
+ logger.error(f"Error generating insights: {e}")
1110
+ return f"❌ Error generating insights: {str(e)}"
1111
+
1112
+ def generate_custom_insight(insight_type_value, col1, col2):
1113
+ if app_state.filtered_df is None or app_state.filtered_df.empty:
1114
+ return "⚠️ No data available"
1115
+
1116
+ if not col1:
1117
+ return "⚠️ Please select a column"
1118
+
1119
+ try:
1120
+ if insight_type_value == "Top/Bottom Performers":
1121
+ insight = app_state.insight_manager.generate_insight(
1122
+ 'top_bottom',
1123
+ app_state.filtered_df,
1124
+ column=col1
1125
+ )
1126
+
1127
+ elif insight_type_value == "Trend Analysis":
1128
+ if not col2:
1129
+ return "⚠️ Please select both date and value columns"
1130
+
1131
+ insight = app_state.insight_manager.generate_insight(
1132
+ 'trend',
1133
+ app_state.filtered_df,
1134
+ date_column=col1,
1135
+ value_column=col2
1136
+ )
1137
+
1138
+ elif insight_type_value == "Anomaly Detection":
1139
+ insight = app_state.insight_manager.generate_insight(
1140
+ 'anomaly',
1141
+ app_state.filtered_df,
1142
+ column=col1
1143
+ )
1144
+
1145
+ elif insight_type_value == "Distribution Analysis":
1146
+ insight = app_state.insight_manager.generate_insight(
1147
+ 'distribution',
1148
+ app_state.filtered_df,
1149
+ column=col1
1150
+ )
1151
+
1152
+ else: # Correlation Analysis
1153
+ insight = app_state.insight_manager.generate_insight(
1154
+ 'correlation',
1155
+ app_state.filtered_df
1156
+ )
1157
+
1158
+ # Format single insight
1159
+ report = f"## {insight_type_value}\n\n"
1160
+ report += f"**Summary**: {insight.get('summary', 'No summary available')}\n\n"
1161
+
1162
+ return report
1163
+
1164
+ except Exception as e:
1165
+ logger.error(f"Error generating custom insight: {e}")
1166
+ return f"❌ Error: {str(e)}"
1167
+
1168
+ insight_type.change(
1169
+ fn=update_insight_columns,
1170
+ inputs=[insight_type],
1171
+ outputs=[insight_column, insight_column2]
1172
+ )
1173
+
1174
+ generate_all_btn.click(
1175
+ fn=generate_all_insights,
1176
+ outputs=[insights_output]
1177
+ )
1178
+
1179
+ generate_custom_btn.click(
1180
+ fn=generate_custom_insight,
1181
+ inputs=[insight_type, insight_column, insight_column2],
1182
+ outputs=[insights_output]
1183
+ )
1184
+
1185
+ return generate_all_btn, insight_type, insights_output
1186
+
1187
+
1188
+ # ============================================================================
1189
+ # TAB 6: EXPORT
1190
+ # ============================================================================
1191
+
1192
+ def create_export_tab():
1193
+ """Create data export tab."""
1194
+
1195
+ with gr.Tab("💾 Export"):
1196
+ gr.Markdown("## Export Data & Visualizations")
1197
+
1198
+ with gr.Row():
1199
+ with gr.Column():
1200
+ gr.Markdown("### Export Filtered Data")
1201
+ export_format = gr.Radio(
1202
+ choices=["CSV", "Excel"],
1203
+ label="Export Format",
1204
+ value="CSV"
1205
+ )
1206
+
1207
+ export_data_btn = gr.Button("📥 Export Data", variant="primary")
1208
+ export_file = gr.File(label="Download File")
1209
+ export_status = gr.Textbox(label="Status", lines=2, interactive=False)
1210
+
1211
+ with gr.Column():
1212
+ gr.Markdown("### Export Instructions")
1213
+ gr.Markdown("""
1214
+ **Export Your Data:**
1215
+ 1. Apply any filters you want in the Filter tab
1216
+ 2. Select your preferred export format
1217
+ 3. Click 'Export Data' to download
1218
+
1219
+ **Note:** The export will include only the filtered data.
1220
+ """)
1221
+
1222
+ def export_data(format_choice):
1223
+ if app_state.filtered_df is None or app_state.filtered_df.empty:
1224
+ return None, "⚠️ No data to export"
1225
+
1226
+ try:
1227
+ import tempfile
1228
+
1229
+ if format_choice == "CSV":
1230
+ temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.csv')
1231
+ exporter = CSVExporter()
1232
+ exporter.export(app_state.filtered_df, temp_file.name)
1233
+ status = f"✅ Exported {len(app_state.filtered_df)} rows to CSV"
1234
+ else: # Excel
1235
+ temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.xlsx')
1236
+ exporter = ExcelExporter()
1237
+ exporter.export(app_state.filtered_df, temp_file.name)
1238
+ status = f"✅ Exported {len(app_state.filtered_df)} rows to Excel"
1239
+
1240
+ return temp_file.name, status
1241
+
1242
+ except Exception as e:
1243
+ logger.error(f"Error exporting data: {e}")
1244
+ return None, f"❌ Error: {str(e)}"
1245
+
1246
+ export_data_btn.click(
1247
+ fn=export_data,
1248
+ inputs=[export_format],
1249
+ outputs=[export_file, export_status]
1250
+ )
1251
+
1252
+ return export_data_btn, export_file, export_status
1253
+
1254
+
1255
+ # ============================================================================
1256
+ # MAIN APPLICATION
1257
+ # ============================================================================
1258
+
1259
+ def create_dashboard():
1260
+ """Create the main Business Intelligence Dashboard."""
1261
+
1262
+ with gr.Blocks(title="Business Intelligence Dashboard") as demo:
1263
+
1264
+ # Header
1265
+ gr.Markdown("""
1266
+ # 📊 Business Intelligence Dashboard
1267
+ ### Explore, Analyze, and Extract Insights from Your Data
1268
+
1269
+ **Features:** Smart Visualizations | Automated Insights | Interactive Filtering | Data Export
1270
+ """)
1271
+
1272
+ # Create all tabs and capture their components
1273
+ with gr.Tabs():
1274
+ # Tab 1: Dataset Selection
1275
+ (dataset_dropdown, status_box, dataset_info, data_preview,
1276
+ load_btn, upload_btn) = create_dataset_tab()
1277
+
1278
+ # Tab 2: Statistics
1279
+ (profile_btn, missing_values, numerical_summary,
1280
+ categorical_summary, correlation_plot) = create_statistics_tab()
1281
+
1282
+ # Tab 3: Filter
1283
+ (filter_type, column_select, filter_status, row_count,
1284
+ filtered_preview, add_filter_btn, clear_filters_btn) = create_filter_tab()
1285
+
1286
+ # Tab 4: Visualizations
1287
+ (viz_type, recommend_btn, recommendations_output, rec_buttons_row,
1288
+ rec_btn_1, rec_btn_2, rec_btn_3, rec_viz_output, rec_status,
1289
+ viz_output, viz_status, create_viz_btn) = create_visualization_tab()
1290
+
1291
+ # Tab 5: Insights
1292
+ (generate_all_btn, insight_type, insights_output) = create_insights_tab()
1293
+
1294
+ # Tab 6: Export
1295
+ (export_btn, export_file, export_status_export) = create_export_tab()
1296
+
1297
+ # Footer
1298
+ gr.Markdown("""
1299
+ ---
1300
+ **Business Intelligence Dashboard** | Built with Gradio, Pandas, Matplotlib, and Plotly
1301
+
1302
+ *Tip: Start by loading a dataset from the Dataset Selection tab!*
1303
+ """)
1304
+
1305
+ # Connect load button to reset all tabs
1306
+ def load_and_reset(dataset_name):
1307
+ # Load dataset
1308
+ if not dataset_name:
1309
+ return (
1310
+ None, "⚠️ Please select a dataset", "", None,
1311
+ None, None, "", None,
1312
+ "No filters applied", "0 rows", None,
1313
+ "Click the button to get recommendations",
1314
+ None, None,
1315
+ ""
1316
+ )
1317
+
1318
+ df, status = app_state.load_dataset(dataset_name)
1319
+
1320
+ if df is not None:
1321
+ info = f"📊 **Dataset**: {dataset_name}\n"
1322
+ info += f"📏 **Shape**: {df.shape[0]} rows × {df.shape[1]} columns\n"
1323
+ info += f"💾 **Memory**: {df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB\n\n"
1324
+ info += f"**Column Types**:\n"
1325
+
1326
+ col_types = get_column_types(df)
1327
+ info += f"- Numerical: {len(col_types['numerical'])}\n"
1328
+ info += f"- Categorical: {len(col_types['categorical'])}\n"
1329
+ info += f"- DateTime: {len(col_types['datetime'])}\n"
1330
+
1331
+ preview = df.head(100)
1332
+
1333
+ return (
1334
+ dataset_name, status, info, preview,
1335
+ None, None, "", None,
1336
+ "No filters applied", "0 rows", None,
1337
+ "Click the button to get recommendations",
1338
+ None, None,
1339
+ ""
1340
+ )
1341
+
1342
+ return (
1343
+ None, status, "", None,
1344
+ None, None, "", None,
1345
+ "No filters applied", "0 rows", None,
1346
+ "Click the button to get recommendations",
1347
+ None, None,
1348
+ ""
1349
+ )
1350
+
1351
+ load_btn.click(
1352
+ fn=load_and_reset,
1353
+ inputs=[dataset_dropdown],
1354
+ outputs=[
1355
+ dataset_dropdown, status_box, dataset_info, data_preview,
1356
+ missing_values, numerical_summary, categorical_summary, correlation_plot,
1357
+ filter_status, row_count, filtered_preview,
1358
+ recommendations_output,
1359
+ viz_output, viz_status,
1360
+ insights_output
1361
+ ]
1362
+ )
1363
+
1364
+ return demo
1365
+
1366
+
1367
+ # ============================================================================
1368
+ # LAUNCH APPLICATION
1369
+ # ============================================================================
1370
+
1371
+ if __name__ == "__main__":
1372
+ logger.info("Starting Business Intelligence Dashboard...")
1373
+
1374
+ # Create and launch dashboard
1375
+ demo = create_dashboard()
1376
+ demo.launch(
1377
+ share=False,
1378
+ show_error=True
1379
+ )
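+
+ # Launch-option sketch (standard Gradio `launch` parameters, shown only as a
+ # commented alternative to the call above), e.g. for running inside a container
+ # or on a remote host:
+ #     demo.launch(server_name="0.0.0.0", server_port=7860, show_error=True)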
data/.DS_Store ADDED
Binary file (6.15 kB)
 
data/Airbnb.csv ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ecb59a7598d2aaf7dc2ed00c724648a319d93a916a3d4767e2bed0dbe0f1a7f8
3
+ size 35913454
data/Online_Retail.xlsx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:43465a06f2ccf7c8b5bd2892bc7defb52f97487934fe93b16ae4c3936424676d
3
+ size 23715344
data_processor.py ADDED
@@ -0,0 +1,819 @@
1
+ """
2
+ Data Processor Module for Business Intelligence Dashboard
3
+
4
+ This module handles all data loading, cleaning, validation, and filtering operations.
5
+ Implements SOLID principles with Strategy Pattern for flexible data processing.
6
+
7
+ Author: Craig
8
+ Date: December 2024
9
+ """
10
+
11
+ import pandas as pd
12
+ import numpy as np
13
+ from pathlib import Path
14
+ from typing import Union, List, Dict, Optional, Any, Tuple
15
+ from abc import ABC, abstractmethod
16
+ import logging
17
+ from datetime import datetime
18
+
19
+ from utils import (
20
+ FileValidator, DataFrameValidator, ColumnValidator,
21
+ get_column_types, detect_date_columns, clean_currency_column,
22
+ Config
23
+ )
24
+
25
+ # Configure logging
26
+ logging.basicConfig(level=logging.INFO)
27
+ logger = logging.getLogger(__name__)
28
+
29
+
30
+ # ============================================================================
31
+ # STRATEGY PATTERN - Data Loading Strategies
32
+ # Follows Open/Closed Principle and Strategy Pattern
33
+ # ============================================================================
34
+
35
+ class DataLoadStrategy(ABC):
36
+ """
37
+ Abstract base class for data loading strategies.
38
+ Follows Strategy Pattern - allows different loading algorithms to be selected at runtime.
39
+ """
40
+
41
+ @abstractmethod
42
+ def load(self, filepath: Union[str, Path]) -> pd.DataFrame:
43
+ """
44
+ Load data from file.
45
+
46
+ Args:
47
+ filepath: Path to the data file
48
+
49
+ Returns:
50
+ pd.DataFrame: Loaded data
51
+ """
52
+ pass
53
+
54
+ @abstractmethod
55
+ def can_handle(self, filepath: Union[str, Path]) -> bool:
56
+ """
57
+ Check if this strategy can handle the given file.
58
+
59
+ Args:
60
+ filepath: Path to check
61
+
62
+ Returns:
63
+ bool: True if this strategy can handle the file
64
+ """
65
+ pass
66
+
67
+
68
+ class CSVLoadStrategy(DataLoadStrategy):
69
+ """
70
+ Strategy for loading CSV files.
71
+ Follows Single Responsibility Principle - only handles CSV loading.
72
+ """
73
+
74
+ def can_handle(self, filepath: Union[str, Path]) -> bool:
75
+ """Check if file is CSV format."""
76
+ return str(filepath).lower().endswith('.csv')
77
+
78
+ def load(self, filepath: Union[str, Path]) -> pd.DataFrame:
79
+ """
80
+ Load CSV file with automatic encoding detection.
81
+
82
+ Args:
83
+ filepath: Path to CSV file
84
+
85
+ Returns:
86
+ pd.DataFrame: Loaded data
87
+
88
+ Raises:
89
+ Exception: If loading fails
90
+ """
91
+ try:
92
+ # Try UTF-8 first
93
+ df = pd.read_csv(filepath, encoding='utf-8')
94
+ logger.info(f"Successfully loaded CSV file: {filepath}")
95
+ return df
96
+ except UnicodeDecodeError:
97
+ try:
98
+ # Fallback to latin-1
99
+ df = pd.read_csv(filepath, encoding='latin-1')
100
+ logger.info(f"Successfully loaded CSV file with latin-1 encoding: {filepath}")
101
+ return df
102
+ except Exception as e:
103
+ logger.error(f"Error loading CSV file: {e}")
104
+ raise Exception(f"Failed to load CSV file: {str(e)}")
105
+
106
+
107
+ class ExcelLoadStrategy(DataLoadStrategy):
108
+ """
109
+ Strategy for loading Excel files.
110
+ Follows Single Responsibility Principle - only handles Excel loading.
111
+ """
112
+
113
+ def can_handle(self, filepath: Union[str, Path]) -> bool:
114
+ """Check if file is Excel format."""
115
+ extension = str(filepath).lower()
116
+ return extension.endswith('.xlsx') or extension.endswith('.xls')
117
+
118
+ def load(self, filepath: Union[str, Path]) -> pd.DataFrame:
119
+ """
120
+ Load Excel file.
121
+
122
+ Args:
123
+ filepath: Path to Excel file
124
+
125
+ Returns:
126
+ pd.DataFrame: Loaded data
127
+
128
+ Raises:
129
+ Exception: If loading fails
130
+ """
131
+ try:
132
+ df = pd.read_excel(filepath, engine='openpyxl')
133
+ logger.info(f"Successfully loaded Excel file: {filepath}")
134
+ return df
135
+ except Exception as e:
136
+ logger.error(f"Error loading Excel file: {e}")
137
+ raise Exception(f"Failed to load Excel file: {str(e)}")
138
+
139
+
140
+ class JSONLoadStrategy(DataLoadStrategy):
141
+ """
142
+ Strategy for loading JSON files.
143
+ Follows Single Responsibility Principle - only handles JSON loading.
144
+ """
145
+
146
+ def can_handle(self, filepath: Union[str, Path]) -> bool:
147
+ """Check if file is JSON format."""
148
+ return str(filepath).lower().endswith('.json')
149
+
150
+ def load(self, filepath: Union[str, Path]) -> pd.DataFrame:
151
+ """
152
+ Load JSON file.
153
+
154
+ Args:
155
+ filepath: Path to JSON file
156
+
157
+ Returns:
158
+ pd.DataFrame: Loaded data
159
+
160
+ Raises:
161
+ Exception: If loading fails
162
+ """
163
+ try:
164
+ df = pd.read_json(filepath)
165
+ logger.info(f"Successfully loaded JSON file: {filepath}")
166
+ return df
167
+ except Exception as e:
168
+ logger.error(f"Error loading JSON file: {e}")
169
+ raise Exception(f"Failed to load JSON file: {str(e)}")
170
+
171
+
172
+ class ParquetLoadStrategy(DataLoadStrategy):
173
+ """
174
+ Strategy for loading Parquet files.
175
+ Follows Single Responsibility Principle - only handles Parquet loading.
176
+ """
177
+
178
+ def can_handle(self, filepath: Union[str, Path]) -> bool:
179
+ """Check if file is Parquet format."""
180
+ return str(filepath).lower().endswith('.parquet')
181
+
182
+ def load(self, filepath: Union[str, Path]) -> pd.DataFrame:
183
+ """
184
+ Load Parquet file.
185
+
186
+ Args:
187
+ filepath: Path to Parquet file
188
+
189
+ Returns:
190
+ pd.DataFrame: Loaded data
191
+
192
+ Raises:
193
+ Exception: If loading fails
194
+ """
195
+ try:
196
+ df = pd.read_parquet(filepath)
197
+ logger.info(f"Successfully loaded Parquet file: {filepath}")
198
+ return df
199
+ except Exception as e:
200
+ logger.error(f"Error loading Parquet file: {e}")
201
+ raise Exception(f"Failed to load Parquet file: {str(e)}")
202
+
203
+
204
+ # ============================================================================
205
+ # DATA LOADER CONTEXT
206
+ # Uses Strategy Pattern to select appropriate loading strategy
207
+ # ============================================================================
208
+
209
+ class DataLoader:
210
+ """
211
+ Context class for data loading using Strategy Pattern.
212
+ Automatically selects the appropriate loading strategy based on file type.
213
+ Follows Open/Closed Principle - open for extension (new strategies), closed for modification.
214
+ """
215
+
216
+ def __init__(self):
217
+ """Initialize DataLoader with all available strategies."""
218
+ self.strategies: List[DataLoadStrategy] = [
219
+ CSVLoadStrategy(),
220
+ ExcelLoadStrategy(),
221
+ JSONLoadStrategy(),
222
+ ParquetLoadStrategy()
223
+ ]
224
+ self.file_validator = FileValidator()
225
+
226
+ def load_data(self, filepath: Union[str, Path]) -> pd.DataFrame:
227
+ """
228
+ Load data using appropriate strategy.
229
+
230
+ Args:
231
+ filepath: Path to data file
232
+
233
+ Returns:
234
+ pd.DataFrame: Loaded data
235
+
236
+ Raises:
237
+ FileNotFoundError: If file doesn't exist
238
+ ValueError: If file format is not supported
239
+ Exception: If loading fails
240
+ """
241
+ # Validate file
242
+ self.file_validator.validate(filepath)
243
+
244
+ # Find appropriate strategy
245
+ for strategy in self.strategies:
246
+ if strategy.can_handle(filepath):
247
+ df = strategy.load(filepath)
248
+ logger.info(f"Loaded {len(df)} rows and {len(df.columns)} columns")
249
+ return df
250
+
251
+ # No strategy found
252
+ raise ValueError(f"No loading strategy available for file: {filepath}")
253
+
254
+ def add_strategy(self, strategy: DataLoadStrategy) -> None:
255
+ """
256
+ Add a new loading strategy.
257
+ Follows Open/Closed Principle - extend functionality without modifying existing code.
258
+
259
+ Args:
260
+ strategy: New loading strategy to add
261
+ """
262
+ self.strategies.append(strategy)
263
+ logger.info(f"Added new loading strategy: {strategy.__class__.__name__}")
264
+
265
+
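+
+ # Extension sketch for the Strategy Pattern above: a hypothetical TSVLoadStrategy
+ # (not part of this module) could be registered at runtime without modifying
+ # DataLoader. Note that FileValidator in utils may also need to accept the new
+ # extension.
+ #
+ #     class TSVLoadStrategy(DataLoadStrategy):
+ #         def can_handle(self, filepath):
+ #             return str(filepath).lower().endswith('.tsv')
+ #
+ #         def load(self, filepath):
+ #             return pd.read_csv(filepath, sep='\t')
+ #
+ #     loader = DataLoader()
+ #     loader.add_strategy(TSVLoadStrategy())
+ #     df = loader.load_data('data/example.tsv')  # hypothetical file path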
266
+ # ============================================================================
267
+ # DATA CLEANING
268
+ # Follows Single Responsibility Principle
269
+ # ============================================================================
270
+
271
+ class DataCleaner:
272
+ """
273
+ Handles data cleaning operations.
274
+ Follows Single Responsibility Principle - only responsible for cleaning data.
275
+ """
276
+
277
+ @staticmethod
278
+ def handle_missing_values(df: pd.DataFrame, strategy: str = 'none') -> pd.DataFrame:
279
+ """
280
+ Handle missing values in DataFrame.
281
+
282
+ Args:
283
+ df: DataFrame to clean
284
+ strategy: Strategy for handling missing values
285
+ 'none' - do nothing
286
+ 'drop' - drop rows with any missing values
287
+ 'fill_mean' - fill numerical columns with mean
288
+ 'fill_median' - fill numerical columns with median
289
+ 'fill_mode' - fill categorical columns with mode
290
+
291
+ Returns:
292
+ pd.DataFrame: Cleaned DataFrame
293
+ """
294
+ if strategy == 'none':
295
+ return df.copy()
296
+
297
+ df_cleaned = df.copy()
298
+
299
+ if strategy == 'drop':
300
+ df_cleaned = df_cleaned.dropna()
301
+ logger.info(f"Dropped rows with missing values. Remaining rows: {len(df_cleaned)}")
302
+
303
+ elif strategy == 'fill_mean':
304
+ numerical_cols = df_cleaned.select_dtypes(include=[np.number]).columns
305
+ for col in numerical_cols:
306
+ df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mean())
307
+ logger.info(f"Filled missing values with mean for {len(numerical_cols)} columns")
308
+
309
+ elif strategy == 'fill_median':
310
+ numerical_cols = df_cleaned.select_dtypes(include=[np.number]).columns
311
+ for col in numerical_cols:
312
+ df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].median())
313
+ logger.info(f"Filled missing values with median for {len(numerical_cols)} columns")
314
+
315
+ elif strategy == 'fill_mode':
316
+ for col in df_cleaned.columns:
317
+ if df_cleaned[col].dtype == 'object':
318
+ mode_value = df_cleaned[col].mode()
319
+ if len(mode_value) > 0:
320
+ df_cleaned[col] = df_cleaned[col].fillna(mode_value[0])
321
+ logger.info("Filled missing values with mode for categorical columns")
322
+
323
+ return df_cleaned
324
+
325
+ @staticmethod
326
+ def convert_data_types(df: pd.DataFrame) -> pd.DataFrame:
327
+ """
328
+ Automatically convert data types (dates, currencies, etc.).
329
+
330
+ Args:
331
+ df: DataFrame to convert
332
+
333
+ Returns:
334
+ pd.DataFrame: DataFrame with converted types
335
+ """
336
+ df_converted = df.copy()
337
+
338
+ # Detect and convert date columns
339
+ date_columns = detect_date_columns(df_converted)
340
+ for col in date_columns:
341
+ try:
342
+ df_converted[col] = pd.to_datetime(df_converted[col], errors='coerce')
343
+ logger.info(f"Converted column '{col}' to datetime")
344
+ except Exception as e:
345
+ logger.warning(f"Could not convert '{col}' to datetime: {e}")
346
+
347
+ # Detect and convert currency columns
348
+ for col in df_converted.select_dtypes(include=['object']).columns:
349
+ # Check if column contains currency symbols
350
+ sample = df_converted[col].dropna().head(10).astype(str)
351
+ if any(any(symbol in str(val) for symbol in ['$', '€', '£', '¥']) for val in sample):
352
+ try:
353
+ df_converted[col] = clean_currency_column(df_converted[col])
354
+ logger.info(f"Converted column '{col}' from currency to numeric")
355
+ except Exception as e:
356
+ logger.warning(f"Could not convert '{col}' from currency: {e}")
357
+
358
+ # Convert boolean strings to actual booleans
359
+ for col in df_converted.select_dtypes(include=['object']).columns:
360
+ unique_values = df_converted[col].dropna().unique()
361
+ if len(unique_values) <= 2 and all(
362
+ str(v).upper() in ['TRUE', 'FALSE', 'YES', 'NO', '0', '1'] for v in unique_values):
363
+ try:
364
+ df_converted[col] = df_converted[col].map({
365
+ 'TRUE': True, 'FALSE': False,
366
+ 'YES': True, 'NO': False, 'Yes': True, 'No': False, 'yes': True, 'no': False,
367
+ 'True': True, 'False': False,
368
+ 'true': True, 'false': False,
369
+ '1': True, '0': False
370
+ })
371
+ logger.info(f"Converted column '{col}' to boolean")
372
+ except Exception as e:
373
+ logger.warning(f"Could not convert '{col}' to boolean: {e}")
374
+
375
+ return df_converted
376
+
377
+ @staticmethod
378
+ def remove_duplicates(df: pd.DataFrame) -> pd.DataFrame:
379
+ """
380
+ Remove duplicate rows from DataFrame.
381
+
382
+ Args:
383
+ df: DataFrame to clean
384
+
385
+ Returns:
386
+ pd.DataFrame: DataFrame without duplicates
387
+ """
388
+ initial_rows = len(df)
389
+ df_cleaned = df.drop_duplicates()
390
+ removed_rows = initial_rows - len(df_cleaned)
391
+
392
+ if removed_rows > 0:
393
+ logger.info(f"Removed {removed_rows} duplicate rows")
394
+
395
+ return df_cleaned
396
+
397
+ @staticmethod
398
+ def handle_outliers(df: pd.DataFrame, columns: List[str], method: str = 'zscore',
399
+ threshold: float = 3.0) -> pd.DataFrame:
400
+ """
401
+ Handle outliers in numerical columns.
402
+
403
+ Args:
404
+ df: DataFrame to process
405
+ columns: List of columns to check for outliers
406
+ method: Method for outlier detection ('zscore' or 'iqr')
407
+ threshold: Threshold for outlier detection
408
+
409
+ Returns:
410
+ pd.DataFrame: DataFrame with outliers handled
411
+ """
412
+ df_cleaned = df.copy()
413
+
414
+ for col in columns:
415
+ if col not in df_cleaned.columns:
416
+ continue
417
+
418
+ if not pd.api.types.is_numeric_dtype(df_cleaned[col]):
419
+ continue
420
+
421
+ if method == 'zscore':
422
+ # Z-score method
423
+ z_scores = np.abs((df_cleaned[col] - df_cleaned[col].mean()) / df_cleaned[col].std())
424
+ df_cleaned = df_cleaned[z_scores < threshold]
425
+
426
+ elif method == 'iqr':
427
+ # IQR method
428
+ Q1 = df_cleaned[col].quantile(0.25)
429
+ Q3 = df_cleaned[col].quantile(0.75)
430
+ IQR = Q3 - Q1
431
+ lower_bound = Q1 - threshold * IQR
432
+ upper_bound = Q3 + threshold * IQR
433
+ df_cleaned = df_cleaned[(df_cleaned[col] >= lower_bound) & (df_cleaned[col] <= upper_bound)]
434
+
435
+ removed_rows = len(df) - len(df_cleaned)
436
+ if removed_rows > 0:
437
+ logger.info(f"Removed {removed_rows} outlier rows")
438
+
439
+ return df_cleaned
440
+
441
+
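+
+ # Usage sketch for the cleaning steps above (column name is illustrative):
+ #
+ #     cleaner = DataCleaner()
+ #     df = cleaner.convert_data_types(df)                # dates, currency, booleans
+ #     df = cleaner.remove_duplicates(df)
+ #     df = cleaner.handle_missing_values(df, strategy='fill_median')
+ #     df = cleaner.handle_outliers(df, columns=['UnitPrice'], method='iqr', threshold=1.5)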
442
+ # ============================================================================
443
+ # DATA PROFILER
444
+ # Generates comprehensive statistics about the dataset
445
+ # ============================================================================
446
+
447
+ class DataProfiler:
448
+ """
449
+ Generates comprehensive data profiling statistics.
450
+ Follows Single Responsibility Principle - only responsible for profiling.
451
+ """
452
+
453
+ def __init__(self, df: pd.DataFrame):
454
+ """
455
+ Initialize profiler with DataFrame.
456
+
457
+ Args:
458
+ df: DataFrame to profile
459
+ """
460
+ self.df = df
461
+ self.validator = DataFrameValidator()
462
+ self.validator.validate(df)
463
+
464
+ def get_basic_info(self) -> Dict[str, Any]:
465
+ """
466
+ Get basic information about the dataset.
467
+
468
+ Returns:
469
+ Dict with shape, columns, data types, and memory usage
470
+ """
471
+ return {
472
+ 'rows': len(self.df),
473
+ 'columns': len(self.df.columns),
474
+ 'column_names': self.df.columns.tolist(),
475
+ 'data_types': self.df.dtypes.to_dict(),
476
+ 'memory_usage': f"{self.df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB"
477
+ }
478
+
479
+ def get_missing_values_report(self) -> pd.DataFrame:
480
+ """
481
+ Generate report on missing values.
482
+
483
+ Returns:
484
+ DataFrame with missing value statistics per column
485
+ """
486
+ missing_data = pd.DataFrame({
487
+ 'Column': self.df.columns,
488
+ 'Missing_Count': self.df.isnull().sum().values,
489
+ 'Missing_Percentage': (self.df.isnull().sum().values / len(self.df) * 100).round(2)
490
+ })
491
+
492
+ return missing_data[missing_data['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
493
+
494
+ def get_numerical_summary(self) -> pd.DataFrame:
495
+ """
496
+ Get summary statistics for numerical columns.
497
+
498
+ Returns:
499
+ DataFrame with descriptive statistics
500
+ """
501
+ numerical_cols = self.df.select_dtypes(include=[np.number]).columns
502
+
503
+ if len(numerical_cols) == 0:
504
+ return pd.DataFrame()
505
+
506
+ return self.df[numerical_cols].describe()
507
+
508
+ def get_categorical_summary(self) -> Dict[str, Dict[str, Any]]:
509
+ """
510
+ Get summary statistics for categorical columns.
511
+
512
+ Returns:
513
+ Dict with statistics for each categorical column
514
+ """
515
+ categorical_cols = self.df.select_dtypes(include=['object', 'category']).columns
516
+
517
+ summary = {}
518
+ for col in categorical_cols:
519
+ # Get value counts, dropping NaN values
520
+ value_counts = self.df[col].value_counts()
521
+
522
+ # Safely get mode
523
+ mode_values = self.df[col].mode()
524
+ top_value = mode_values.iloc[0] if not mode_values.empty else None
525
+
526
+ # Safely get top frequency
527
+ top_freq = value_counts.iloc[0] if len(value_counts) > 0 else 0
528
+
529
+ summary[col] = {
530
+ 'unique_count': self.df[col].nunique(),
531
+ 'top_value': top_value,
532
+ 'top_value_frequency': top_freq,
533
+ 'value_counts': value_counts.head(10).to_dict()
534
+ }
535
+
536
+ return summary
537
+
538
+ def get_correlation_matrix(self) -> pd.DataFrame:
539
+ """
540
+ Get correlation matrix for numerical columns.
541
+
542
+ Returns:
543
+ Correlation matrix DataFrame
544
+ """
545
+ numerical_cols = self.df.select_dtypes(include=[np.number]).columns
546
+
547
+ if len(numerical_cols) < 2:
548
+ return pd.DataFrame()
549
+
550
+ return self.df[numerical_cols].corr()
551
+
552
+ def get_full_profile(self) -> Dict[str, Any]:
553
+ """
554
+ Get comprehensive data profile.
555
+
556
+ Returns:
557
+ Dict with all profiling information
558
+ """
559
+ return {
560
+ 'basic_info': self.get_basic_info(),
561
+ 'missing_values': self.get_missing_values_report(),
562
+ 'numerical_summary': self.get_numerical_summary(),
563
+ 'categorical_summary': self.get_categorical_summary(),
564
+ 'correlation_matrix': self.get_correlation_matrix()
565
+ }
566
+
567
+
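+
+ # Profiling sketch (works on any loaded DataFrame):
+ #
+ #     profiler = DataProfiler(df)
+ #     profile = profiler.get_full_profile()
+ #     print(profile['basic_info']['rows'], profile['basic_info']['memory_usage'])
+ #     print(profiler.get_missing_values_report())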
568
+ # ============================================================================
569
+ # DATA FILTER
570
+ # Handles interactive filtering operations
571
+ # ============================================================================
572
+
573
+ class DataFilter:
574
+ """
575
+ Handles data filtering operations.
576
+ Follows Single Responsibility Principle - only responsible for filtering.
577
+ """
578
+
579
+ @staticmethod
580
+ def filter_numerical(df: pd.DataFrame, column: str, min_val: Optional[float] = None,
581
+ max_val: Optional[float] = None) -> pd.DataFrame:
582
+ """
583
+ Filter DataFrame by numerical column range.
584
+
585
+ Args:
586
+ df: DataFrame to filter
587
+ column: Column name to filter
588
+ min_val: Minimum value (inclusive)
589
+ max_val: Maximum value (inclusive)
590
+
591
+ Returns:
592
+ Filtered DataFrame
593
+ """
594
+ ColumnValidator().validate(df, column)
595
+
596
+ filtered_df = df.copy()
597
+
598
+ if min_val is not None:
599
+ filtered_df = filtered_df[filtered_df[column] >= min_val]
600
+
601
+ if max_val is not None:
602
+ filtered_df = filtered_df[filtered_df[column] <= max_val]
603
+
604
+ logger.info(f"Filtered by {column}: {len(filtered_df)} rows remaining")
605
+ return filtered_df
606
+
607
+ @staticmethod
608
+ def filter_categorical(df: pd.DataFrame, column: str, values: List[Any]) -> pd.DataFrame:
609
+ """
610
+ Filter DataFrame by categorical column values.
611
+
612
+ Args:
613
+ df: DataFrame to filter
614
+ column: Column name to filter
615
+ values: List of values to keep
616
+
617
+ Returns:
618
+ Filtered DataFrame
619
+ """
620
+ ColumnValidator().validate(df, column)
621
+
622
+ if not values:
623
+ return df.copy()
624
+
625
+ filtered_df = df[df[column].isin(values)]
626
+ logger.info(f"Filtered by {column}: {len(filtered_df)} rows remaining")
627
+ return filtered_df
628
+
629
+ @staticmethod
630
+ def filter_date_range(df: pd.DataFrame, column: str, start_date: Optional[datetime] = None,
631
+ end_date: Optional[datetime] = None) -> pd.DataFrame:
632
+ """
633
+ Filter DataFrame by date range.
634
+
635
+ Args:
636
+ df: DataFrame to filter
637
+ column: Date column name
638
+ start_date: Start date (inclusive)
639
+ end_date: End date (inclusive)
640
+
641
+ Returns:
642
+ Filtered DataFrame
643
+ """
644
+ ColumnValidator().validate(df, column)
645
+
646
+ filtered_df = df.copy()
647
+
648
+ # Ensure column is datetime
649
+ if not pd.api.types.is_datetime64_any_dtype(filtered_df[column]):
650
+ filtered_df[column] = pd.to_datetime(filtered_df[column], errors='coerce')
651
+
652
+ if start_date is not None:
653
+ filtered_df = filtered_df[filtered_df[column] >= start_date]
654
+
655
+ if end_date is not None:
656
+ filtered_df = filtered_df[filtered_df[column] <= end_date]
657
+
658
+ logger.info(f"Filtered by date range on {column}: {len(filtered_df)} rows remaining")
659
+ return filtered_df
660
+
661
+ @staticmethod
662
+ def apply_multiple_filters(df: pd.DataFrame, filters: List[Dict[str, Any]]) -> pd.DataFrame:
663
+ """
664
+ Apply multiple filters sequentially.
665
+
666
+ Args:
667
+ df: DataFrame to filter
668
+ filters: List of filter dictionaries with keys:
669
+ - 'type': 'numerical', 'categorical', or 'date'
670
+ - 'column': column name
671
+ - other keys depending on filter type
672
+
673
+ Returns:
674
+ Filtered DataFrame
675
+ """
676
+ filtered_df = df.copy()
677
+
678
+ for filter_config in filters:
679
+ filter_type = filter_config.get('type')
680
+ column = filter_config.get('column')
681
+
682
+ if filter_type == 'numerical':
683
+ filtered_df = DataFilter.filter_numerical(
684
+ filtered_df,
685
+ column,
686
+ filter_config.get('min_val'),
687
+ filter_config.get('max_val')
688
+ )
689
+
690
+ elif filter_type == 'categorical':
691
+ filtered_df = DataFilter.filter_categorical(
692
+ filtered_df,
693
+ column,
694
+ filter_config.get('values', [])
695
+ )
696
+
697
+ elif filter_type == 'date':
698
+ filtered_df = DataFilter.filter_date_range(
699
+ filtered_df,
700
+ column,
701
+ filter_config.get('start_date'),
702
+ filter_config.get('end_date')
703
+ )
704
+
705
+ return filtered_df
706
+
707
+
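+
+ # Example filter configurations for apply_multiple_filters (column names and
+ # values are illustrative, loosely based on the Online Retail dataset):
+ #
+ #     filters = [
+ #         {'type': 'numerical', 'column': 'Quantity', 'min_val': 1, 'max_val': 100},
+ #         {'type': 'categorical', 'column': 'Country', 'values': ['United Kingdom', 'France']},
+ #         {'type': 'date', 'column': 'InvoiceDate',
+ #          'start_date': datetime(2011, 1, 1), 'end_date': datetime(2011, 12, 31)},
+ #     ]
+ #     filtered = DataFilter.apply_multiple_filters(df, filters)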
708
+ # ============================================================================
709
+ # MAIN DATA PROCESSOR CLASS
710
+ # Facade pattern - provides simple interface to complex subsystems
711
+ # ============================================================================
712
+
713
+ class DataProcessor:
714
+ """
715
+ Main data processor class using Facade pattern.
716
+ Provides simple interface to complex data loading, cleaning, and filtering operations.
717
+ Follows Dependency Inversion Principle - depends on abstractions, not concrete implementations.
718
+ """
719
+
720
+ def __init__(self):
721
+ """Initialize DataProcessor with all components."""
722
+ self.loader = DataLoader()
723
+ self.cleaner = DataCleaner()
724
+ self.filter = DataFilter()
725
+ self.current_df: Optional[pd.DataFrame] = None
726
+ self.original_df: Optional[pd.DataFrame] = None
727
+ self.profiler: Optional[DataProfiler] = None
728
+
729
+ def load_and_prepare_data(self, filepath: Union[str, Path],
730
+ clean: bool = True,
731
+ remove_duplicates: bool = True) -> pd.DataFrame:
732
+ """
733
+ Load and prepare data with automatic cleaning.
734
+
735
+ Args:
736
+ filepath: Path to data file
737
+ clean: Whether to apply automatic type conversion
738
+ remove_duplicates: Whether to remove duplicate rows
739
+
740
+ Returns:
741
+ Prepared DataFrame
742
+ """
743
+ # Load data
744
+ df = self.loader.load_data(filepath)
745
+ self.original_df = df.copy()
746
+
747
+ # Clean data
748
+ if clean:
749
+ df = self.cleaner.convert_data_types(df)
750
+
751
+ if remove_duplicates:
752
+ df = self.cleaner.remove_duplicates(df)
753
+
754
+ self.current_df = df
755
+ self.profiler = DataProfiler(df)
756
+
757
+ logger.info("Data loaded and prepared successfully")
758
+ return df
759
+
760
+ def get_data_profile(self) -> Dict[str, Any]:
761
+ """
762
+ Get comprehensive data profile.
763
+
764
+ Returns:
765
+ Dict with profiling information
766
+ """
767
+ if self.profiler is None:
768
+ raise ValueError("No data loaded. Call load_and_prepare_data first.")
769
+
770
+ return self.profiler.get_full_profile()
771
+
772
+ def apply_filters(self, filters: List[Dict[str, Any]]) -> pd.DataFrame:
773
+ """
774
+ Apply filters to current data.
775
+
776
+ Args:
777
+ filters: List of filter configurations
778
+
779
+ Returns:
780
+ Filtered DataFrame
781
+ """
782
+ if self.current_df is None:
783
+ raise ValueError("No data loaded. Call load_and_prepare_data first.")
784
+
785
+ return self.filter.apply_multiple_filters(self.current_df, filters)
786
+
787
+ def reset_to_original(self) -> pd.DataFrame:
788
+ """
789
+ Reset current data to original loaded data.
790
+
791
+ Returns:
792
+ Original DataFrame
793
+ """
794
+ if self.original_df is None:
795
+ raise ValueError("No data loaded. Call load_and_prepare_data first.")
796
+
797
+ self.current_df = self.original_df.copy()
798
+ return self.current_df
799
+
800
+ def get_column_info(self) -> Dict[str, List[str]]:
801
+ """
802
+ Get categorized column information.
803
+
804
+ Returns:
805
+ Dict with numerical, categorical, and datetime columns
806
+ """
807
+ if self.current_df is None:
808
+ raise ValueError("No data loaded. Call load_and_prepare_data first.")
809
+
810
+ return get_column_types(self.current_df)
811
+
812
+
813
+ if __name__ == "__main__":
814
+ # Example usage
815
+ print("DataProcessor module loaded successfully")
816
+
817
+ # Demonstrate Strategy Pattern
818
+ processor = DataProcessor()
819
+ print(f"Available strategies: {len(processor.loader.strategies)}")
insights.py ADDED
@@ -0,0 +1,897 @@
1
+ """
2
+ Insights Module for Business Intelligence Dashboard
3
+
4
+ This module handles automated insight generation from data.
5
+ Uses Strategy Pattern for different types of insights.
6
+
7
+ Author: Craig
8
+ Date: December 2024
9
+ """
10
+
11
+ import pandas as pd
12
+ import numpy as np
13
+ from typing import Union, List, Dict, Optional, Any, Tuple
14
+ from abc import ABC, abstractmethod
15
+ import logging
16
+ from datetime import datetime, timedelta
17
+
18
+ from utils import (
19
+ DataFrameValidator, ColumnValidator,
20
+ format_number, format_percentage, safe_divide,
21
+ get_column_types
22
+ )
23
+
24
+ # Configure logging
25
+ logging.basicConfig(level=logging.INFO)
26
+ logger = logging.getLogger(__name__)
27
+
28
+
29
+ # ============================================================================
30
+ # STRATEGY PATTERN - Insight Strategies
31
+ # Follows Open/Closed Principle and Strategy Pattern
32
+ # ============================================================================
33
+
34
+ class InsightStrategy(ABC):
35
+ """
36
+ Abstract base class for insight generation strategies.
37
+ Follows Strategy Pattern - allows different insight algorithms.
38
+ """
39
+
40
+ @abstractmethod
41
+ def generate(self, df: pd.DataFrame, **kwargs) -> Dict[str, Any]:
42
+ """
43
+ Generate insights from data.
44
+
45
+ Args:
46
+ df: DataFrame to analyze
47
+ **kwargs: Additional parameters for insight generation
48
+
49
+ Returns:
50
+ Dict containing insight information
51
+ """
52
+ pass
53
+
54
+ @abstractmethod
55
+ def get_insight_type(self) -> str:
56
+ """
57
+ Get the type of insight this strategy generates.
58
+
59
+ Returns:
60
+ str: Insight type name
61
+ """
62
+ pass
63
+
64
+
65
+ # ============================================================================
66
+ # TOP/BOTTOM PERFORMERS INSIGHTS
67
+ # ============================================================================
68
+
69
+ class TopBottomPerformers(InsightStrategy):
70
+ """
71
+ Identify top and bottom performers in the data.
72
+ Follows Single Responsibility Principle - only handles top/bottom analysis.
73
+ """
74
+
75
+ def get_insight_type(self) -> str:
76
+ """Get insight type."""
77
+ return "top_bottom_performers"
78
+
79
+ def generate(self, df: pd.DataFrame,
80
+ column: str,
81
+ group_by: Optional[str] = None,
82
+ top_n: int = 5,
83
+ bottom_n: int = 5,
84
+ aggregation: str = 'sum',
85
+ **kwargs) -> Dict[str, Any]:
86
+ """
87
+ Generate top and bottom performer insights.
88
+
89
+ Args:
90
+ df: DataFrame to analyze
91
+ column: Column to analyze for performance
92
+ group_by: Optional column to group by
93
+ top_n: Number of top performers to identify
94
+ bottom_n: Number of bottom performers to identify
95
+ aggregation: Aggregation method if group_by is used
96
+ **kwargs: Additional parameters
97
+
98
+ Returns:
99
+ Dict with top and bottom performers
100
+ """
101
+ # Validate inputs
102
+ DataFrameValidator().validate(df)
103
+ ColumnValidator().validate(df, column)
104
+
105
+ if group_by:
106
+ ColumnValidator().validate(df, group_by)
107
+
108
+ # Aggregate by group
109
+ if aggregation == 'sum':
110
+ data = df.groupby(group_by)[column].sum().sort_values(ascending=False)
111
+ elif aggregation == 'mean':
112
+ data = df.groupby(group_by)[column].mean().sort_values(ascending=False)
113
+ elif aggregation == 'count':
114
+ data = df.groupby(group_by)[column].count().sort_values(ascending=False)
115
+ elif aggregation == 'median':
116
+ data = df.groupby(group_by)[column].median().sort_values(ascending=False)
117
+ else:
118
+ data = df.groupby(group_by)[column].sum().sort_values(ascending=False)
119
+ else:
120
+ # Direct analysis on column
121
+ data = df[column].sort_values(ascending=False)
122
+
123
+ # Get top and bottom performers
124
+ top_performers = data.head(top_n)
125
+ bottom_performers = data.tail(bottom_n).sort_values(ascending=True)
126
+
127
+ # Calculate statistics
128
+ total = data.sum()
129
+ top_contribution = safe_divide(top_performers.sum(), total) if total != 0 else 0
130
+ bottom_contribution = safe_divide(bottom_performers.sum(), total) if total != 0 else 0
131
+
132
+ insight = {
133
+ 'type': self.get_insight_type(),
134
+ 'column': column,
135
+ 'group_by': group_by,
136
+ 'aggregation': aggregation if group_by else 'direct',
137
+ 'top_performers': {
138
+ 'data': top_performers.to_dict(),
139
+ 'count': len(top_performers),
140
+ 'total_value': top_performers.sum(),
141
+ 'contribution_percentage': top_contribution
142
+ },
143
+ 'bottom_performers': {
144
+ 'data': bottom_performers.to_dict(),
145
+ 'count': len(bottom_performers),
146
+ 'total_value': bottom_performers.sum(),
147
+ 'contribution_percentage': bottom_contribution
148
+ },
149
+ 'summary': self._generate_summary(
150
+ column, group_by, top_performers, bottom_performers,
151
+ top_contribution, bottom_contribution
152
+ )
153
+ }
154
+
155
+ logger.info(f"Generated top/bottom performers insight for {column}")
156
+ return insight
157
+
158
+ def _generate_summary(self, column: str, group_by: Optional[str],
159
+ top: pd.Series, bottom: pd.Series,
160
+ top_contrib: float, bottom_contrib: float) -> str:
161
+ """Generate human-readable summary."""
162
+ if group_by:
163
+ top_name = top.index[0] if len(top) > 0 else "N/A"
164
+ bottom_name = bottom.index[0] if len(bottom) > 0 else "N/A"
165
+
166
+ summary = f"Top performer in {column}: '{top_name}' with {format_number(top.iloc[0])}. "
167
+ summary += f"Bottom performer: '{bottom_name}' with {format_number(bottom.iloc[0])}. "
168
+ summary += f"Top {len(top)} performers contribute {format_percentage(top_contrib)} of total."
169
+ else:
170
+ summary = f"Highest value in {column}: {format_number(top.iloc[0])}. "
171
+ summary += f"Lowest value: {format_number(bottom.iloc[0])}. "
172
+ summary += f"Range: {format_number(top.iloc[0] - bottom.iloc[0])}"
173
+
174
+ return summary
175
+
176
+
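+
+ # Usage sketch (column and group names are illustrative):
+ #
+ #     insight = TopBottomPerformers().generate(
+ #         df, column='Quantity', group_by='Country', top_n=5, aggregation='sum'
+ #     )
+ #     print(insight['summary'])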
177
+ # ============================================================================
178
+ # TREND ANALYSIS INSIGHTS
179
+ # ============================================================================
180
+
181
+ class TrendAnalysis(InsightStrategy):
182
+ """
183
+ Analyze trends in time series data.
184
+ Follows Single Responsibility Principle - only handles trend analysis.
185
+ """
186
+
187
+ def get_insight_type(self) -> str:
188
+ """Get insight type."""
189
+ return "trend_analysis"
190
+
191
+ def generate(self, df: pd.DataFrame,
192
+ date_column: str,
193
+ value_column: str,
194
+ period: str = 'overall',
195
+ **kwargs) -> Dict[str, Any]:
196
+ """
197
+ Generate trend analysis insights.
198
+
199
+ Args:
200
+ df: DataFrame to analyze
201
+ date_column: Column containing dates
202
+ value_column: Column containing values
203
+ period: Analysis period ('overall', 'monthly', 'weekly', 'daily')
204
+ **kwargs: Additional parameters
205
+
206
+ Returns:
207
+ Dict with trend insights
208
+ """
209
+ # Validate inputs
210
+ DataFrameValidator().validate(df)
211
+ ColumnValidator().validate(df, [date_column, value_column])
212
+
213
+ # Prepare data
214
+ df_trend = df[[date_column, value_column]].copy()
215
+
216
+ # Ensure date column is datetime
217
+ if not pd.api.types.is_datetime64_any_dtype(df_trend[date_column]):
218
+ df_trend[date_column] = pd.to_datetime(df_trend[date_column], errors='coerce')
219
+
220
+ # Remove NaN values
221
+ df_trend = df_trend.dropna()
222
+
223
+ if len(df_trend) < 2:
224
+ return {
225
+ 'type': self.get_insight_type(),
226
+ 'error': 'Insufficient data for trend analysis',
227
+ 'summary': 'Not enough data points to analyze trends.'
228
+ }
229
+
230
+ # Sort by date
231
+ df_trend = df_trend.sort_values(date_column)
232
+
233
+ # Calculate trend metrics
234
+ first_value = df_trend[value_column].iloc[0]
235
+ last_value = df_trend[value_column].iloc[-1]
236
+ change = last_value - first_value
237
+ change_pct = safe_divide(change, first_value)
238
+
239
+ # Determine trend direction
240
+ if change > 0:
241
+ trend_direction = 'increasing'
242
+ elif change < 0:
243
+ trend_direction = 'decreasing'
244
+ else:
245
+ trend_direction = 'stable'
246
+
247
+ # Calculate statistics
248
+ mean_value = df_trend[value_column].mean()
249
+ median_value = df_trend[value_column].median()
250
+ std_value = df_trend[value_column].std()
251
+
252
+ # Calculate growth rate (if applicable)
253
+ growth_rate = self._calculate_growth_rate(df_trend, date_column, value_column)
254
+
255
+ # Detect volatility
256
+ volatility = self._calculate_volatility(df_trend[value_column])
257
+
258
+ insight = {
259
+ 'type': self.get_insight_type(),
260
+ 'date_column': date_column,
261
+ 'value_column': value_column,
262
+ 'period': period,
263
+ 'trend_direction': trend_direction,
264
+ 'metrics': {
265
+ 'first_value': first_value,
266
+ 'last_value': last_value,
267
+ 'absolute_change': change,
268
+ 'percentage_change': change_pct,
269
+ 'mean': mean_value,
270
+ 'median': median_value,
271
+ 'std_deviation': std_value,
272
+ 'growth_rate': growth_rate,
273
+ 'volatility': volatility
274
+ },
275
+ 'date_range': {
276
+ 'start': df_trend[date_column].min().strftime('%Y-%m-%d'),
277
+ 'end': df_trend[date_column].max().strftime('%Y-%m-%d'),
278
+ 'days': (df_trend[date_column].max() - df_trend[date_column].min()).days
279
+ },
280
+ 'summary': self._generate_summary(
281
+ value_column, trend_direction, change, change_pct, volatility
282
+ )
283
+ }
284
+
285
+ logger.info(f"Generated trend analysis insight for {value_column}")
286
+ return insight
287
+
288
+ def _calculate_growth_rate(self, df: pd.DataFrame,
289
+ date_col: str, value_col: str) -> Optional[float]:
290
+ """Calculate average growth rate."""
291
+ try:
292
+ # Simple linear regression for growth rate
293
+ x = (df[date_col] - df[date_col].min()).dt.days.values
294
+ y = df[value_col].values
295
+
296
+ if len(x) < 2:
297
+ return None
298
+
299
+ # Calculate slope
300
+ slope = np.polyfit(x, y, 1)[0]
301
+ return slope
302
+ except Exception:
303
+ return None
304
+
305
+ def _calculate_volatility(self, series: pd.Series) -> str:
306
+ """Calculate volatility level."""
307
+ if len(series) < 2:
308
+ return 'unknown'
309
+
310
+ # Use coefficient of variation
311
+ cv = safe_divide(series.std(), series.mean())
312
+
313
+ if cv < 0.1:
314
+ return 'low'
315
+ elif cv < 0.3:
316
+ return 'moderate'
317
+ else:
318
+ return 'high'
319
+
320
+ def _generate_summary(self, column: str, direction: str,
321
+ change: float, change_pct: float, volatility: str) -> str:
322
+ """Generate human-readable summary."""
323
+ summary = f"{column} shows a {direction} trend with "
324
+ summary += f"{format_percentage(abs(change_pct))} {'increase' if change > 0 else 'decrease'}. "
325
+ summary += f"Absolute change: {format_number(change)}. "
326
+ summary += f"Volatility: {volatility}."
327
+ return summary
328
+
329
+
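+
+ # Usage sketch (assumes a datetime column and a numeric value column):
+ #
+ #     insight = TrendAnalysis().generate(
+ #         df, date_column='InvoiceDate', value_column='UnitPrice'
+ #     )
+ #     print(insight['trend_direction'], insight['metrics']['growth_rate'])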
330
+ # ============================================================================
331
+ # ANOMALY DETECTION INSIGHTS
332
+ # ============================================================================
333
+
334
+ class AnomalyDetection(InsightStrategy):
335
+ """
336
+ Detect anomalies and outliers in data.
337
+ Follows Single Responsibility Principle - only handles anomaly detection.
338
+ """
339
+
340
+ def get_insight_type(self) -> str:
341
+ """Get insight type."""
342
+ return "anomaly_detection"
343
+
344
+ def generate(self, df: pd.DataFrame,
345
+ column: str,
346
+ method: str = 'zscore',
347
+ threshold: float = 3.0,
348
+ **kwargs) -> Dict[str, Any]:
349
+ """
350
+ Generate anomaly detection insights.
351
+
352
+ Args:
353
+ df: DataFrame to analyze
354
+ column: Column to analyze for anomalies
355
+ method: Detection method ('zscore' or 'iqr')
356
+ threshold: Threshold for anomaly detection
357
+ **kwargs: Additional parameters
358
+
359
+ Returns:
360
+ Dict with anomaly insights
361
+ """
362
+ # Validate inputs
363
+ DataFrameValidator().validate(df)
364
+ ColumnValidator().validate(df, column)
365
+
366
+ # Check if column is numerical
367
+ if not pd.api.types.is_numeric_dtype(df[column]):
368
+ return {
369
+ 'type': self.get_insight_type(),
370
+ 'error': f'Column {column} is not numerical',
371
+ 'summary': f'Cannot detect anomalies in non-numerical column {column}.'
372
+ }
373
+
374
+ # Remove NaN values
375
+ data = df[column].dropna()
376
+
377
+ if len(data) < 3:
378
+ return {
379
+ 'type': self.get_insight_type(),
380
+ 'error': 'Insufficient data',
381
+ 'summary': 'Not enough data points to detect anomalies.'
382
+ }
383
+
384
+ # Detect anomalies
385
+ if method == 'zscore':
386
+ anomalies_mask = self._detect_zscore(data, threshold)
387
+ elif method == 'iqr':
388
+ anomalies_mask = self._detect_iqr(data, threshold)
389
+ else:
390
+ raise ValueError(f"Unsupported method: {method}")
391
+
392
+ anomalies = data[anomalies_mask]
393
+
394
+ # Calculate statistics
395
+ total_points = len(data)
396
+ anomaly_count = len(anomalies)
397
+ anomaly_percentage = safe_divide(anomaly_count, total_points)
398
+
399
+ insight = {
400
+ 'type': self.get_insight_type(),
401
+ 'column': column,
402
+ 'method': method,
403
+ 'threshold': threshold,
404
+ 'statistics': {
405
+ 'total_points': total_points,
406
+ 'anomaly_count': anomaly_count,
407
+ 'anomaly_percentage': anomaly_percentage,
408
+ 'mean': data.mean(),
409
+ 'median': data.median(),
410
+ 'std': data.std(),
411
+ 'min': data.min(),
412
+ 'max': data.max()
413
+ },
414
+ 'anomalies': {
415
+ 'values': anomalies.tolist()[:20], # Limit to first 20
416
+ 'max_anomaly': anomalies.max() if len(anomalies) > 0 else None,
417
+ 'min_anomaly': anomalies.min() if len(anomalies) > 0 else None
418
+ },
419
+ 'summary': self._generate_summary(
420
+ column, method, anomaly_count, anomaly_percentage,
421
+ anomalies.max() if len(anomalies) > 0 else None,
422
+ anomalies.min() if len(anomalies) > 0 else None
423
+ )
424
+ }
425
+
426
+ logger.info(f"Generated anomaly detection insight for {column}")
427
+ return insight
428
+
429
+ def _detect_zscore(self, series: pd.Series, threshold: float) -> pd.Series:
430
+ """Detect anomalies using Z-score method."""
431
+ z_scores = np.abs((series - series.mean()) / series.std())
432
+ return z_scores > threshold
433
+
434
+ def _detect_iqr(self, series: pd.Series, threshold: float) -> pd.Series:
435
+ """Detect anomalies using IQR method."""
436
+ Q1 = series.quantile(0.25)
437
+ Q3 = series.quantile(0.75)
438
+ IQR = Q3 - Q1
439
+ lower_bound = Q1 - threshold * IQR
440
+ upper_bound = Q3 + threshold * IQR
441
+ return (series < lower_bound) | (series > upper_bound)
442
+
443
+ def _generate_summary(self, column: str, method: str,
444
+ count: int, percentage: float,
445
+ max_anomaly: Optional[float],
446
+ min_anomaly: Optional[float]) -> str:
447
+ """Generate human-readable summary."""
448
+ if count == 0:
449
+ return f"No anomalies detected in {column} using {method} method."
450
+
451
+ summary = f"Detected {count} anomalies ({format_percentage(percentage)}) in {column}. "
452
+
453
+ if max_anomaly is not None and min_anomaly is not None:
454
+ summary += f"Range of anomalies: {format_number(min_anomaly)} to {format_number(max_anomaly)}."
455
+
456
+ return summary
457
+
458
+
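+
+ # Usage sketch comparing the two detection methods (column name is illustrative):
+ #
+ #     z_res = AnomalyDetection().generate(df, column='Quantity', method='zscore', threshold=3.0)
+ #     iqr_res = AnomalyDetection().generate(df, column='Quantity', method='iqr', threshold=1.5)
+ #     print(z_res['statistics']['anomaly_count'], iqr_res['statistics']['anomaly_count'])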
459
+ # ============================================================================
460
+ # DISTRIBUTION INSIGHTS
461
+ # ============================================================================
462
+
463
+ class DistributionInsights(InsightStrategy):
464
+ """
465
+ Analyze data distribution characteristics.
466
+ Follows Single Responsibility Principle - only handles distribution analysis.
467
+ """
468
+
469
+ def get_insight_type(self) -> str:
470
+ """Get insight type."""
471
+ return "distribution_insights"
472
+
473
+ def generate(self, df: pd.DataFrame,
474
+ column: str,
475
+ **kwargs) -> Dict[str, Any]:
476
+ """
477
+ Generate distribution insights.
478
+
479
+ Args:
480
+ df: DataFrame to analyze
481
+ column: Column to analyze
482
+ **kwargs: Additional parameters
483
+
484
+ Returns:
485
+ Dict with distribution insights
486
+ """
487
+ # Validate inputs
488
+ DataFrameValidator().validate(df)
489
+ ColumnValidator().validate(df, column)
490
+
491
+ # Check if column is numerical
492
+ if not pd.api.types.is_numeric_dtype(df[column]):
493
+ # For categorical columns
494
+ return self._categorical_distribution(df, column)
495
+ else:
496
+ # For numerical columns
497
+ return self._numerical_distribution(df, column)
498
+
499
+ def _numerical_distribution(self, df: pd.DataFrame, column: str) -> Dict[str, Any]:
500
+ """Analyze numerical distribution."""
501
+ data = df[column].dropna()
502
+
503
+ if len(data) == 0:
504
+ return {
505
+ 'type': self.get_insight_type(),
506
+ 'error': 'No valid data',
507
+ 'summary': f'No valid data in column {column}.'
508
+ }
509
+
510
+ # Calculate statistics
511
+ statistics = {
512
+ 'count': len(data),
513
+ 'mean': data.mean(),
514
+ 'median': data.median(),
515
+ 'mode': data.mode()[0] if len(data.mode()) > 0 else None,
516
+ 'std': data.std(),
517
+ 'min': data.min(),
518
+ 'max': data.max(),
519
+ 'range': data.max() - data.min(),
520
+ 'q1': data.quantile(0.25),
521
+ 'q3': data.quantile(0.75),
522
+ 'iqr': data.quantile(0.75) - data.quantile(0.25),
523
+ 'skewness': data.skew(),
524
+ 'kurtosis': data.kurtosis()
525
+ }
526
+
527
+ # Determine distribution shape
528
+ shape = self._determine_shape(statistics['skewness'], statistics['kurtosis'])
529
+
530
+ insight = {
531
+ 'type': self.get_insight_type(),
532
+ 'column': column,
533
+ 'data_type': 'numerical',
534
+ 'statistics': statistics,
535
+ 'distribution_shape': shape,
536
+ 'summary': self._generate_numerical_summary(column, statistics, shape)
537
+ }
538
+
539
+ logger.info(f"Generated distribution insight for {column}")
540
+ return insight
541
+
542
+ def _categorical_distribution(self, df: pd.DataFrame, column: str) -> Dict[str, Any]:
543
+ """Analyze categorical distribution."""
544
+ data = df[column].dropna()
545
+
546
+ if len(data) == 0:
547
+ return {
548
+ 'type': self.get_insight_type(),
549
+ 'error': 'No valid data',
550
+ 'summary': f'No valid data in column {column}.'
551
+ }
552
+
553
+ # Calculate statistics
554
+ value_counts = data.value_counts()
555
+
556
+ statistics = {
557
+ 'count': len(data),
558
+ 'unique_values': data.nunique(),
559
+ 'most_common': value_counts.index[0],
560
+ 'most_common_count': value_counts.iloc[0],
561
+ 'most_common_percentage': safe_divide(value_counts.iloc[0], len(data)),
562
+ 'least_common': value_counts.index[-1],
563
+ 'least_common_count': value_counts.iloc[-1]
564
+ }
565
+
566
+ insight = {
567
+ 'type': self.get_insight_type(),
568
+ 'column': column,
569
+ 'data_type': 'categorical',
570
+ 'statistics': statistics,
571
+ 'value_counts': value_counts.head(10).to_dict(),
572
+ 'summary': self._generate_categorical_summary(column, statistics)
573
+ }
574
+
575
+ logger.info(f"Generated distribution insight for {column}")
576
+ return insight
577
+
578
+ def _determine_shape(self, skewness: float, kurtosis: float) -> str:
579
+ """Determine distribution shape from skewness and kurtosis."""
580
+ if abs(skewness) < 0.5 and abs(kurtosis) < 0.5:
581
+ return 'approximately normal'
582
+ elif skewness > 0.5:
583
+ return 'right-skewed (positive skew)'
584
+ elif skewness < -0.5:
585
+ return 'left-skewed (negative skew)'
586
+ elif kurtosis > 1:
587
+ return 'heavy-tailed (leptokurtic)'
588
+ elif kurtosis < -1:
589
+ return 'light-tailed (platykurtic)'
590
+ else:
591
+ return 'mixed characteristics'
592
+
593
+ def _generate_numerical_summary(self, column: str,
594
+ stats: Dict, shape: str) -> str:
595
+ """Generate summary for numerical distribution."""
596
+ summary = f"{column} has a {shape} distribution. "
597
+ summary += f"Mean: {format_number(stats['mean'])}, "
598
+ summary += f"Median: {format_number(stats['median'])}, "
599
+ summary += f"Std Dev: {format_number(stats['std'])}. "
600
+ summary += f"Range: {format_number(stats['min'])} to {format_number(stats['max'])}."
601
+ return summary
602
+
603
+ def _generate_categorical_summary(self, column: str, stats: Dict) -> str:
604
+ """Generate summary for categorical distribution."""
605
+ summary = f"{column} has {stats['unique_values']} unique values. "
606
+ summary += f"Most common: '{stats['most_common']}' "
607
+ summary += f"({format_percentage(stats['most_common_percentage'])})."
608
+ return summary
609
+
610
+
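+
+ # Usage sketch: the same call handles numerical and categorical columns
+ # (column name is illustrative; 'distribution_shape' is only set for numerical data).
+ #
+ #     insight = DistributionInsights().generate(df, column='Quantity')
+ #     print(insight.get('distribution_shape'), insight['summary'])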
611
+ # ============================================================================
612
+ # CORRELATION INSIGHTS
613
+ # ============================================================================
614
+
615
+ class CorrelationInsights(InsightStrategy):
616
+ """
617
+ Identify strong correlations between variables.
618
+ Follows Single Responsibility Principle - only handles correlation analysis.
619
+ """
620
+
621
+ def get_insight_type(self) -> str:
622
+ """Get insight type."""
623
+ return "correlation_insights"
624
+
625
+ def generate(self, df: pd.DataFrame,
626
+ columns: Optional[List[str]] = None,
627
+ threshold: float = 0.7,
628
+ method: str = 'pearson',
629
+ **kwargs) -> Dict[str, Any]:
630
+ """
631
+ Generate correlation insights.
632
+
633
+ Args:
634
+ df: DataFrame to analyze
635
+ columns: Optional list of columns to analyze
636
+ threshold: Correlation threshold for strong correlations
637
+ method: Correlation method ('pearson', 'spearman', 'kendall')
638
+ **kwargs: Additional parameters
639
+
640
+ Returns:
641
+ Dict with correlation insights
642
+ """
643
+ # Validate inputs
644
+ DataFrameValidator().validate(df)
645
+
646
+ # Select numerical columns
647
+ if columns:
648
+ ColumnValidator().validate(df, columns)
649
+ df_corr = df[columns].select_dtypes(include=[np.number])
650
+ else:
651
+ df_corr = df.select_dtypes(include=[np.number])
652
+
653
+ if df_corr.shape[1] < 2:
654
+ return {
655
+ 'type': self.get_insight_type(),
656
+ 'error': 'Insufficient numerical columns',
657
+ 'summary': 'Need at least 2 numerical columns for correlation analysis.'
658
+ }
659
+
660
+ # Calculate correlation matrix
661
+ corr_matrix = df_corr.corr(method=method)
662
+
663
+ # Find strong correlations
664
+ strong_correlations = []
665
+
666
+ for i in range(len(corr_matrix.columns)):
667
+ for j in range(i + 1, len(corr_matrix.columns)):
668
+ corr_value = corr_matrix.iloc[i, j]
669
+
670
+ if abs(corr_value) >= threshold:
671
+ strong_correlations.append({
672
+ 'variable1': corr_matrix.columns[i],
673
+ 'variable2': corr_matrix.columns[j],
674
+ 'correlation': corr_value,
675
+ 'strength': self._classify_strength(abs(corr_value)),
676
+ 'direction': 'positive' if corr_value > 0 else 'negative'
677
+ })
678
+
679
+ # Sort by absolute correlation value
680
+ strong_correlations.sort(key=lambda x: abs(x['correlation']), reverse=True)
681
+
682
+ insight = {
683
+ 'type': self.get_insight_type(),
684
+ 'method': method,
685
+ 'threshold': threshold,
686
+ 'total_pairs_analyzed': len(corr_matrix.columns) * (len(corr_matrix.columns) - 1) // 2,
687
+ 'strong_correlations_found': len(strong_correlations),
688
+ 'correlations': strong_correlations[:10], # Top 10
689
+ 'summary': self._generate_summary(strong_correlations, threshold)
690
+ }
691
+
692
+ logger.info(f"Generated correlation insights with {len(strong_correlations)} strong correlations")
693
+ return insight
694
+
695
+ def _classify_strength(self, abs_corr: float) -> str:
696
+ """Classify correlation strength."""
697
+ if abs_corr >= 0.9:
698
+ return 'very strong'
699
+ elif abs_corr >= 0.7:
700
+ return 'strong'
701
+ elif abs_corr >= 0.5:
702
+ return 'moderate'
703
+ elif abs_corr >= 0.3:
704
+ return 'weak'
705
+ else:
706
+ return 'very weak'
707
+
708
+ def _generate_summary(self, correlations: List[Dict], threshold: float) -> str:
709
+ """Generate human-readable summary."""
710
+ if len(correlations) == 0:
711
+ return f"No strong correlations (threshold: {threshold}) found."
712
+
713
+ top = correlations[0]
714
+ summary = f"Found {len(correlations)} strong correlations. "
715
+ summary += f"Strongest: {top['variable1']} and {top['variable2']} "
716
+ summary += f"({top['direction']}, {format_number(top['correlation'])})."
717
+
718
+ return summary
719
+
720
+
721
+ # ============================================================================
722
+ # INSIGHT MANAGER
723
+ # Uses Strategy Pattern to manage different insight types
724
+ # ============================================================================
725
+
726
+ class InsightManager:
727
+ """
728
+ Manager class for insights using Strategy Pattern.
729
+ Follows Open/Closed Principle - open for extension, closed for modification.
730
+ """
731
+
732
+ def __init__(self):
733
+ """Initialize InsightManager with all available strategies."""
734
+ self.strategies: Dict[str, InsightStrategy] = {
735
+ 'top_bottom': TopBottomPerformers(),
736
+ 'trend': TrendAnalysis(),
737
+ 'anomaly': AnomalyDetection(),
738
+ 'distribution': DistributionInsights(),
739
+ 'correlation': CorrelationInsights()
740
+ }
741
+
742
+ def generate_insight(self, insight_type: str, df: pd.DataFrame, **kwargs) -> Dict[str, Any]:
743
+ """
744
+ Generate insight using specified strategy.
745
+
746
+ Args:
747
+ insight_type: Type of insight to generate
748
+ df: DataFrame to analyze
749
+ **kwargs: Parameters specific to insight type
750
+
751
+ Returns:
752
+ Dict with insight information
753
+
754
+ Raises:
755
+ ValueError: If insight type is not supported
756
+ """
757
+ if insight_type not in self.strategies:
758
+ raise ValueError(
759
+ f"Unsupported insight type: {insight_type}. "
760
+ f"Available types: {list(self.strategies.keys())}"
761
+ )
762
+
763
+ strategy = self.strategies[insight_type]
764
+ return strategy.generate(df, **kwargs)
765
+
766
+ def generate_all_insights(self, df: pd.DataFrame,
767
+ config: Optional[Dict[str, Dict]] = None) -> Dict[str, Dict[str, Any]]:
768
+ """
769
+ Generate all available insights.
770
+
771
+ Args:
772
+ df: DataFrame to analyze
773
+ config: Optional configuration for each insight type
774
+
775
+ Returns:
776
+ Dict with all insights
777
+ """
778
+ all_insights = {}
779
+
780
+ # Get column types
781
+ column_types = get_column_types(df)
782
+
783
+ # Generate insights based on available data
784
+ try:
785
+ # Top/Bottom performers (if numerical columns exist)
786
+ if len(column_types['numerical']) > 0:
787
+ col = column_types['numerical'][0]
788
+ params = config.get('top_bottom', {}) if config else {}
789
+ all_insights['top_bottom'] = self.generate_insight(
790
+ 'top_bottom', df, column=col, **params
791
+ )
792
+ except Exception as e:
793
+ logger.warning(f"Could not generate top/bottom insight: {e}")
794
+
795
+ try:
796
+ # Distribution insights
797
+ if len(column_types['numerical']) > 0:
798
+ col = column_types['numerical'][0]
799
+ params = config.get('distribution', {}) if config else {}
800
+ all_insights['distribution'] = self.generate_insight(
801
+ 'distribution', df, column=col, **params
802
+ )
803
+ except Exception as e:
804
+ logger.warning(f"Could not generate distribution insight: {e}")
805
+
806
+ try:
807
+ # Anomaly detection
808
+ if len(column_types['numerical']) > 0:
809
+ col = column_types['numerical'][0]
810
+ params = config.get('anomaly', {}) if config else {}
811
+ all_insights['anomaly'] = self.generate_insight(
812
+ 'anomaly', df, column=col, **params
813
+ )
814
+ except Exception as e:
815
+ logger.warning(f"Could not generate anomaly insight: {e}")
816
+
817
+ try:
818
+ # Correlation insights
819
+ if len(column_types['numerical']) >= 2:
820
+ params = config.get('correlation', {}) if config else {}
821
+ all_insights['correlation'] = self.generate_insight(
822
+ 'correlation', df, **params
823
+ )
824
+ except Exception as e:
825
+ logger.warning(f"Could not generate correlation insight: {e}")
826
+
827
+ try:
828
+ # Trend analysis (if datetime columns exist)
829
+ if len(column_types['datetime']) > 0 and len(column_types['numerical']) > 0:
830
+ date_col = column_types['datetime'][0]
831
+ value_col = column_types['numerical'][0]
832
+ params = config.get('trend', {}) if config else {}
833
+ all_insights['trend'] = self.generate_insight(
834
+ 'trend', df, date_column=date_col, value_column=value_col, **params
835
+ )
836
+ except Exception as e:
837
+ logger.warning(f"Could not generate trend insight: {e}")
838
+
839
+ return all_insights
840
+
841
+ def add_strategy(self, name: str, strategy: InsightStrategy) -> None:
842
+ """
843
+ Add new insight strategy.
844
+ Follows Open/Closed Principle - extend functionality without modifying existing code.
845
+
846
+ Args:
847
+ name: Name for the strategy
848
+ strategy: Insight strategy instance
849
+ """
850
+ self.strategies[name] = strategy
851
+ logger.info(f"Added new insight strategy: {name}")
852
+
853
+ def get_available_insights(self) -> List[str]:
854
+ """
855
+ Get list of available insight types.
856
+
857
+ Returns:
858
+ List of insight type names
859
+ """
860
+ return list(self.strategies.keys())
861
+
862
+ def format_insight_report(self, insights: Dict[str, Dict[str, Any]]) -> str:
863
+ """
864
+ Format insights into a readable report.
865
+
866
+ Args:
867
+ insights: Dict of insights from generate_all_insights
868
+
869
+ Returns:
870
+ Formatted string report
871
+ """
872
+ report = "=" * 80 + "\n"
873
+ report += "AUTOMATED INSIGHTS REPORT\n"
874
+ report += "=" * 80 + "\n\n"
875
+
876
+ for insight_name, insight_data in insights.items():
877
+ report += f"\n{insight_name.upper().replace('_', ' ')}\n"
878
+ report += "-" * 80 + "\n"
879
+
880
+ if 'error' in insight_data:
881
+ report += f"Error: {insight_data['error']}\n"
882
+ elif 'summary' in insight_data:
883
+ report += f"{insight_data['summary']}\n"
884
+
885
+ report += "\n"
886
+
887
+ report += "=" * 80 + "\n"
888
+ return report
889
+
890
+
891
+ if __name__ == "__main__":
892
+ # Example usage
893
+ print("Insights module loaded successfully")
894
+
895
+ # Demonstrate available insights
896
+ manager = InsightManager()
897
+ print(f"Available insights: {manager.get_available_insights()}")
requirements.txt ADDED
@@ -0,0 +1,7 @@
1
+ gradio
2
+ pandas
3
+ numpy
4
+ matplotlib
5
+ seaborn
6
+ plotly
7
+ openpyxl
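
The list above omits `pytest`, which the test suite imports, and a Parquet engine (pandas typically needs `pyarrow` or `fastparquet`) for the Parquet loader. A small, hypothetical preflight check, not part of the commit:

```python
# Hypothetical helper: report which optional packages are importable before running
# the tests or using the Parquet loader; neither is pinned in requirements.txt above.
import importlib

for pkg in ("pytest", "pyarrow"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: available")
    except ImportError:
        print(f"{pkg}: not installed — add it if you need the tests or Parquet support")
```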
tests/__pycache__/conftest.cpython-310-pytest-8.4.2.pyc ADDED
Binary file (411 Bytes).
 
tests/__pycache__/test_data_processor.cpython-310-pytest-8.4.2.pyc ADDED
Binary file (33 kB).
 
tests/__pycache__/test_insights.cpython-310-pytest-8.4.2.pyc ADDED
Binary file (29.5 kB).
 
tests/__pycache__/test_utils.cpython-310-pytest-8.4.2.pyc ADDED
Binary file (29.6 kB).
 
tests/__pycache__/test_visualizations.cpython-310-pytest-8.4.2.pyc ADDED
Binary file (26.5 kB).
 
tests/conftest.py ADDED
@@ -0,0 +1,5 @@
1
+ import sys
2
+ from pathlib import Path
3
+
4
+ # Add the parent directory to Python path so tests can import modules
5
+ sys.path.insert(0, str(Path(__file__).parent.parent))
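
With this `conftest.py` in place, pytest prepends the repository root to `sys.path` before collecting tests, so modules such as `data_processor` and `insights` can be imported directly from the test files. A minimal illustrative test (not part of the commit) showing the effect:

```python
# tests/test_imports.py — hypothetical extra test relying on the conftest path tweak
from data_processor import DataProcessor      # resolves because conftest.py prepends the repo root
from insights import InsightManager


def test_top_level_modules_importable():
    assert DataProcessor is not None
    assert 'correlation' in InsightManager().get_available_insights()
```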
tests/test_app.py ADDED
File without changes
tests/test_data_processor.py ADDED
@@ -0,0 +1,453 @@
1
+ """
2
+ Unit Tests for Data Processor Module
3
+
4
+ Comprehensive tests for all data processing functionality including
5
+ Strategy Pattern implementation, data loading, cleaning, and filtering.
6
+
7
+ Author: Craig
8
+ Date: December 2024
9
+ """
10
+
11
+ import pytest
12
+ import pandas as pd
13
+ import numpy as np
14
+ from pathlib import Path
15
+ import tempfile
16
+ import os
17
+ from datetime import datetime
18
+
19
+ from data_processor import (
20
+ DataLoadStrategy, CSVLoadStrategy, ExcelLoadStrategy,
21
+ JSONLoadStrategy, ParquetLoadStrategy,
22
+ DataLoader, DataCleaner, DataProfiler, DataFilter, DataProcessor
23
+ )
24
+
25
+
26
+ # ============================================================================
27
+ # FIXTURES
28
+ # ============================================================================
29
+
30
+ @pytest.fixture
31
+ def sample_dataframe():
32
+ """Create a sample DataFrame for testing."""
33
+ return pd.DataFrame({
34
+ 'id': [1, 2, 3, 4, 5],
35
+ 'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
36
+ 'age': [25, 30, 35, 40, 45],
37
+ 'salary': [50000, 60000, 70000, 80000, 90000],
38
+ 'department': ['HR', 'IT', 'IT', 'Finance', 'HR'],
39
+ 'hire_date': pd.date_range('2020-01-01', periods=5)
40
+ })
41
+
42
+
43
+ @pytest.fixture
44
+ def dataframe_with_missing():
45
+ """Create DataFrame with missing values."""
46
+ return pd.DataFrame({
47
+ 'col1': [1, 2, np.nan, 4, 5],
48
+ 'col2': ['a', 'b', 'c', np.nan, 'e'],
49
+ 'col3': [10.5, np.nan, 30.5, 40.5, 50.5]
50
+ })
51
+
52
+
53
+ @pytest.fixture
54
+ def dataframe_with_duplicates():
55
+ """Create DataFrame with duplicate rows."""
56
+ return pd.DataFrame({
57
+ 'id': [1, 2, 3, 2, 4],
58
+ 'value': ['a', 'b', 'c', 'b', 'd']
59
+ })
60
+
61
+
62
+ @pytest.fixture
63
+ def temp_csv_file(sample_dataframe):
64
+ """Create temporary CSV file."""
65
+ temp_path = tempfile.mktemp(suffix='.csv')
66
+ sample_dataframe.to_csv(temp_path, index=False)
67
+ yield temp_path
68
+ if os.path.exists(temp_path):
69
+ os.remove(temp_path)
70
+
71
+
72
+ @pytest.fixture
73
+ def temp_excel_file(sample_dataframe):
74
+ """Create temporary Excel file."""
75
+ temp_path = tempfile.mktemp(suffix='.xlsx')
76
+ sample_dataframe.to_excel(temp_path, index=False)
77
+ yield temp_path
78
+ if os.path.exists(temp_path):
79
+ os.remove(temp_path)
80
+
81
+
82
+ @pytest.fixture
83
+ def temp_json_file(sample_dataframe):
84
+ """Create temporary JSON file."""
85
+ temp_path = tempfile.mktemp(suffix='.json')
86
+ # Drop datetime column for JSON compatibility
87
+ df_json = sample_dataframe.drop('hire_date', axis=1)
88
+ df_json.to_json(temp_path)
89
+ yield temp_path
90
+ if os.path.exists(temp_path):
91
+ os.remove(temp_path)
92
+
93
+
94
+ # ============================================================================
95
+ # STRATEGY PATTERN TESTS
96
+ # ============================================================================
97
+
98
+ class TestCSVLoadStrategy:
99
+ """Test suite for CSVLoadStrategy."""
100
+
101
+ def test_can_handle_csv(self):
102
+ """Test CSV file detection."""
103
+ strategy = CSVLoadStrategy()
104
+ assert strategy.can_handle('file.csv') is True
105
+ assert strategy.can_handle('file.CSV') is True
106
+ assert strategy.can_handle('file.xlsx') is False
107
+
108
+ def test_load_csv(self, temp_csv_file):
109
+ """Test loading CSV file."""
110
+ strategy = CSVLoadStrategy()
111
+ df = strategy.load(temp_csv_file)
112
+ assert isinstance(df, pd.DataFrame)
113
+ assert len(df) > 0
114
+
115
+ def test_load_nonexistent_csv(self):
116
+ """Test loading non-existent CSV file."""
117
+ strategy = CSVLoadStrategy()
118
+ with pytest.raises(Exception):
119
+ strategy.load('nonexistent.csv')
120
+
121
+
122
+ class TestExcelLoadStrategy:
123
+ """Test suite for ExcelLoadStrategy."""
124
+
125
+ def test_can_handle_excel(self):
126
+ """Test Excel file detection."""
127
+ strategy = ExcelLoadStrategy()
128
+ assert strategy.can_handle('file.xlsx') is True
129
+ assert strategy.can_handle('file.xls') is True
130
+ assert strategy.can_handle('file.XLSX') is True
131
+ assert strategy.can_handle('file.csv') is False
132
+
133
+ def test_load_excel(self, temp_excel_file):
134
+ """Test loading Excel file."""
135
+ strategy = ExcelLoadStrategy()
136
+ df = strategy.load(temp_excel_file)
137
+ assert isinstance(df, pd.DataFrame)
138
+ assert len(df) > 0
139
+
140
+
141
+ class TestJSONLoadStrategy:
142
+ """Test suite for JSONLoadStrategy."""
143
+
144
+ def test_can_handle_json(self):
145
+ """Test JSON file detection."""
146
+ strategy = JSONLoadStrategy()
147
+ assert strategy.can_handle('file.json') is True
148
+ assert strategy.can_handle('file.JSON') is True
149
+ assert strategy.can_handle('file.csv') is False
150
+
151
+ def test_load_json(self, temp_json_file):
152
+ """Test loading JSON file."""
153
+ strategy = JSONLoadStrategy()
154
+ df = strategy.load(temp_json_file)
155
+ assert isinstance(df, pd.DataFrame)
156
+ assert len(df) > 0
157
+
158
+
159
+ class TestParquetLoadStrategy:
160
+ """Test suite for ParquetLoadStrategy."""
161
+
162
+ def test_can_handle_parquet(self):
163
+ """Test Parquet file detection."""
164
+ strategy = ParquetLoadStrategy()
165
+ assert strategy.can_handle('file.parquet') is True
166
+ assert strategy.can_handle('file.PARQUET') is True
167
+ assert strategy.can_handle('file.csv') is False
168
+
169
+
170
+ # ============================================================================
171
+ # DATA LOADER TESTS
172
+ # ============================================================================
173
+
174
+ class TestDataLoader:
175
+ """Test suite for DataLoader class."""
176
+
177
+ def test_initialization(self):
178
+ """Test DataLoader initialization."""
179
+ loader = DataLoader()
180
+ assert len(loader.strategies) >= 4
181
+
182
+ def test_load_csv(self, temp_csv_file):
183
+ """Test loading CSV through DataLoader."""
184
+ loader = DataLoader()
185
+ df = loader.load_data(temp_csv_file)
186
+ assert isinstance(df, pd.DataFrame)
187
+ assert len(df) == 5
188
+
189
+ def test_load_excel(self, temp_excel_file):
190
+ """Test loading Excel through DataLoader."""
191
+ loader = DataLoader()
192
+ df = loader.load_data(temp_excel_file)
193
+ assert isinstance(df, pd.DataFrame)
194
+ assert len(df) == 5
195
+
196
+ def test_load_json(self, temp_json_file):
197
+ """Test loading JSON through DataLoader."""
198
+ loader = DataLoader()
199
+ df = loader.load_data(temp_json_file)
200
+ assert isinstance(df, pd.DataFrame)
201
+
202
+ def test_load_nonexistent_file(self):
203
+ """Test loading non-existent file."""
204
+ loader = DataLoader()
205
+ with pytest.raises(FileNotFoundError):
206
+ loader.load_data('nonexistent.csv')
207
+
208
+ def test_add_strategy(self):
209
+ """Test adding new strategy."""
210
+ loader = DataLoader()
211
+ initial_count = len(loader.strategies)
212
+
213
+ # Create mock strategy
214
+ class MockStrategy(DataLoadStrategy):
215
+ def can_handle(self, filepath):
216
+ return False
217
+
218
+ def load(self, filepath):
219
+ return pd.DataFrame()
220
+
221
+ loader.add_strategy(MockStrategy())
222
+ assert len(loader.strategies) == initial_count + 1
223
+
224
+
225
+ # ============================================================================
226
+ # DATA CLEANER TESTS
227
+ # ============================================================================
228
+
229
+ class TestDataCleaner:
230
+ """Test suite for DataCleaner class."""
231
+
232
+ def test_handle_missing_none(self, dataframe_with_missing):
233
+ """Test 'none' strategy - no changes."""
234
+ df_cleaned = DataCleaner.handle_missing_values(dataframe_with_missing, strategy='none')
235
+ assert df_cleaned.isnull().sum().sum() == dataframe_with_missing.isnull().sum().sum()
236
+
237
+ def test_handle_missing_drop(self, dataframe_with_missing):
238
+ """Test dropping rows with missing values."""
239
+ df_cleaned = DataCleaner.handle_missing_values(dataframe_with_missing, strategy='drop')
240
+ assert df_cleaned.isnull().sum().sum() == 0
241
+ assert len(df_cleaned) < len(dataframe_with_missing)
242
+
243
+ def test_handle_missing_fill_mean(self, dataframe_with_missing):
244
+ """Test filling with mean."""
245
+ df_cleaned = DataCleaner.handle_missing_values(dataframe_with_missing, strategy='fill_mean')
246
+ numerical_cols = df_cleaned.select_dtypes(include=[np.number]).columns
247
+ for col in numerical_cols:
248
+ assert df_cleaned[col].isnull().sum() == 0
249
+
250
+ def test_handle_missing_fill_median(self, dataframe_with_missing):
251
+ """Test filling with median."""
252
+ df_cleaned = DataCleaner.handle_missing_values(dataframe_with_missing, strategy='fill_median')
253
+ numerical_cols = df_cleaned.select_dtypes(include=[np.number]).columns
254
+ for col in numerical_cols:
255
+ assert df_cleaned[col].isnull().sum() == 0
256
+
257
+ def test_handle_missing_fill_mode(self, dataframe_with_missing):
258
+ """Test filling with mode."""
259
+ df_cleaned = DataCleaner.handle_missing_values(dataframe_with_missing, strategy='fill_mode')
260
+ # Check categorical columns are filled
261
+ assert df_cleaned['col2'].isnull().sum() == 0
262
+
263
+ def test_convert_data_types(self):
264
+ """Test automatic data type conversion."""
265
+ df = pd.DataFrame({
266
+ 'price': ['$100', '$200', '$300'],
267
+ 'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
268
+ 'bool_col': ['TRUE', 'FALSE', 'TRUE']
269
+ })
270
+
271
+ df_converted = DataCleaner.convert_data_types(df)
272
+
273
+ # Check currency conversion
274
+ assert pd.api.types.is_numeric_dtype(df_converted['price'])
275
+
276
+ # Check date conversion
277
+ assert pd.api.types.is_datetime64_any_dtype(df_converted['date'])
278
+
279
+ def test_remove_duplicates(self, dataframe_with_duplicates):
280
+ """Test removing duplicate rows."""
281
+ df_cleaned = DataCleaner.remove_duplicates(dataframe_with_duplicates)
282
+ assert len(df_cleaned) < len(dataframe_with_duplicates)
283
+ assert df_cleaned.duplicated().sum() == 0
284
+
285
+ def test_handle_outliers_zscore(self):
286
+ """Test outlier removal using z-score."""
287
+ df = pd.DataFrame({
288
+ 'values': [1, 2, 3, 4, 5, 100] # 100 is an outlier
289
+ })
290
+
291
+ df_cleaned = DataCleaner.handle_outliers(df, ['values'], method='zscore', threshold=2.0)
292
+ assert len(df_cleaned) < len(df)
293
+ assert 100 not in df_cleaned['values'].values
294
+
295
+ def test_handle_outliers_iqr(self):
296
+ """Test outlier removal using IQR."""
297
+ df = pd.DataFrame({
298
+ 'values': [1, 2, 3, 4, 5, 100] # 100 is an outlier
299
+ })
300
+
301
+ df_cleaned = DataCleaner.handle_outliers(df, ['values'], method='iqr', threshold=1.5)
302
+ assert len(df_cleaned) < len(df)
303
+
304
+
305
+ # ============================================================================
306
+ # DATA PROFILER TESTS
307
+ # ============================================================================
308
+
309
+ class TestDataProfiler:
310
+ """Test suite for DataProfiler class."""
311
+
312
+ def test_initialization(self, sample_dataframe):
313
+ """Test DataProfiler initialization."""
314
+ profiler = DataProfiler(sample_dataframe)
315
+ assert profiler.df is not None
316
+
317
+ def test_initialization_empty_dataframe(self):
318
+ """Test initialization with empty DataFrame."""
319
+ with pytest.raises(ValueError):
320
+ DataProfiler(pd.DataFrame())
321
+
322
+ def test_get_basic_info(self, sample_dataframe):
323
+ """Test getting basic info."""
324
+ profiler = DataProfiler(sample_dataframe)
325
+ info = profiler.get_basic_info()
326
+
327
+ assert info['rows'] == 5
328
+ assert info['columns'] == 6
329
+ assert 'column_names' in info
330
+ assert 'data_types' in info
331
+ assert 'memory_usage' in info
332
+
333
+ def test_get_missing_values_report(self, dataframe_with_missing):
334
+ """Test missing values report."""
335
+ profiler = DataProfiler(dataframe_with_missing)
336
+ report = profiler.get_missing_values_report()
337
+
338
+ assert isinstance(report, pd.DataFrame)
339
+ assert len(report) > 0
340
+ assert 'Missing_Count' in report.columns
341
+ assert 'Missing_Percentage' in report.columns
342
+
343
+ def test_get_numerical_summary(self, sample_dataframe):
344
+ """Test numerical summary statistics."""
345
+ profiler = DataProfiler(sample_dataframe)
346
+ summary = profiler.get_numerical_summary()
347
+
348
+ assert isinstance(summary, pd.DataFrame)
349
+ assert 'age' in summary.columns
350
+ assert 'salary' in summary.columns
351
+
352
+ def test_get_categorical_summary(self, sample_dataframe):
353
+ """Test categorical summary statistics."""
354
+ profiler = DataProfiler(sample_dataframe)
355
+ summary = profiler.get_categorical_summary()
356
+
357
+ assert isinstance(summary, dict)
358
+ assert 'department' in summary
359
+ assert 'unique_count' in summary['department']
360
+ assert 'top_value' in summary['department']
361
+
362
+ def test_get_correlation_matrix(self, sample_dataframe):
363
+ """Test correlation matrix generation."""
364
+ profiler = DataProfiler(sample_dataframe)
365
+ corr_matrix = profiler.get_correlation_matrix()
366
+
367
+ assert isinstance(corr_matrix, pd.DataFrame)
368
+ assert 'age' in corr_matrix.columns
369
+ assert 'salary' in corr_matrix.columns
370
+
371
+ def test_get_full_profile(self, sample_dataframe):
372
+ """Test full profile generation."""
373
+ profiler = DataProfiler(sample_dataframe)
374
+ profile = profiler.get_full_profile()
375
+
376
+ assert 'basic_info' in profile
377
+ assert 'missing_values' in profile
378
+ assert 'numerical_summary' in profile
379
+ assert 'categorical_summary' in profile
380
+ assert 'correlation_matrix' in profile
381
+
382
+
383
+ # ============================================================================
384
+ # DATA FILTER TESTS
385
+ # ============================================================================
386
+
387
+ class TestDataFilter:
388
+ """Test suite for DataFilter class."""
389
+
390
+ def test_filter_numerical_min(self, sample_dataframe):
391
+ """Test filtering with minimum value."""
392
+ filtered = DataFilter.filter_numerical(sample_dataframe, 'age', min_val=30)
393
+ assert len(filtered) == 4
394
+ assert filtered['age'].min() >= 30
395
+
396
+ def test_filter_numerical_max(self, sample_dataframe):
397
+ """Test filtering with maximum value."""
398
+ filtered = DataFilter.filter_numerical(sample_dataframe, 'age', max_val=35)
399
+ assert len(filtered) == 3
400
+ assert filtered['age'].max() <= 35
401
+
402
+ def test_filter_numerical_range(self, sample_dataframe):
403
+ """Test filtering with range."""
404
+ filtered = DataFilter.filter_numerical(sample_dataframe, 'age', min_val=30, max_val=40)
405
+ assert len(filtered) == 3
406
+ assert filtered['age'].min() >= 30
407
+ assert filtered['age'].max() <= 40
408
+
409
+ def test_filter_categorical(self, sample_dataframe):
410
+ """Test categorical filtering."""
411
+ filtered = DataFilter.filter_categorical(sample_dataframe, 'department', ['IT', 'HR'])
412
+ assert len(filtered) == 4
413
+ assert all(filtered['department'].isin(['IT', 'HR']))
414
+
415
+ def test_filter_categorical_empty_values(self, sample_dataframe):
416
+ """Test categorical filtering with empty values list."""
417
+ filtered = DataFilter.filter_categorical(sample_dataframe, 'department', [])
418
+ assert len(filtered) == len(sample_dataframe)
419
+
420
+ def test_filter_date_range(self, sample_dataframe):
421
+ """Test date range filtering."""
422
+ start_date = pd.Timestamp('2020-01-02')
423
+ end_date = pd.Timestamp('2020-01-04')
424
+
425
+ filtered = DataFilter.filter_date_range(sample_dataframe, 'hire_date', start_date, end_date)
426
+ assert len(filtered) == 3
427
+
428
+ def test_apply_multiple_filters(self, sample_dataframe):
429
+ """Test applying multiple filters."""
430
+ filters = [
431
+ {'type': 'numerical', 'column': 'age', 'min_val': 30, 'max_val': 40},
432
+ {'type': 'categorical', 'column': 'department', 'values': ['IT', 'Finance']}
433
+ ]
434
+
435
+ filtered = DataFilter.apply_multiple_filters(sample_dataframe, filters)
436
+ assert len(filtered) <= len(sample_dataframe)
437
+
438
+ def test_filter_invalid_column(self, sample_dataframe):
439
+ """Test filtering with invalid column."""
440
+ with pytest.raises(ValueError):
441
+ DataFilter.filter_numerical(sample_dataframe, 'nonexistent', min_val=0)
442
+
443
+
444
+ # ============================================================================
445
+ # DATA PROCESSOR TESTS (Facade)
446
+ # ============================================================================
447
+
448
+ class TestDataProcessor:
449
+ """Test suite for DataProcessor class (Facade)."""
450
+
451
+ def test_initialization(self):
452
+ """Test DataProcessor initialization."""
453
+ processor = DataProcessor()
tests/test_insights.py ADDED
@@ -0,0 +1,554 @@
1
+ """
2
+ Unit Tests for Insights Module
3
+
4
+ Comprehensive tests for all insight strategies and the insight manager.
5
+
6
+ Author: Craig
7
+ Date: December 2024
8
+ """
9
+
10
+ import pytest
11
+ import pandas as pd
12
+ import numpy as np
13
+ from datetime import datetime, timedelta
14
+
15
+ from insights import (
16
+ InsightStrategy, TopBottomPerformers, TrendAnalysis,
17
+ AnomalyDetection, DistributionInsights, CorrelationInsights,
18
+ InsightManager
19
+ )
20
+
21
+
22
+ # ============================================================================
23
+ # FIXTURES
24
+ # ============================================================================
25
+
26
+ @pytest.fixture
27
+ def sales_data():
28
+ """Create sample sales data."""
29
+ return pd.DataFrame({
30
+ 'product': ['A', 'B', 'C', 'D', 'E'] * 20,
31
+ 'sales': np.random.randint(100, 1000, 100),
32
+ 'revenue': np.random.uniform(1000, 5000, 100),
33
+ 'region': np.random.choice(['North', 'South', 'East', 'West'], 100)
34
+ })
35
+
36
+
37
+ @pytest.fixture
38
+ def time_series_data():
39
+ """Create sample time series data."""
40
+ dates = pd.date_range('2024-01-01', periods=100, freq='D')
41
+ values = np.cumsum(np.random.randn(100)) + 100 # Random walk offset to start near 100
42
+ return pd.DataFrame({
43
+ 'date': dates,
44
+ 'value': values,
45
+ 'sales': np.random.randint(50, 200, 100)
46
+ })
47
+
48
+
49
+ @pytest.fixture
50
+ def anomaly_data():
51
+ """Create data with anomalies."""
52
+ # Normal data with a few outliers
53
+ normal = np.random.normal(100, 10, 95)
54
+ outliers = np.array([200, 10, 250, 5, 220])
55
+ data = np.concatenate([normal, outliers])
56
+ np.random.shuffle(data)
57
+
58
+ return pd.DataFrame({
59
+ 'values': data,
60
+ 'category': np.random.choice(['A', 'B', 'C'], 100)
61
+ })
62
+
63
+
64
+ @pytest.fixture
65
+ def correlation_data():
66
+ """Create data with correlations."""
67
+ np.random.seed(42)
68
+ x = np.random.normal(50, 10, 100)
69
+ y = 2 * x + np.random.normal(0, 5, 100) # Strong positive correlation
70
+ z = -1.5 * x + np.random.normal(0, 8, 100) # Strong negative correlation
71
+ w = np.random.normal(100, 15, 100) # No correlation
72
+
73
+ return pd.DataFrame({
74
+ 'var_x': x,
75
+ 'var_y': y,
76
+ 'var_z': z,
77
+ 'var_w': w
78
+ })
79
+
80
+
81
+ @pytest.fixture
82
+ def mixed_data():
83
+ """Create data with mixed types."""
84
+ return pd.DataFrame({
85
+ 'numerical': np.random.normal(100, 15, 100),
86
+ 'categorical': np.random.choice(['Cat1', 'Cat2', 'Cat3'], 100),
87
+ 'date': pd.date_range('2024-01-01', periods=100),
88
+ 'sales': np.random.randint(50, 500, 100)
89
+ })
90
+
91
+
92
+ # ============================================================================
93
+ # TOP/BOTTOM PERFORMERS TESTS
94
+ # ============================================================================
95
+
96
+ class TestTopBottomPerformers:
97
+ """Test suite for TopBottomPerformers class."""
98
+
99
+ def test_initialization(self):
100
+ """Test TopBottomPerformers initialization."""
101
+ insight = TopBottomPerformers()
102
+ assert insight is not None
103
+
104
+ def test_get_insight_type(self):
105
+ """Test getting insight type."""
106
+ insight = TopBottomPerformers()
107
+ assert insight.get_insight_type() == "top_bottom_performers"
108
+
109
+ def test_generate_simple(self, sales_data):
110
+ """Test generating simple top/bottom insights."""
111
+ insight = TopBottomPerformers()
112
+ result = insight.generate(sales_data, column='sales')
113
+
114
+ assert result['type'] == 'top_bottom_performers'
115
+ assert 'top_performers' in result
116
+ assert 'bottom_performers' in result
117
+ assert 'summary' in result
118
+
119
+ def test_generate_with_groupby(self, sales_data):
120
+ """Test generating insights with groupby."""
121
+ insight = TopBottomPerformers()
122
+ result = insight.generate(
123
+ sales_data,
124
+ column='sales',
125
+ group_by='product',
126
+ aggregation='sum'
127
+ )
128
+
129
+ assert result['group_by'] == 'product'
130
+ assert result['aggregation'] == 'sum'
131
+ assert len(result['top_performers']['data']) > 0
132
+
133
+ def test_generate_with_custom_n(self, sales_data):
134
+ """Test with custom top_n and bottom_n."""
135
+ insight = TopBottomPerformers()
136
+ result = insight.generate(
137
+ sales_data,
138
+ column='sales',
139
+ top_n=3,
140
+ bottom_n=3
141
+ )
142
+
143
+ assert result['top_performers']['count'] <= 3
144
+ assert result['bottom_performers']['count'] <= 3
145
+
146
+ def test_invalid_column(self, sales_data):
147
+ """Test with invalid column."""
148
+ insight = TopBottomPerformers()
149
+ with pytest.raises(ValueError):
150
+ insight.generate(sales_data, column='nonexistent')
151
+
152
+
153
+ # ============================================================================
154
+ # TREND ANALYSIS TESTS
155
+ # ============================================================================
156
+
157
+ class TestTrendAnalysis:
158
+ """Test suite for TrendAnalysis class."""
159
+
160
+ def test_initialization(self):
161
+ """Test TrendAnalysis initialization."""
162
+ insight = TrendAnalysis()
163
+ assert insight is not None
164
+
165
+ def test_get_insight_type(self):
166
+ """Test getting insight type."""
167
+ insight = TrendAnalysis()
168
+ assert insight.get_insight_type() == "trend_analysis"
169
+
170
+ def test_generate_trend(self, time_series_data):
171
+ """Test generating trend insights."""
172
+ insight = TrendAnalysis()
173
+ result = insight.generate(
174
+ time_series_data,
175
+ date_column='date',
176
+ value_column='value'
177
+ )
178
+
179
+ assert result['type'] == 'trend_analysis'
180
+ assert 'trend_direction' in result
181
+ assert 'metrics' in result
182
+ assert 'date_range' in result
183
+ assert 'summary' in result
184
+
185
+ def test_trend_metrics(self, time_series_data):
186
+ """Test trend metrics calculation."""
187
+ insight = TrendAnalysis()
188
+ result = insight.generate(
189
+ time_series_data,
190
+ date_column='date',
191
+ value_column='value'
192
+ )
193
+
194
+ metrics = result['metrics']
195
+ assert 'first_value' in metrics
196
+ assert 'last_value' in metrics
197
+ assert 'absolute_change' in metrics
198
+ assert 'percentage_change' in metrics
199
+ assert 'growth_rate' in metrics
200
+ assert 'volatility' in metrics
201
+
202
+ def test_insufficient_data(self):
203
+ """Test with insufficient data."""
204
+ df = pd.DataFrame({
205
+ 'date': [pd.Timestamp('2024-01-01')],
206
+ 'value': [100]
207
+ })
208
+
209
+ insight = TrendAnalysis()
210
+ result = insight.generate(df, date_column='date', value_column='value')
211
+
212
+ assert 'error' in result
213
+
214
+ def test_invalid_columns(self, time_series_data):
215
+ """Test with invalid columns."""
216
+ insight = TrendAnalysis()
217
+ with pytest.raises(ValueError):
218
+ insight.generate(
219
+ time_series_data,
220
+ date_column='nonexistent',
221
+ value_column='value'
222
+ )
223
+
224
+
225
+ # ============================================================================
226
+ # ANOMALY DETECTION TESTS
227
+ # ============================================================================
228
+
229
+ class TestAnomalyDetection:
230
+ """Test suite for AnomalyDetection class."""
231
+
232
+ def test_initialization(self):
233
+ """Test AnomalyDetection initialization."""
234
+ insight = AnomalyDetection()
235
+ assert insight is not None
236
+
237
+ def test_get_insight_type(self):
238
+ """Test getting insight type."""
239
+ insight = AnomalyDetection()
240
+ assert insight.get_insight_type() == "anomaly_detection"
241
+
242
+ def test_detect_zscore(self, anomaly_data):
243
+ """Test Z-score anomaly detection."""
244
+ insight = AnomalyDetection()
245
+ result = insight.generate(
246
+ anomaly_data,
247
+ column='values',
248
+ method='zscore',
249
+ threshold=2.5
250
+ )
251
+
252
+ assert result['type'] == 'anomaly_detection'
253
+ assert result['method'] == 'zscore'
254
+ assert 'statistics' in result
255
+ assert 'anomalies' in result
256
+
257
+ def test_detect_iqr(self, anomaly_data):
258
+ """Test IQR anomaly detection."""
259
+ insight = AnomalyDetection()
260
+ result = insight.generate(
261
+ anomaly_data,
262
+ column='values',
263
+ method='iqr',
264
+ threshold=1.5
265
+ )
266
+
267
+ assert result['method'] == 'iqr'
268
+ assert result['statistics']['anomaly_count'] >= 0
269
+
270
+ def test_no_anomalies(self):
271
+ """Test when no anomalies are found."""
272
+ df = pd.DataFrame({
273
+ 'values': np.random.normal(100, 1, 100) # Very tight distribution
274
+ })
275
+
276
+ insight = AnomalyDetection()
277
+ result = insight.generate(df, column='values', threshold=10)
278
+
279
+ assert result['statistics']['anomaly_count'] == 0
280
+
281
+ def test_non_numerical_column(self, sales_data):
282
+ """Test with non-numerical column."""
283
+ insight = AnomalyDetection()
284
+ result = insight.generate(sales_data, column='product')
285
+
286
+ assert 'error' in result
287
+
288
+ def test_invalid_method(self, anomaly_data):
289
+ """Test with invalid method."""
290
+ insight = AnomalyDetection()
291
+ with pytest.raises(ValueError):
292
+ insight.generate(anomaly_data, column='values', method='invalid')
293
+
294
+
295
+ # ============================================================================
296
+ # DISTRIBUTION INSIGHTS TESTS
297
+ # ============================================================================
298
+
299
+ class TestDistributionInsights:
300
+ """Test suite for DistributionInsights class."""
301
+
302
+ def test_initialization(self):
303
+ """Test DistributionInsights initialization."""
304
+ insight = DistributionInsights()
305
+ assert insight is not None
306
+
307
+ def test_get_insight_type(self):
308
+ """Test getting insight type."""
309
+ insight = DistributionInsights()
310
+ assert insight.get_insight_type() == "distribution_insights"
311
+
312
+ def test_numerical_distribution(self, sales_data):
313
+ """Test numerical distribution analysis."""
314
+ insight = DistributionInsights()
315
+ result = insight.generate(sales_data, column='sales')
316
+
317
+ assert result['type'] == 'distribution_insights'
318
+ assert result['data_type'] == 'numerical'
319
+ assert 'statistics' in result
320
+ assert 'distribution_shape' in result
321
+
322
+ def test_numerical_statistics(self, sales_data):
323
+ """Test numerical statistics calculation."""
324
+ insight = DistributionInsights()
325
+ result = insight.generate(sales_data, column='sales')
326
+
327
+ stats = result['statistics']
328
+ assert 'mean' in stats
329
+ assert 'median' in stats
330
+ assert 'std' in stats
331
+ assert 'skewness' in stats
332
+ assert 'kurtosis' in stats
333
+
334
+ def test_categorical_distribution(self, sales_data):
335
+ """Test categorical distribution analysis."""
336
+ insight = DistributionInsights()
337
+ result = insight.generate(sales_data, column='product')
338
+
339
+ assert result['data_type'] == 'categorical'
340
+ assert 'value_counts' in result
341
+ assert 'most_common' in result['statistics']
342
+
343
+ def test_empty_column(self):
344
+ """Test with empty column."""
345
+ df = pd.DataFrame({'col': [np.nan, np.nan, np.nan]})
346
+
347
+ insight = DistributionInsights()
348
+ result = insight.generate(df, column='col')
349
+
350
+ assert 'error' in result
351
+
352
+
353
+ # ============================================================================
354
+ # CORRELATION INSIGHTS TESTS
355
+ # ============================================================================
356
+
357
+ class TestCorrelationInsights:
358
+ """Test suite for CorrelationInsights class."""
359
+
360
+ def test_initialization(self):
361
+ """Test CorrelationInsights initialization."""
362
+ insight = CorrelationInsights()
363
+ assert insight is not None
364
+
365
+ def test_get_insight_type(self):
366
+ """Test getting insight type."""
367
+ insight = CorrelationInsights()
368
+ assert insight.get_insight_type() == "correlation_insights"
369
+
370
+ def test_generate_correlations(self, correlation_data):
371
+ """Test generating correlation insights."""
372
+ insight = CorrelationInsights()
373
+ result = insight.generate(correlation_data, threshold=0.5)
374
+
375
+ assert result['type'] == 'correlation_insights'
376
+ assert 'strong_correlations_found' in result
377
+ assert 'correlations' in result
378
+
379
+ def test_strong_correlations_found(self, correlation_data):
380
+ """Test that strong correlations are found."""
381
+ insight = CorrelationInsights()
382
+ result = insight.generate(correlation_data, threshold=0.7)
383
+
384
+ # Should find strong correlations in our test data
385
+ assert result['strong_correlations_found'] > 0
386
+
387
+ def test_correlation_details(self, correlation_data):
388
+ """Test correlation details."""
389
+ insight = CorrelationInsights()
390
+ result = insight.generate(correlation_data, threshold=0.5)
391
+
392
+ if len(result['correlations']) > 0:
393
+ corr = result['correlations'][0]
394
+ assert 'variable1' in corr
395
+ assert 'variable2' in corr
396
+ assert 'correlation' in corr
397
+ assert 'strength' in corr
398
+ assert 'direction' in corr
399
+
400
+ def test_different_methods(self, correlation_data):
401
+ """Test different correlation methods."""
402
+ insight = CorrelationInsights()
403
+
404
+ # Pearson
405
+ result1 = insight.generate(correlation_data, method='pearson')
406
+ assert result1['method'] == 'pearson'
407
+
408
+ # Spearman
409
+ result2 = insight.generate(correlation_data, method='spearman')
410
+ assert result2['method'] == 'spearman'
411
+
412
+ def test_insufficient_columns(self):
413
+ """Test with insufficient numerical columns."""
414
+ df = pd.DataFrame({'col': [1, 2, 3]})
415
+
416
+ insight = CorrelationInsights()
417
+ result = insight.generate(df)
418
+
419
+ assert 'error' in result
420
+
421
+
422
+ # ============================================================================
423
+ # INSIGHT MANAGER TESTS
424
+ # ============================================================================
425
+
426
+ class TestInsightManager:
427
+ """Test suite for InsightManager class."""
428
+
429
+ def test_initialization(self):
430
+ """Test InsightManager initialization."""
431
+ manager = InsightManager()
432
+ assert manager is not None
433
+ assert len(manager.strategies) >= 5
434
+
435
+ def test_get_available_insights(self):
436
+ """Test getting available insights."""
437
+ manager = InsightManager()
438
+ available = manager.get_available_insights()
439
+
440
+ assert 'top_bottom' in available
441
+ assert 'trend' in available
442
+ assert 'anomaly' in available
443
+ assert 'distribution' in available
444
+ assert 'correlation' in available
445
+
446
+ def test_generate_top_bottom(self, sales_data):
447
+ """Test generating top/bottom insight through manager."""
448
+ manager = InsightManager()
449
+ result = manager.generate_insight(
450
+ 'top_bottom',
451
+ sales_data,
452
+ column='sales'
453
+ )
454
+
455
+ assert result['type'] == 'top_bottom_performers'
456
+
457
+ def test_generate_trend(self, time_series_data):
458
+ """Test generating trend insight through manager."""
459
+ manager = InsightManager()
460
+ result = manager.generate_insight(
461
+ 'trend',
462
+ time_series_data,
463
+ date_column='date',
464
+ value_column='value'
465
+ )
466
+
467
+ assert result['type'] == 'trend_analysis'
468
+
469
+ def test_generate_anomaly(self, anomaly_data):
470
+ """Test generating anomaly insight through manager."""
471
+ manager = InsightManager()
472
+ result = manager.generate_insight(
473
+ 'anomaly',
474
+ anomaly_data,
475
+ column='values'
476
+ )
477
+
478
+ assert result['type'] == 'anomaly_detection'
479
+
480
+ def test_generate_distribution(self, sales_data):
481
+ """Test generating distribution insight through manager."""
482
+ manager = InsightManager()
483
+ result = manager.generate_insight(
484
+ 'distribution',
485
+ sales_data,
486
+ column='sales'
487
+ )
488
+
489
+ assert result['type'] == 'distribution_insights'
490
+
491
+ def test_generate_correlation(self, correlation_data):
492
+ """Test generating correlation insight through manager."""
493
+ manager = InsightManager()
494
+ result = manager.generate_insight(
495
+ 'correlation',
496
+ correlation_data
497
+ )
498
+
499
+ assert result['type'] == 'correlation_insights'
500
+
501
+ def test_unsupported_insight_type(self, sales_data):
502
+ """Test with unsupported insight type."""
503
+ manager = InsightManager()
504
+
505
+ with pytest.raises(ValueError, match="Unsupported insight type"):
506
+ manager.generate_insight('invalid_type', sales_data)
507
+
508
+ def test_generate_all_insights(self, mixed_data):
509
+ """Test generating all insights."""
510
+ manager = InsightManager()
511
+ results = manager.generate_all_insights(mixed_data)
512
+
513
+ assert isinstance(results, dict)
514
+ # Should generate at least some insights
515
+ assert len(results) > 0
516
+
517
+ def test_add_strategy(self):
518
+ """Test adding new strategy."""
519
+ manager = InsightManager()
520
+ initial_count = len(manager.strategies)
521
+
522
+ # Create mock strategy
523
+ class MockStrategy(InsightStrategy):
524
+ def generate(self, df, **kwargs):
525
+ return {'type': 'mock'}
526
+
527
+ def get_insight_type(self):
528
+ return 'mock'
529
+
530
+ manager.add_strategy('mock', MockStrategy())
531
+ assert len(manager.strategies) == initial_count + 1
532
+ assert 'mock' in manager.get_available_insights()
533
+
534
+ def test_format_insight_report(self, sales_data):
535
+ """Test formatting insight report."""
536
+ manager = InsightManager()
537
+ insights = {
538
+ 'top_bottom': manager.generate_insight(
539
+ 'top_bottom', sales_data, column='sales'
540
+ )
541
+ }
542
+
543
+ report = manager.format_insight_report(insights)
544
+ assert isinstance(report, str)
545
+ assert 'INSIGHTS REPORT' in report
546
+ assert 'TOP BOTTOM' in report
547
+
548
+
549
+ # ============================================================================
550
+ # RUN TESTS
551
+ # ============================================================================
552
+
553
+ if __name__ == "__main__":
554
+ pytest.main([__file__, "-v", "--tb=short"])
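
Note that only the `correlation_data` fixture seeds NumPy (`np.random.seed(42)`); the other fixtures draw unseeded random data, so their exact values differ between runs. If fully repeatable runs were wanted, one option — a sketch, not part of the commit — is an autouse fixture that reseeds before every test in the module:

```python
import numpy as np
import pytest


@pytest.fixture(autouse=True)
def _seed_numpy():
    # Reseed NumPy's global RNG before each test so fixtures such as
    # sales_data and anomaly_data produce identical draws on every run.
    np.random.seed(0)
```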
tests/test_utils.py ADDED
@@ -0,0 +1,436 @@
1
+ """
2
+ Unit Tests for Utils Module
3
+
4
+ Tests all utility functions and classes following best practices.
5
+ Uses pytest framework for comprehensive testing.
6
+
7
+ Author: Craig
8
+ Date: December 2024
9
+ """
10
+
11
+ import pytest
12
+ import pandas as pd
13
+ import numpy as np
14
+ from pathlib import Path
15
+ import tempfile
16
+ import os
17
+
18
+ from utils import (
19
+ FileValidator, DataFrameValidator, ColumnValidator,
20
+ format_number, format_percentage, safe_divide,
21
+ get_column_types, detect_date_columns, clean_currency_column,
22
+ truncate_string, get_memory_usage,
23
+ CSVExporter, ExcelExporter, Config
24
+ )
25
+
26
+
27
+ # ============================================================================
28
+ # FIXTURES
29
+ # Reusable test data following DRY principle
30
+ # ============================================================================
31
+
32
+ @pytest.fixture
33
+ def sample_dataframe():
34
+ """Create a sample DataFrame for testing."""
35
+ return pd.DataFrame({
36
+ 'age': [25, 30, 35, 40],
37
+ 'name': ['Alice', 'Bob', 'Charlie', 'David'],
38
+ 'salary': [50000, 60000, 70000, 80000],
39
+ 'date': pd.date_range('2024-01-01', periods=4)
40
+ })
41
+
42
+
43
+ @pytest.fixture
44
+ def empty_dataframe():
45
+ """Create an empty DataFrame for testing."""
46
+ return pd.DataFrame()
47
+
48
+
49
+ @pytest.fixture
50
+ def temp_csv_file():
51
+ """Create a temporary CSV file."""
52
+ with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
53
+ f.write('col1,col2\n1,2\n3,4\n')
54
+ temp_path = f.name
55
+ yield temp_path
56
+ # Cleanup
57
+ if os.path.exists(temp_path):
58
+ os.remove(temp_path)
59
+
60
+
61
+ @pytest.fixture
62
+ def temp_xlsx_file():
63
+ """Create a temporary Excel file."""
64
+ temp_path = tempfile.mktemp(suffix='.xlsx')
65
+ df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
66
+ df.to_excel(temp_path, index=False)
67
+ yield temp_path
68
+ # Cleanup
69
+ if os.path.exists(temp_path):
70
+ os.remove(temp_path)
71
+
72
+
73
+ # ============================================================================
74
+ # VALIDATOR TESTS
75
+ # ============================================================================
76
+
77
+ class TestFileValidator:
78
+ """Test suite for FileValidator class."""
79
+
80
+ def test_validate_existing_csv(self, temp_csv_file):
81
+ """Test validation of existing CSV file."""
82
+ validator = FileValidator()
83
+ assert validator.validate(temp_csv_file) is True
84
+
85
+ def test_validate_existing_xlsx(self, temp_xlsx_file):
86
+ """Test validation of existing Excel file."""
87
+ validator = FileValidator()
88
+ assert validator.validate(temp_xlsx_file) is True
89
+
90
+ def test_validate_nonexistent_file(self):
91
+ """Test validation of non-existent file."""
92
+ validator = FileValidator()
93
+ with pytest.raises(FileNotFoundError):
94
+ validator.validate('nonexistent_file.csv')
95
+
96
+ def test_validate_unsupported_format(self):
97
+ """Test validation of unsupported file format."""
98
+ validator = FileValidator()
99
+ with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as f:
100
+ temp_path = f.name
101
+
102
+ try:
103
+ with pytest.raises(ValueError, match="Unsupported file format"):
104
+ validator.validate(temp_path)
105
+ finally:
106
+ if os.path.exists(temp_path):
107
+ os.remove(temp_path)
108
+
109
+ def test_supported_formats(self):
110
+ """Test that all expected formats are supported."""
111
+ validator = FileValidator()
112
+ expected_formats = {'.csv', '.xlsx', '.xls', '.parquet', '.json', '.tsv'}
113
+ assert validator.SUPPORTED_FORMATS == expected_formats
114
+
115
+
116
+ class TestDataFrameValidator:
117
+ """Test suite for DataFrameValidator class."""
118
+
119
+ def test_validate_valid_dataframe(self, sample_dataframe):
120
+ """Test validation of valid DataFrame."""
121
+ validator = DataFrameValidator()
122
+ assert validator.validate(sample_dataframe) is True
123
+
124
+ def test_validate_empty_dataframe(self, empty_dataframe):
125
+ """Test validation of empty DataFrame."""
126
+ validator = DataFrameValidator()
127
+ with pytest.raises(ValueError, match="DataFrame is empty"):
128
+ validator.validate(empty_dataframe)
129
+
130
+ def test_validate_none_dataframe(self):
131
+ """Test validation of None DataFrame."""
132
+ validator = DataFrameValidator()
133
+ with pytest.raises(ValueError, match="DataFrame cannot be None"):
134
+ validator.validate(None)
135
+
136
+ def test_validate_wrong_type(self):
137
+ """Test validation of wrong data type."""
138
+ validator = DataFrameValidator()
139
+ with pytest.raises(ValueError, match="Expected pandas DataFrame"):
140
+ validator.validate([1, 2, 3])
141
+
142
+
143
+ class TestColumnValidator:
144
+ """Test suite for ColumnValidator class."""
145
+
146
+ def test_validate_existing_column(self, sample_dataframe):
147
+ """Test validation of existing column."""
148
+ validator = ColumnValidator()
149
+ assert validator.validate(sample_dataframe, 'age') is True
150
+
151
+ def test_validate_existing_columns_list(self, sample_dataframe):
152
+ """Test validation of multiple existing columns."""
153
+ validator = ColumnValidator()
154
+ assert validator.validate(sample_dataframe, ['age', 'name']) is True
155
+
156
+ def test_validate_missing_column(self, sample_dataframe):
157
+ """Test validation of missing column."""
158
+ validator = ColumnValidator()
159
+ with pytest.raises(ValueError, match="Columns not found"):
160
+ validator.validate(sample_dataframe, 'nonexistent')
161
+
162
+ def test_validate_partial_missing_columns(self, sample_dataframe):
163
+ """Test validation with some missing columns."""
164
+ validator = ColumnValidator()
165
+ with pytest.raises(ValueError, match="Columns not found"):
166
+ validator.validate(sample_dataframe, ['age', 'nonexistent'])
167
+
168
+
169
+ # ============================================================================
170
+ # FORMATTING FUNCTION TESTS
171
+ # ============================================================================
172
+
173
+ class TestFormatNumber:
174
+ """Test suite for format_number function."""
175
+
176
+ def test_format_integer(self):
177
+ """Test formatting integer."""
178
+ assert format_number(1234567) == "1,234,567"
179
+
180
+ def test_format_float(self):
181
+ """Test formatting float."""
182
+ assert format_number(1234567.89) == "1,234,567.89"
183
+
184
+ def test_format_with_decimals(self):
185
+ """Test formatting with specific decimal places."""
186
+ assert format_number(1234.5678, decimals=3) == "1,234.568"
187
+
188
+ def test_format_nan(self):
189
+ """Test formatting NaN value."""
190
+ assert format_number(np.nan) == "N/A"
191
+
192
+ def test_format_none(self):
193
+ """Test formatting None value."""
194
+ assert format_number(None) == "N/A"
195
+
196
+
197
+ class TestFormatPercentage:
198
+ """Test suite for format_percentage function."""
199
+
200
+ def test_format_valid_percentage(self):
201
+ """Test formatting valid percentage."""
202
+ assert format_percentage(0.456) == "45.60%"
203
+
204
+ def test_format_zero_percentage(self):
205
+ """Test formatting zero percentage."""
206
+ assert format_percentage(0.0) == "0.00%"
207
+
208
+ def test_format_one_hundred_percent(self):
209
+ """Test formatting 100%."""
210
+ assert format_percentage(1.0) == "100.00%"
211
+
212
+ def test_format_nan_percentage(self):
213
+ """Test formatting NaN percentage."""
214
+ assert format_percentage(np.nan) == "N/A"
215
+
216
+ def test_format_custom_decimals(self):
217
+ """Test formatting with custom decimal places."""
218
+ assert format_percentage(0.12345, decimals=3) == "12.345%"
219
+
220
+
221
+ class TestSafeDivide:
222
+ """Test suite for safe_divide function."""
223
+
224
+ def test_normal_division(self):
225
+ """Test normal division."""
226
+ assert safe_divide(10, 2) == 5.0
227
+
228
+ def test_division_by_zero(self):
229
+ """Test division by zero returns default."""
230
+ assert safe_divide(10, 0, default=0.0) == 0.0
231
+
232
+ def test_division_by_nan(self):
233
+ """Test division by NaN returns default."""
234
+ assert safe_divide(10, np.nan, default=-1.0) == -1.0
235
+
236
+ def test_custom_default(self):
237
+ """Test custom default value."""
238
+ assert safe_divide(10, 0, default=999) == 999
239
+
240
+
241
+ # ============================================================================
242
+ # DATA ANALYSIS FUNCTION TESTS
243
+ # ============================================================================
244
+
245
+ class TestGetColumnTypes:
246
+ """Test suite for get_column_types function."""
247
+
248
+ def test_mixed_types(self, sample_dataframe):
249
+ """Test getting column types from mixed DataFrame."""
250
+ types = get_column_types(sample_dataframe)
251
+ assert 'age' in types['numerical']
252
+ assert 'salary' in types['numerical']
253
+ assert 'name' in types['categorical']
254
+ assert 'date' in types['datetime']
255
+
256
+ def test_only_numerical(self):
257
+ """Test DataFrame with only numerical columns."""
258
+ df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
259
+ types = get_column_types(df)
260
+ assert len(types['numerical']) == 2
261
+ assert len(types['categorical']) == 0
262
+
263
+ def test_only_categorical(self):
264
+ """Test DataFrame with only categorical columns."""
265
+ df = pd.DataFrame({'a': ['x', 'y'], 'b': ['z', 'w']})
266
+ types = get_column_types(df)
267
+ assert len(types['categorical']) == 2
268
+ assert len(types['numerical']) == 0
269
+
270
+
271
+ class TestDetectDateColumns:
272
+ """Test suite for detect_date_columns function."""
273
+
274
+ def test_detect_date_string_column(self):
275
+ """Test detecting date strings."""
276
+ df = pd.DataFrame({
277
+ 'date_col': ['2024-01-01', '2024-01-02', '2024-01-03'],
278
+ 'text_col': ['abc', 'def', 'ghi']
279
+ })
280
+ date_cols = detect_date_columns(df)
281
+ assert 'date_col' in date_cols
282
+ assert 'text_col' not in date_cols
283
+
284
+ def test_no_date_columns(self):
285
+ """Test DataFrame without date columns."""
286
+ df = pd.DataFrame({
287
+ 'num': [1, 2, 3],
288
+ 'text': ['a', 'b', 'c']
289
+ })
290
+ date_cols = detect_date_columns(df)
291
+ assert len(date_cols) == 0
292
+
293
+
294
+ class TestCleanCurrencyColumn:
295
+ """Test suite for clean_currency_column function."""
296
+
297
+ def test_clean_dollar_signs(self):
298
+ """Test cleaning dollar signs."""
299
+ s = pd.Series(['$1,234.56', '$789.00', '$1,000.00'])
300
+ result = clean_currency_column(s)
301
+ expected = pd.Series([1234.56, 789.00, 1000.00])
302
+ pd.testing.assert_series_equal(result, expected)
303
+
304
+ def test_clean_spaces(self):
305
+ """Test cleaning spaces in currency."""
306
+ s = pd.Series(['$966 ', '$193 '])
307
+ result = clean_currency_column(s)
308
+ assert result[0] == 966.0
309
+ assert result[1] == 193.0
310
+
311
+ def test_handle_invalid_values(self):
312
+ """Test handling invalid currency values."""
313
+ s = pd.Series(['$100', 'invalid', '$200'])
314
+ result = clean_currency_column(s)
315
+ assert result[0] == 100.0
316
+ assert pd.isna(result[1])
317
+ assert result[2] == 200.0
318
+
319
+
320
+ class TestTruncateString:
321
+ """Test suite for truncate_string function."""
322
+
323
+ def test_truncate_long_string(self):
324
+ """Test truncating long string."""
325
+ text = "This is a very long text that needs truncation"
326
+ result = truncate_string(text, max_length=20)
327
+ assert len(result) == 20
328
+ assert result.endswith("...")
329
+
330
+ def test_no_truncation_needed(self):
331
+ """Test string that doesn't need truncation."""
332
+ text = "Short text"
333
+ result = truncate_string(text, max_length=20)
334
+ assert result == text
335
+
336
+ def test_custom_suffix(self):
337
+ """Test custom truncation suffix."""
338
+ text = "Long text here"
339
+ result = truncate_string(text, max_length=10, suffix=">>")
340
+ assert result.endswith(">>")
341
+
342
+
343
+ class TestGetMemoryUsage:
344
+ """Test suite for get_memory_usage function."""
345
+
346
+ def test_small_dataframe(self):
347
+ """Test memory usage of small DataFrame."""
348
+ df = pd.DataFrame({'a': [1, 2, 3]})
349
+ usage = get_memory_usage(df)
350
+ assert 'B' in usage or 'KB' in usage
351
+
352
+ def test_returns_string(self, sample_dataframe):
353
+ """Test that function returns string."""
354
+ usage = get_memory_usage(sample_dataframe)
355
+ assert isinstance(usage, str)
356
+
357
+
358
+ # ============================================================================
359
+ # EXPORTER TESTS
360
+ # ============================================================================
361
+
362
+ class TestCSVExporter:
363
+ """Test suite for CSVExporter class."""
364
+
365
+ def test_export_csv(self, sample_dataframe):
366
+ """Test exporting DataFrame to CSV."""
367
+ exporter = CSVExporter()
368
+ temp_path = tempfile.mktemp(suffix='.csv')
369
+
370
+ try:
371
+ result = exporter.export(sample_dataframe, temp_path)
372
+ assert result is True
373
+ assert os.path.exists(temp_path)
374
+
375
+ # Verify content
376
+ df_loaded = pd.read_csv(temp_path)
377
+ assert df_loaded.shape == sample_dataframe.shape
378
+ finally:
379
+ if os.path.exists(temp_path):
380
+ os.remove(temp_path)
381
+
382
+
383
+ class TestExcelExporter:
384
+ """Test suite for ExcelExporter class."""
385
+
386
+ def test_export_excel(self, sample_dataframe):
387
+ """Test exporting DataFrame to Excel."""
388
+ exporter = ExcelExporter()
389
+ temp_path = tempfile.mktemp(suffix='.xlsx')
390
+
391
+ try:
392
+ # Remove datetime column for Excel compatibility
393
+ df_test = sample_dataframe.drop('date', axis=1)
394
+ result = exporter.export(df_test, temp_path)
395
+ assert result is True
396
+ assert os.path.exists(temp_path)
397
+
398
+ # Verify content
399
+ df_loaded = pd.read_excel(temp_path)
400
+ assert df_loaded.shape == df_test.shape
401
+ finally:
402
+ if os.path.exists(temp_path):
403
+ os.remove(temp_path)
404
+
405
+
406
+ # ============================================================================
407
+ # CONFIG TESTS
408
+ # ============================================================================
409
+
410
+ class TestConfig:
411
+ """Test suite for Config class."""
412
+
413
+ def test_supported_formats_exists(self):
414
+ """Test that supported formats are defined."""
415
+ assert hasattr(Config, 'SUPPORTED_FILE_FORMATS')
416
+ assert len(Config.SUPPORTED_FILE_FORMATS) > 0
417
+
418
+ def test_display_settings_exist(self):
419
+ """Test that display settings are defined."""
420
+ assert hasattr(Config, 'MAX_DISPLAY_ROWS')
421
+ assert hasattr(Config, 'MAX_STRING_LENGTH')
422
+ assert hasattr(Config, 'DEFAULT_DECIMAL_PLACES')
423
+
424
+ def test_config_values_valid(self):
425
+ """Test that config values are valid."""
426
+ assert Config.MAX_DISPLAY_ROWS > 0
427
+ assert Config.MAX_STRING_LENGTH > 0
428
+ assert Config.DEFAULT_DECIMAL_PLACES >= 0
429
+
430
+
431
+ # ============================================================================
432
+ # RUN TESTS
433
+ # ============================================================================
434
+
435
+ if __name__ == "__main__":
436
+ pytest.main([__file__, "-v", "--tb=short"])
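For context between the two test files, a minimal usage sketch of the helpers exercised above (format_percentage and safe_divide come from utils.py, added later in this commit; the snippet assumes the repository root is on the import path):

import numpy as np
from utils import format_percentage, safe_divide

# Mirrors the assertions in TestFormatPercentage and TestSafeDivide above
print(format_percentage(0.456))         # '45.60%'
print(format_percentage(np.nan))        # 'N/A'
print(safe_divide(10, 2))               # 5.0
print(safe_divide(10, 0, default=0.0))  # 0.0 -- division by zero returns the default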
tests/test_visualizations.py ADDED
@@ -0,0 +1,665 @@
1
+ """
2
+ Unit Tests for Visualizations Module
3
+
4
+ Comprehensive tests for all visualization strategies and the visualization manager.
5
+
6
+ Author: Craig
7
+ Date: December 2024
8
+ """
9
+
10
+ import pytest
11
+ import pandas as pd
12
+ import numpy as np
13
+ import matplotlib.pyplot as plt
14
+ from pathlib import Path
15
+ import tempfile
16
+ import os
17
+
18
+ from visualizations import (
19
+ VisualizationStrategy, TimeSeriesPlot, DistributionPlot,
20
+ CategoryPlot, ScatterPlot, CorrelationHeatmap,
21
+ VisualizationManager, save_visualization
22
+ )
23
+
24
+
25
+ # ============================================================================
26
+ # FIXTURES
27
+ # ============================================================================
28
+
29
+ @pytest.fixture
30
+ def time_series_data():
31
+ """Create sample time series data."""
32
+ dates = pd.date_range('2024-01-01', periods=100, freq='D')
33
+ return pd.DataFrame({
34
+ 'date': dates,
35
+ 'sales': np.random.randint(100, 1000, 100),
36
+ 'revenue': np.random.uniform(1000, 5000, 100)
37
+ })
38
+
39
+
40
+ @pytest.fixture
41
+ def numerical_data():
42
+ """Create sample numerical data."""
43
+ np.random.seed(42)
44
+ return pd.DataFrame({
45
+ 'values': np.random.normal(100, 15, 1000),
46
+ 'scores': np.random.uniform(0, 100, 1000)
47
+ })
48
+
49
+
50
+ @pytest.fixture
51
+ def categorical_data():
52
+ """Create sample categorical data."""
53
+ return pd.DataFrame({
54
+ 'category': ['A', 'B', 'C', 'D', 'E'] * 20,
55
+ 'values': np.random.randint(10, 100, 100),
56
+ 'region': np.random.choice(['North', 'South', 'East', 'West'], 100)
57
+ })
58
+
59
+
60
+ @pytest.fixture
61
+ def scatter_data():
62
+ """Create sample scatter plot data."""
63
+ np.random.seed(42)
64
+ x = np.random.uniform(0, 100, 200)
65
+ y = 2 * x + np.random.normal(0, 10, 200)
66
+ return pd.DataFrame({
67
+ 'x_val': x,
68
+ 'y_val': y,
69
+ 'category': np.random.choice(['A', 'B', 'C'], 200),
70
+ 'size': np.random.uniform(10, 100, 200)
71
+ })
72
+
73
+
74
+ @pytest.fixture
75
+ def correlation_data():
76
+ """Create sample data for correlation."""
77
+ np.random.seed(42)
78
+ return pd.DataFrame({
79
+ 'var1': np.random.normal(50, 10, 100),
80
+ 'var2': np.random.normal(100, 20, 100),
81
+ 'var3': np.random.normal(75, 15, 100),
82
+ 'var4': np.random.normal(60, 12, 100)
83
+ })
84
+
85
+
86
+ # ============================================================================
87
+ # TIME SERIES PLOT TESTS
88
+ # ============================================================================
89
+
90
+ class TestTimeSeriesPlot:
91
+ """Test suite for TimeSeriesPlot class."""
92
+
93
+ def test_initialization(self):
94
+ """Test TimeSeriesPlot initialization."""
95
+ plot = TimeSeriesPlot()
96
+ assert plot is not None
97
+
98
+ def test_get_required_params(self):
99
+ """Test getting required parameters."""
100
+ plot = TimeSeriesPlot()
101
+ params = plot.get_required_params()
102
+ assert 'date_column' in params
103
+ assert 'value_column' in params
104
+
105
+ def test_create_matplotlib_basic(self, time_series_data):
106
+ """Test creating basic matplotlib time series plot."""
107
+ plot = TimeSeriesPlot()
108
+ fig = plot.create(time_series_data,
109
+ date_column='date',
110
+ value_column='sales',
111
+ backend='matplotlib')
112
+
113
+ assert fig is not None
114
+ assert hasattr(fig, 'savefig')
115
+ plt.close(fig)
116
+
117
+ def test_create_plotly_basic(self, time_series_data):
118
+ """Test creating basic plotly time series plot."""
119
+ plot = TimeSeriesPlot()
120
+ fig = plot.create(time_series_data,
121
+ date_column='date',
122
+ value_column='sales',
123
+ backend='plotly')
124
+
125
+ assert fig is not None
126
+ assert hasattr(fig, 'write_html')
127
+
128
+ def test_aggregation_sum(self, time_series_data):
129
+ """Test time series with sum aggregation."""
130
+ plot = TimeSeriesPlot()
131
+ fig = plot.create(time_series_data,
132
+ date_column='date',
133
+ value_column='sales',
134
+ aggregation='sum',
135
+ backend='matplotlib')
136
+
137
+ assert fig is not None
138
+ plt.close(fig)
139
+
140
+ def test_aggregation_mean(self, time_series_data):
141
+ """Test time series with mean aggregation."""
142
+ plot = TimeSeriesPlot()
143
+ fig = plot.create(time_series_data,
144
+ date_column='date',
145
+ value_column='sales',
146
+ aggregation='mean',
147
+ backend='matplotlib')
148
+
149
+ assert fig is not None
150
+ plt.close(fig)
151
+
152
+ def test_invalid_date_column(self, time_series_data):
153
+ """Test with invalid date column."""
154
+ plot = TimeSeriesPlot()
155
+ with pytest.raises(ValueError):
156
+ plot.create(time_series_data,
157
+ date_column='nonexistent',
158
+ value_column='sales')
159
+
160
+ def test_invalid_backend(self, time_series_data):
161
+ """Test with invalid backend."""
162
+ plot = TimeSeriesPlot()
163
+ with pytest.raises(ValueError, match="Unsupported backend"):
164
+ plot.create(time_series_data,
165
+ date_column='date',
166
+ value_column='sales',
167
+ backend='invalid')
168
+
169
+
170
+ # ============================================================================
171
+ # DISTRIBUTION PLOT TESTS
172
+ # ============================================================================
173
+
174
+ class TestDistributionPlot:
175
+ """Test suite for DistributionPlot class."""
176
+
177
+ def test_initialization(self):
178
+ """Test DistributionPlot initialization."""
179
+ plot = DistributionPlot()
180
+ assert plot is not None
181
+
182
+ def test_get_required_params(self):
183
+ """Test getting required parameters."""
184
+ plot = DistributionPlot()
185
+ params = plot.get_required_params()
186
+ assert 'column' in params
187
+
188
+ def test_create_histogram_matplotlib(self, numerical_data):
189
+ """Test creating histogram with matplotlib."""
190
+ plot = DistributionPlot()
191
+ fig = plot.create(numerical_data,
192
+ column='values',
193
+ plot_type='histogram',
194
+ backend='matplotlib')
195
+
196
+ assert fig is not None
197
+ plt.close(fig)
198
+
199
+ def test_create_box_matplotlib(self, numerical_data):
200
+ """Test creating box plot with matplotlib."""
201
+ plot = DistributionPlot()
202
+ fig = plot.create(numerical_data,
203
+ column='values',
204
+ plot_type='box',
205
+ backend='matplotlib')
206
+
207
+ assert fig is not None
208
+ plt.close(fig)
209
+
210
+ def test_create_violin_matplotlib(self, numerical_data):
211
+ """Test creating violin plot with matplotlib."""
212
+ plot = DistributionPlot()
213
+ fig = plot.create(numerical_data,
214
+ column='values',
215
+ plot_type='violin',
216
+ backend='matplotlib')
217
+
218
+ assert fig is not None
219
+ plt.close(fig)
220
+
221
+ def test_create_histogram_plotly(self, numerical_data):
222
+ """Test creating histogram with plotly."""
223
+ plot = DistributionPlot()
224
+ fig = plot.create(numerical_data,
225
+ column='values',
226
+ plot_type='histogram',
227
+ backend='plotly')
228
+
229
+ assert fig is not None
230
+
231
+ def test_custom_bins(self, numerical_data):
232
+ """Test histogram with custom bins."""
233
+ plot = DistributionPlot()
234
+ fig = plot.create(numerical_data,
235
+ column='values',
236
+ plot_type='histogram',
237
+ bins=50,
238
+ backend='matplotlib')
239
+
240
+ assert fig is not None
241
+ plt.close(fig)
242
+
243
+ def test_invalid_column(self, numerical_data):
244
+ """Test with invalid column."""
245
+ plot = DistributionPlot()
246
+ with pytest.raises(ValueError):
247
+ plot.create(numerical_data, column='nonexistent')
248
+
249
+ def test_invalid_plot_type(self, numerical_data):
250
+ """Test with invalid plot type."""
251
+ plot = DistributionPlot()
252
+ with pytest.raises(ValueError, match="Unsupported plot type"):
253
+ plot.create(numerical_data,
254
+ column='values',
255
+ plot_type='invalid',
256
+ backend='matplotlib')
257
+
258
+
259
+ # ============================================================================
260
+ # CATEGORY PLOT TESTS
261
+ # ============================================================================
262
+
263
+ class TestCategoryPlot:
264
+ """Test suite for CategoryPlot class."""
265
+
266
+ def test_initialization(self):
267
+ """Test CategoryPlot initialization."""
268
+ plot = CategoryPlot()
269
+ assert plot is not None
270
+
271
+ def test_get_required_params(self):
272
+ """Test getting required parameters."""
273
+ plot = CategoryPlot()
274
+ params = plot.get_required_params()
275
+ assert 'column' in params
276
+
277
+ def test_create_bar_matplotlib(self, categorical_data):
278
+ """Test creating bar chart with matplotlib."""
279
+ plot = CategoryPlot()
280
+ fig = plot.create(categorical_data,
281
+ column='category',
282
+ plot_type='bar',
283
+ backend='matplotlib')
284
+
285
+ assert fig is not None
286
+ plt.close(fig)
287
+
288
+ def test_create_pie_matplotlib(self, categorical_data):
289
+ """Test creating pie chart with matplotlib."""
290
+ plot = CategoryPlot()
291
+ fig = plot.create(categorical_data,
292
+ column='category',
293
+ plot_type='pie',
294
+ backend='matplotlib')
295
+
296
+ assert fig is not None
297
+ plt.close(fig)
298
+
299
+ def test_create_bar_plotly(self, categorical_data):
300
+ """Test creating bar chart with plotly."""
301
+ plot = CategoryPlot()
302
+ fig = plot.create(categorical_data,
303
+ column='category',
304
+ plot_type='bar',
305
+ backend='plotly')
306
+
307
+ assert fig is not None
308
+
309
+ def test_aggregation_sum(self, categorical_data):
310
+ """Test with sum aggregation."""
311
+ plot = CategoryPlot()
312
+ fig = plot.create(categorical_data,
313
+ column='category',
314
+ value_column='values',
315
+ aggregation='sum',
316
+ backend='matplotlib')
317
+
318
+ assert fig is not None
319
+ plt.close(fig)
320
+
321
+ def test_top_n_categories(self, categorical_data):
322
+ """Test showing only top N categories."""
323
+ plot = CategoryPlot()
324
+ fig = plot.create(categorical_data,
325
+ column='category',
326
+ top_n=3,
327
+ backend='matplotlib')
328
+
329
+ assert fig is not None
330
+ plt.close(fig)
331
+
332
+ def test_invalid_plot_type(self, categorical_data):
333
+ """Test with invalid plot type."""
334
+ plot = CategoryPlot()
335
+ with pytest.raises(ValueError, match="Unsupported plot type"):
336
+ plot.create(categorical_data,
337
+ column='category',
338
+ plot_type='invalid',
339
+ backend='matplotlib')
340
+
341
+
342
+ # ============================================================================
343
+ # SCATTER PLOT TESTS
344
+ # ============================================================================
345
+
346
+ class TestScatterPlot:
347
+ """Test suite for ScatterPlot class."""
348
+
349
+ def test_initialization(self):
350
+ """Test ScatterPlot initialization."""
351
+ plot = ScatterPlot()
352
+ assert plot is not None
353
+
354
+ def test_get_required_params(self):
355
+ """Test getting required parameters."""
356
+ plot = ScatterPlot()
357
+ params = plot.get_required_params()
358
+ assert 'x_column' in params
359
+ assert 'y_column' in params
360
+
361
+ def test_create_basic_matplotlib(self, scatter_data):
362
+ """Test creating basic scatter plot with matplotlib."""
363
+ plot = ScatterPlot()
364
+ fig = plot.create(scatter_data,
365
+ x_column='x_val',
366
+ y_column='y_val',
367
+ backend='matplotlib')
368
+
369
+ assert fig is not None
370
+ plt.close(fig)
371
+
372
+ def test_create_basic_plotly(self, scatter_data):
373
+ """Test creating basic scatter plot with plotly."""
374
+ plot = ScatterPlot()
375
+ fig = plot.create(scatter_data,
376
+ x_column='x_val',
377
+ y_column='y_val',
378
+ backend='plotly')
379
+
380
+ assert fig is not None
381
+
382
+ def test_with_color_column(self, scatter_data):
383
+ """Test scatter plot with color coding."""
384
+ plot = ScatterPlot()
385
+ fig = plot.create(scatter_data,
386
+ x_column='x_val',
387
+ y_column='y_val',
388
+ color_column='category',
389
+ backend='matplotlib')
390
+
391
+ assert fig is not None
392
+ plt.close(fig)
393
+
394
+ def test_with_size_column(self, scatter_data):
395
+ """Test scatter plot with size coding."""
396
+ plot = ScatterPlot()
397
+ fig = plot.create(scatter_data,
398
+ x_column='x_val',
399
+ y_column='y_val',
400
+ size_column='size',
401
+ backend='matplotlib')
402
+
403
+ assert fig is not None
404
+ plt.close(fig)
405
+
406
+ def test_with_trend_line(self, scatter_data):
407
+ """Test scatter plot with trend line."""
408
+ plot = ScatterPlot()
409
+ fig = plot.create(scatter_data,
410
+ x_column='x_val',
411
+ y_column='y_val',
412
+ show_trend=True,
413
+ backend='matplotlib')
414
+
415
+ assert fig is not None
416
+ plt.close(fig)
417
+
418
+ def test_invalid_columns(self, scatter_data):
419
+ """Test with invalid columns."""
420
+ plot = ScatterPlot()
421
+ with pytest.raises(ValueError):
422
+ plot.create(scatter_data,
423
+ x_column='nonexistent',
424
+ y_column='y_val')
425
+
426
+
427
+ # ============================================================================
428
+ # CORRELATION HEATMAP TESTS
429
+ # ============================================================================
430
+
431
+ class TestCorrelationHeatmap:
432
+ """Test suite for CorrelationHeatmap class."""
433
+
434
+ def test_initialization(self):
435
+ """Test CorrelationHeatmap initialization."""
436
+ plot = CorrelationHeatmap()
437
+ assert plot is not None
438
+
439
+ def test_get_required_params(self):
440
+ """Test getting required parameters."""
441
+ plot = CorrelationHeatmap()
442
+ params = plot.get_required_params()
443
+ assert isinstance(params, list)
444
+
445
+ def test_create_matplotlib(self, correlation_data):
446
+ """Test creating correlation heatmap with matplotlib."""
447
+ plot = CorrelationHeatmap()
448
+ fig = plot.create(correlation_data, backend='matplotlib')
449
+
450
+ assert fig is not None
451
+ plt.close(fig)
452
+
453
+ def test_create_plotly(self, correlation_data):
454
+ """Test creating correlation heatmap with plotly."""
455
+ plot = CorrelationHeatmap()
456
+ fig = plot.create(correlation_data, backend='plotly')
457
+
458
+ assert fig is not None
459
+
460
+ def test_with_specific_columns(self, correlation_data):
461
+ """Test heatmap with specific columns."""
462
+ plot = CorrelationHeatmap()
463
+ fig = plot.create(correlation_data,
464
+ columns=['var1', 'var2', 'var3'],
465
+ backend='matplotlib')
466
+
467
+ assert fig is not None
468
+ plt.close(fig)
469
+
470
+ def test_spearman_correlation(self, correlation_data):
471
+ """Test with Spearman correlation."""
472
+ plot = CorrelationHeatmap()
473
+ fig = plot.create(correlation_data,
474
+ method='spearman',
475
+ backend='matplotlib')
476
+
477
+ assert fig is not None
478
+ plt.close(fig)
479
+
480
+ def test_insufficient_columns(self):
481
+ """Test with insufficient numerical columns."""
482
+ df = pd.DataFrame({'col1': [1, 2, 3]})
483
+ plot = CorrelationHeatmap()
484
+
485
+ with pytest.raises(ValueError, match="at least 2 numerical columns"):
486
+ plot.create(df)
487
+
488
+
489
+ # ============================================================================
490
+ # VISUALIZATION MANAGER TESTS
491
+ # ============================================================================
492
+
493
+ class TestVisualizationManager:
494
+ """Test suite for VisualizationManager class."""
495
+
496
+ def test_initialization(self):
497
+ """Test VisualizationManager initialization."""
498
+ manager = VisualizationManager()
499
+ assert manager is not None
500
+ assert len(manager.strategies) >= 5
501
+
502
+ def test_get_available_visualizations(self):
503
+ """Test getting available visualizations."""
504
+ manager = VisualizationManager()
505
+ available = manager.get_available_visualizations()
506
+
507
+ assert 'time_series' in available
508
+ assert 'distribution' in available
509
+ assert 'category' in available
510
+ assert 'scatter' in available
511
+ assert 'correlation' in available
512
+
513
+ def test_create_time_series(self, time_series_data):
514
+ """Test creating time series through manager."""
515
+ manager = VisualizationManager()
516
+ fig = manager.create_visualization(
517
+ 'time_series',
518
+ time_series_data,
519
+ date_column='date',
520
+ value_column='sales',
521
+ backend='matplotlib'
522
+ )
523
+
524
+ assert fig is not None
525
+ plt.close(fig)
526
+
527
+ def test_create_distribution(self, numerical_data):
528
+ """Test creating distribution through manager."""
529
+ manager = VisualizationManager()
530
+ fig = manager.create_visualization(
531
+ 'distribution',
532
+ numerical_data,
533
+ column='values',
534
+ backend='matplotlib'
535
+ )
536
+
537
+ assert fig is not None
538
+ plt.close(fig)
539
+
540
+ def test_create_category(self, categorical_data):
541
+ """Test creating category plot through manager."""
542
+ manager = VisualizationManager()
543
+ fig = manager.create_visualization(
544
+ 'category',
545
+ categorical_data,
546
+ column='category',
547
+ backend='matplotlib'
548
+ )
549
+
550
+ assert fig is not None
551
+ plt.close(fig)
552
+
553
+ def test_create_scatter(self, scatter_data):
554
+ """Test creating scatter plot through manager."""
555
+ manager = VisualizationManager()
556
+ fig = manager.create_visualization(
557
+ 'scatter',
558
+ scatter_data,
559
+ x_column='x_val',
560
+ y_column='y_val',
561
+ backend='matplotlib'
562
+ )
563
+
564
+ assert fig is not None
565
+ plt.close(fig)
566
+
567
+ def test_create_correlation(self, correlation_data):
568
+ """Test creating correlation heatmap through manager."""
569
+ manager = VisualizationManager()
570
+ fig = manager.create_visualization(
571
+ 'correlation',
572
+ correlation_data,
573
+ backend='matplotlib'
574
+ )
575
+
576
+ assert fig is not None
577
+ plt.close(fig)
578
+
579
+ def test_unsupported_visualization_type(self, numerical_data):
580
+ """Test with unsupported visualization type."""
581
+ manager = VisualizationManager()
582
+
583
+ with pytest.raises(ValueError, match="Unsupported visualization type"):
584
+ manager.create_visualization('invalid_type', numerical_data)
585
+
586
+ def test_add_strategy(self):
587
+ """Test adding new strategy."""
588
+ manager = VisualizationManager()
589
+ initial_count = len(manager.strategies)
590
+
591
+ # Create mock strategy
592
+ class MockStrategy(VisualizationStrategy):
593
+ def create(self, df, **kwargs):
594
+ return None
595
+
596
+ def get_required_params(self):
597
+ return []
598
+
599
+ manager.add_strategy('mock', MockStrategy())
600
+ assert len(manager.strategies) == initial_count + 1
601
+ assert 'mock' in manager.get_available_visualizations()
602
+
603
+ def test_get_required_params(self):
604
+ """Test getting required params for visualization type."""
605
+ manager = VisualizationManager()
606
+ params = manager.get_required_params('time_series')
607
+
608
+ assert isinstance(params, list)
609
+ assert 'date_column' in params
610
+ assert 'value_column' in params
611
+
612
+ def test_get_required_params_invalid_type(self):
613
+ """Test getting params for invalid type."""
614
+ manager = VisualizationManager()
615
+
616
+ with pytest.raises(ValueError):
617
+ manager.get_required_params('invalid_type')
618
+
619
+
620
+ # ============================================================================
621
+ # SAVE VISUALIZATION TESTS
622
+ # ============================================================================
623
+
624
+ class TestSaveVisualization:
625
+ """Test suite for save_visualization function."""
626
+
627
+ def test_save_matplotlib_png(self, numerical_data):
628
+ """Test saving matplotlib figure as PNG."""
629
+ plot = DistributionPlot()
630
+ fig = plot.create(numerical_data, column='values', backend='matplotlib')
631
+
632
+ temp_path = tempfile.mktemp(suffix='.png')
633
+
634
+ try:
635
+ result = save_visualization(fig, temp_path, format='png')
636
+ assert result is True
637
+ assert os.path.exists(temp_path)
638
+ finally:
639
+ plt.close(fig)
640
+ if os.path.exists(temp_path):
641
+ os.remove(temp_path)
642
+
643
+ def test_save_matplotlib_pdf(self, numerical_data):
644
+ """Test saving matplotlib figure as PDF."""
645
+ plot = DistributionPlot()
646
+ fig = plot.create(numerical_data, column='values', backend='matplotlib')
647
+
648
+ temp_path = tempfile.mktemp(suffix='.pdf')
649
+
650
+ try:
651
+ result = save_visualization(fig, temp_path, format='pdf')
652
+ assert result is True
653
+ assert os.path.exists(temp_path)
654
+ finally:
655
+ plt.close(fig)
656
+ if os.path.exists(temp_path):
657
+ os.remove(temp_path)
658
+
659
+
660
+ # ============================================================================
661
+ # RUN TESTS
662
+ # ============================================================================
663
+
664
+ if __name__ == "__main__":
665
+ pytest.main([__file__, "-v", "--tb=short"])
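As a usage sketch of the manager tested above (VisualizationManager is defined in visualizations.py, added later in this commit; the column names and output file name are illustrative):

import numpy as np
import pandas as pd
from visualizations import VisualizationManager

df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=30, freq="D"),
    "sales": np.random.randint(100, 1000, 30),
})

manager = VisualizationManager()
# Same call pattern as TestVisualizationManager.test_create_time_series
fig = manager.create_visualization(
    "time_series", df,
    date_column="date", value_column="sales",
    aggregation="sum", backend="matplotlib",
)
fig.savefig("sales_over_time.png")  # a matplotlib Figure when backend='matplotlib'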
utils.py ADDED
@@ -0,0 +1,480 @@
1
+ """
2
+ Utility Module for Business Intelligence Dashboard
3
+
4
+ This module provides helper functions and utilities following SOLID principles.
5
+ Implements Single Responsibility Principle - each function has one clear purpose.
6
+
7
+ Author: Craig
8
+ Date: December 2024
9
+ """
10
+
11
+ import pandas as pd
12
+ import numpy as np
13
+ from pathlib import Path
14
+ from typing import Union, Optional, List, Any
15
+ import logging
16
+ from abc import ABC, abstractmethod
17
+
18
+ # Configure logging
19
+ logging.basicConfig(level=logging.INFO)
20
+ logger = logging.getLogger(__name__)
21
+
22
+
23
+ # ============================================================================
24
+ # INTERFACE SEGREGATION PRINCIPLE (ISP)
25
+ # Define specific interfaces for different validation types
26
+ # ============================================================================
27
+
28
+ class DataValidator(ABC):
29
+ """
30
+ Abstract base class for data validation.
31
+ Follows Interface Segregation Principle - clients depend only on methods they use.
32
+ """
33
+
34
+ @abstractmethod
35
+ def validate(self, data: Any) -> bool:
36
+ """
37
+ Validate the given data.
38
+
39
+ Args:
40
+ data: Data to validate
41
+
42
+ Returns:
43
+ bool: True if validation passes, False otherwise
44
+ """
45
+ pass
46
+
47
+
48
+ class FileValidator(DataValidator):
49
+ """
50
+ Validates file existence and format.
51
+ Follows Single Responsibility Principle - only handles file validation.
52
+ """
53
+
54
+ SUPPORTED_FORMATS = {'.csv', '.xlsx', '.xls', '.parquet', '.json', '.tsv'}
55
+
56
+ def validate(self, file_path: Union[str, Path]) -> bool:
57
+ """
58
+ Validate if file exists and has supported format.
59
+
60
+ Args:
61
+ file_path: Path to the file
62
+
63
+ Returns:
64
+ bool: True if file is valid, False otherwise
65
+
66
+ Raises:
67
+ FileNotFoundError: If file doesn't exist
68
+ ValueError: If file format is not supported
69
+ """
70
+ path = Path(file_path)
71
+
72
+ if not path.exists():
73
+ logger.error(f"File not found: {file_path}")
74
+ raise FileNotFoundError(f"File not found: {file_path}")
75
+
76
+ if path.suffix.lower() not in self.SUPPORTED_FORMATS:
77
+ logger.error(f"Unsupported format: {path.suffix}")
78
+ raise ValueError(
79
+ f"Unsupported file format: {path.suffix}. "
80
+ f"Supported formats: {', '.join(self.SUPPORTED_FORMATS)}"
81
+ )
82
+
83
+ logger.info(f"File validation passed: {file_path}")
84
+ return True
85
+
86
+
87
+ class DataFrameValidator(DataValidator):
88
+ """
89
+ Validates pandas DataFrame properties.
90
+ Follows Single Responsibility Principle - only handles DataFrame validation.
91
+ """
92
+
93
+ def validate(self, df: pd.DataFrame) -> bool:
94
+ """
95
+ Validate if DataFrame is valid and not empty.
96
+
97
+ Args:
98
+ df: DataFrame to validate
99
+
100
+ Returns:
101
+ bool: True if DataFrame is valid, False otherwise
102
+
103
+ Raises:
104
+ ValueError: If DataFrame is None or empty
105
+ """
106
+ if df is None:
107
+ logger.error("DataFrame is None")
108
+ raise ValueError("DataFrame cannot be None")
109
+
110
+ if not isinstance(df, pd.DataFrame):
111
+ logger.error(f"Expected DataFrame, got {type(df)}")
112
+ raise ValueError(f"Expected pandas DataFrame, got {type(df)}")
113
+
114
+ if df.empty:
115
+ logger.error("DataFrame is empty")
116
+ raise ValueError("DataFrame is empty")
117
+
118
+ logger.info(f"DataFrame validation passed: {df.shape[0]} rows, {df.shape[1]} columns")
119
+ return True
120
+
121
+
122
+ class ColumnValidator(DataValidator):
123
+ """
124
+ Validates column existence in DataFrame.
125
+ Follows Single Responsibility Principle - only handles column validation.
126
+ """
127
+
128
+ def validate(self, df: pd.DataFrame, columns: Union[str, List[str]]) -> bool:
129
+ """
130
+ Validate if specified columns exist in DataFrame.
131
+
132
+ Args:
133
+ df: DataFrame to check
134
+ columns: Column name(s) to validate
135
+
136
+ Returns:
137
+ bool: True if all columns exist, False otherwise
138
+
139
+ Raises:
140
+ ValueError: If any column doesn't exist
141
+ """
142
+ if isinstance(columns, str):
143
+ columns = [columns]
144
+
145
+ missing_columns = [col for col in columns if col not in df.columns]
146
+
147
+ if missing_columns:
148
+ logger.error(f"Missing columns: {missing_columns}")
149
+ raise ValueError(
150
+ f"Columns not found in DataFrame: {', '.join(missing_columns)}"
151
+ )
152
+
153
+ logger.info(f"Column validation passed: {columns}")
154
+ return True
155
+
156
+
157
+ # ============================================================================
158
+ # UTILITY FUNCTIONS
159
+ # These follow Single Responsibility Principle
160
+ # ============================================================================
161
+
162
+ def format_number(number: Union[int, float], decimals: int = 2) -> str:
163
+ """
164
+ Format a number for display with thousand separators.
165
+
166
+ Args:
167
+ number: Number to format
168
+ decimals: Number of decimal places
169
+
170
+ Returns:
171
+ str: Formatted number string
172
+
173
+ Example:
174
+ >>> format_number(1234567.89)
175
+ '1,234,567.89'
176
+ """
177
+ try:
178
+ if pd.isna(number):
179
+ return "N/A"
180
+
181
+ if isinstance(number, (int, np.integer)):
182
+ return f"{number:,}"
183
+
184
+ return f"{number:,.{decimals}f}"
185
+ except (ValueError, TypeError) as e:
186
+ logger.warning(f"Error formatting number {number}: {e}")
187
+ return str(number)
188
+
189
+
190
+ def format_percentage(value: float, decimals: int = 2) -> str:
191
+ """
192
+ Format a value as percentage.
193
+
194
+ Args:
195
+ value: Value to format (0.5 = 50%)
196
+ decimals: Number of decimal places
197
+
198
+ Returns:
199
+ str: Formatted percentage string
200
+
201
+ Example:
202
+ >>> format_percentage(0.456)
203
+ '45.60%'
204
+ """
205
+ try:
206
+ if pd.isna(value):
207
+ return "N/A"
208
+ return f"{value * 100:.{decimals}f}%"
209
+ except (ValueError, TypeError) as e:
210
+ logger.warning(f"Error formatting percentage {value}: {e}")
211
+ return str(value)
212
+
213
+
214
+ def safe_divide(numerator: float, denominator: float, default: float = 0.0) -> float:
215
+ """
216
+ Safely divide two numbers, returning default if division by zero.
217
+
218
+ Args:
219
+ numerator: Numerator value
220
+ denominator: Denominator value
221
+ default: Default value to return if division fails
222
+
223
+ Returns:
224
+ float: Result of division or default value
225
+
226
+ Example:
227
+ >>> safe_divide(10, 2)
228
+ 5.0
229
+ >>> safe_divide(10, 0, default=0)
230
+ 0.0
231
+ """
232
+ try:
233
+ if denominator == 0 or pd.isna(denominator):
234
+ return default
235
+ return numerator / denominator
236
+ except (ValueError, TypeError, ZeroDivisionError):
237
+ return default
238
+
239
+
240
+ def get_column_types(df: pd.DataFrame) -> dict:
241
+ """
242
+ Categorize DataFrame columns by data type.
243
+
244
+ Args:
245
+ df: DataFrame to analyze
246
+
247
+ Returns:
248
+ dict: Dictionary with keys 'numerical', 'categorical', 'datetime'
249
+
250
+ Example:
251
+ >>> df = pd.DataFrame({'age': [25, 30], 'name': ['Alice', 'Bob']})
252
+ >>> types = get_column_types(df)
253
+ >>> types['numerical']
254
+ ['age']
255
+ """
256
+ numerical = df.select_dtypes(include=[np.number]).columns.tolist()
257
+ categorical = df.select_dtypes(include=['object', 'category']).columns.tolist()
258
+ datetime = df.select_dtypes(include=['datetime64']).columns.tolist()
259
+
260
+ return {
261
+ 'numerical': numerical,
262
+ 'categorical': categorical,
263
+ 'datetime': datetime
264
+ }
265
+
266
+
267
+ def detect_date_columns(df: pd.DataFrame, sample_size: int = 100) -> List[str]:
268
+ """
269
+ Detect columns that might contain date strings.
270
+
271
+ Args:
272
+ df: DataFrame to analyze
273
+ sample_size: Number of rows to sample for detection
274
+
275
+ Returns:
276
+ List[str]: List of potential date column names
277
+ """
278
+ potential_date_cols = []
279
+
280
+ for col in df.select_dtypes(include=['object']).columns:
281
+ sample = df[col].dropna().head(sample_size)
282
+
283
+ if len(sample) == 0:
284
+ continue
285
+
286
+ # Try to parse as dates
287
+ try:
289
+ # If more than 50% parse successfully, consider it a date column
290
+ parsed = pd.to_datetime(sample, errors='coerce')
291
+ if parsed.notna().sum() / len(sample) > 0.5:
292
+ potential_date_cols.append(col)
293
+ except Exception:
294
+ continue
295
+
296
+ return potential_date_cols
297
+
298
+
299
+ def clean_currency_column(series: pd.Series) -> pd.Series:
300
+ """
301
+ Clean currency columns by removing symbols and converting to float.
302
+
303
+ Args:
304
+ series: Pandas Series with currency values
305
+
306
+ Returns:
307
+ pd.Series: Cleaned numeric series
308
+
309
+ Example:
310
+ >>> s = pd.Series(['$1,234.56', '$789.00'])
311
+ >>> clean_currency_column(s)
312
+ 0 1234.56
313
+ 1 789.00
314
+ dtype: float64
315
+ """
316
+ try:
317
+ # Remove currency symbols, commas, and spaces
318
+ cleaned = series.astype(str).str.replace(r'[$,€£¥\s]', '', regex=True)
319
+ return pd.to_numeric(cleaned, errors='coerce')
320
+ except Exception as e:
321
+ logger.warning(f"Error cleaning currency column: {e}")
322
+ return series
323
+
324
+
325
+ def truncate_string(text: str, max_length: int = 50, suffix: str = "...") -> str:
326
+ """
327
+ Truncate a string to maximum length.
328
+
329
+ Args:
330
+ text: Text to truncate
331
+ max_length: Maximum length
332
+ suffix: Suffix to add if truncated
333
+
334
+ Returns:
335
+ str: Truncated string
336
+
337
+ Example:
338
+ >>> truncate_string("This is a very long text", 10)
339
+ 'This is...'
340
+ """
341
+ if not isinstance(text, str):
342
+ text = str(text)
343
+
344
+ if len(text) <= max_length:
345
+ return text
346
+
347
+ return text[:max_length - len(suffix)] + suffix
348
+
349
+
350
+ def get_memory_usage(df: pd.DataFrame) -> str:
351
+ """
352
+ Get human-readable memory usage of DataFrame.
353
+
354
+ Args:
355
+ df: DataFrame to analyze
356
+
357
+ Returns:
358
+ str: Memory usage string (e.g., "2.5 MB")
359
+ """
360
+ memory_bytes = df.memory_usage(deep=True).sum()
361
+
362
+ for unit in ['B', 'KB', 'MB', 'GB']:
363
+ if memory_bytes < 1024.0:
364
+ return f"{memory_bytes:.2f} {unit}"
365
+ memory_bytes /= 1024.0
366
+
367
+ return f"{memory_bytes:.2f} TB"
368
+
369
+
370
+ # ============================================================================
371
+ # EXPORT UTILITIES
372
+ # These follow Single Responsibility Principle
373
+ # ============================================================================
374
+
375
+ class DataExporter(ABC):
376
+ """
377
+ Abstract base class for data export.
378
+ Follows Open/Closed Principle - open for extension, closed for modification.
379
+ """
380
+
381
+ @abstractmethod
382
+ def export(self, data: Any, filepath: Union[str, Path]) -> bool:
383
+ """
384
+ Export data to file.
385
+
386
+ Args:
387
+ data: Data to export
388
+ filepath: Destination file path
389
+
390
+ Returns:
391
+ bool: True if export successful, False otherwise
392
+ """
393
+ pass
394
+
395
+
396
+ class CSVExporter(DataExporter):
397
+ """
398
+ Export DataFrame to CSV format.
399
+ Follows Single Responsibility Principle.
400
+ """
401
+
402
+ def export(self, df: pd.DataFrame, filepath: Union[str, Path]) -> bool:
403
+ """
404
+ Export DataFrame to CSV file.
405
+
406
+ Args:
407
+ df: DataFrame to export
408
+ filepath: Destination CSV file path
409
+
410
+ Returns:
411
+ bool: True if export successful, False otherwise
412
+ """
413
+ try:
414
+ df.to_csv(filepath, index=False)
415
+ logger.info(f"Successfully exported to CSV: {filepath}")
416
+ return True
417
+ except Exception as e:
418
+ logger.error(f"Error exporting to CSV: {e}")
419
+ return False
420
+
421
+
422
+ class ExcelExporter(DataExporter):
423
+ """
424
+ Export DataFrame to Excel format.
425
+ Follows Single Responsibility Principle.
426
+ """
427
+
428
+ def export(self, df: pd.DataFrame, filepath: Union[str, Path]) -> bool:
429
+ """
430
+ Export DataFrame to Excel file.
431
+
432
+ Args:
433
+ df: DataFrame to export
434
+ filepath: Destination Excel file path
435
+
436
+ Returns:
437
+ bool: True if export successful, False otherwise
438
+ """
439
+ try:
440
+ df.to_excel(filepath, index=False, engine='openpyxl')
441
+ logger.info(f"Successfully exported to Excel: {filepath}")
442
+ return True
443
+ except Exception as e:
444
+ logger.error(f"Error exporting to Excel: {e}")
445
+ return False
446
+
447
+
448
+ # ============================================================================
449
+ # CONSTANTS
450
+ # Centralized configuration following DRY principle
451
+ # ============================================================================
452
+
453
+ class Config:
454
+ """
455
+ Configuration constants for the application.
456
+ Centralized configuration following Single Responsibility Principle.
457
+ """
458
+
459
+ # File formats
460
+ SUPPORTED_FILE_FORMATS = {'.csv', '.xlsx', '.xls', '.parquet', '.json', '.tsv'}
461
+
462
+ # Display settings
463
+ MAX_DISPLAY_ROWS = 100
464
+ MAX_STRING_LENGTH = 50
465
+ DEFAULT_DECIMAL_PLACES = 2
466
+
467
+ # Analysis settings
468
+ CORRELATION_THRESHOLD = 0.7
469
+ OUTLIER_ZSCORE_THRESHOLD = 3
470
+ MIN_SAMPLE_SIZE = 30
471
+
472
+ # Export settings
473
+ DEFAULT_EXPORT_FORMAT = 'csv'
474
+ EXPORT_TIMESTAMP_FORMAT = '%Y%m%d_%H%M%S'
475
+
476
+
477
+ if __name__ == "__main__":
478
+ # Example usage and testing
479
+ print("Utils module loaded successfully")
480
+ print(f"Supported formats: {Config.SUPPORTED_FILE_FORMATS}")
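A minimal sketch of how the validators and exporters above compose (the DataFrame and file name are illustrative; CSVExporter.export returns a bool and logs failures rather than raising):

import pandas as pd
from utils import DataFrameValidator, FileValidator, CSVExporter, get_memory_usage

df = pd.DataFrame({"region": ["North", "South"], "revenue": [1200.50, 980.00]})

DataFrameValidator().validate(df)   # raises ValueError if df is None or empty
print(get_memory_usage(df))         # human-readable size string (B/KB/MB/GB)

if CSVExporter().export(df, "filtered_data.csv"):
    FileValidator().validate("filtered_data.csv")  # checks existence and extension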
visualizations.py ADDED
@@ -0,0 +1,760 @@
1
+ """
2
+ Visualizations Module for Business Intelligence Dashboard
3
+
4
+ This module handles all data visualization operations using Strategy Pattern.
5
+ Supports multiple chart types with flexible rendering backends.
6
+
7
+ Author: Craig
8
+ Date: December 2024
9
+ """
10
+
11
+ import pandas as pd
12
+ import numpy as np
13
+ import matplotlib.pyplot as plt
14
+ import seaborn as sns
15
+ import plotly.express as px
16
+ import plotly.graph_objects as go
17
+ from typing import Union, List, Dict, Optional, Any, Tuple
18
+ from abc import ABC, abstractmethod
19
+ import logging
20
+ from pathlib import Path
21
+
22
+ from utils import ColumnValidator, DataFrameValidator, format_number, Config
23
+
24
+ # Configure logging
25
+ logging.basicConfig(level=logging.INFO)
26
+ logger = logging.getLogger(__name__)
27
+
28
+ # Set style for matplotlib
29
+ plt.style.use('seaborn-v0_8-darkgrid')
30
+ sns.set_palette("husl")
31
+
32
+
33
+ # ============================================================================
34
+ # STRATEGY PATTERN - Visualization Strategies
35
+ # Follows Open/Closed Principle and Strategy Pattern
36
+ # ============================================================================
37
+
38
+ class VisualizationStrategy(ABC):
39
+ """
40
+ Abstract base class for visualization strategies.
41
+ Follows Strategy Pattern - allows different visualization algorithms.
42
+ """
43
+
44
+ @abstractmethod
45
+ def create(self, df: pd.DataFrame, **kwargs) -> Any:
46
+ """
47
+ Create visualization.
48
+
49
+ Args:
50
+ df: DataFrame to visualize
51
+ **kwargs: Additional parameters for visualization
52
+
53
+ Returns:
54
+ Visualization object (matplotlib Figure or plotly Figure)
55
+ """
56
+ pass
57
+
58
+ @abstractmethod
59
+ def get_required_params(self) -> List[str]:
60
+ """
61
+ Get list of required parameters for this visualization.
62
+
63
+ Returns:
64
+ List of required parameter names
65
+ """
66
+ pass
67
+
68
+
69
+ # ============================================================================
70
+ # TIME SERIES VISUALIZATIONS
71
+ # ============================================================================
72
+
73
+ class TimeSeriesPlot(VisualizationStrategy):
74
+ """
75
+ Create time series line plots.
76
+ Follows Single Responsibility Principle - only handles time series plots.
77
+ """
78
+
79
+ def get_required_params(self) -> List[str]:
80
+ """Required parameters for time series plot."""
81
+ return ['date_column', 'value_column']
82
+
83
+ def create(self, df: pd.DataFrame, date_column: str, value_column: str,
84
+ title: str = "Time Series Plot",
85
+ aggregation: str = 'sum',
86
+ backend: str = 'matplotlib',
87
+ **kwargs) -> Any:
88
+ """
89
+ Create time series plot.
90
+
91
+ Args:
92
+ df: DataFrame with time series data
93
+ date_column: Column containing dates
94
+ value_column: Column containing values to plot
95
+ title: Plot title
96
+ aggregation: Aggregation method ('sum', 'mean', 'count', 'median')
97
+ backend: Visualization backend ('matplotlib' or 'plotly')
98
+ **kwargs: Additional plotting parameters
99
+
100
+ Returns:
101
+ matplotlib Figure or plotly Figure
102
+ """
103
+ # Validate inputs
104
+ DataFrameValidator().validate(df)
105
+ ColumnValidator().validate(df, [date_column, value_column])
106
+
107
+ # Prepare data
108
+ df_plot = df.copy()
109
+
110
+ # Ensure date column is datetime
111
+ if not pd.api.types.is_datetime64_any_dtype(df_plot[date_column]):
112
+ df_plot[date_column] = pd.to_datetime(df_plot[date_column], errors='coerce')
113
+
114
+ # Remove rows with NaT dates
115
+ df_plot = df_plot.dropna(subset=[date_column])
116
+
117
+ # Sort by date
118
+ df_plot = df_plot.sort_values(date_column)
119
+
120
+ # Apply aggregation if needed
121
+ if aggregation != 'none':
122
+ df_plot = self._apply_aggregation(df_plot, date_column, value_column, aggregation)
123
+
124
+ # Create visualization based on backend
125
+ if backend == 'matplotlib':
126
+ return self._create_matplotlib(df_plot, date_column, value_column, title, aggregation)
127
+ elif backend == 'plotly':
128
+ return self._create_plotly(df_plot, date_column, value_column, title, aggregation)
129
+ else:
130
+ raise ValueError(f"Unsupported backend: {backend}")
131
+
132
+ def _apply_aggregation(self, df: pd.DataFrame, date_column: str,
133
+ value_column: str, aggregation: str) -> pd.DataFrame:
134
+ """Apply aggregation to time series data."""
135
+ if aggregation == 'sum':
136
+ return df.groupby(date_column)[value_column].sum().reset_index()
137
+ elif aggregation == 'mean':
138
+ return df.groupby(date_column)[value_column].mean().reset_index()
139
+ elif aggregation == 'count':
140
+ return df.groupby(date_column)[value_column].count().reset_index()
141
+ elif aggregation == 'median':
142
+ return df.groupby(date_column)[value_column].median().reset_index()
143
+ else:
144
+ return df
145
+
146
+ def _create_matplotlib(self, df: pd.DataFrame, date_column: str,
147
+ value_column: str, title: str, aggregation: str):
148
+ """Create matplotlib time series plot."""
149
+ fig, ax = plt.subplots(figsize=(12, 6))
150
+
151
+ ax.plot(df[date_column], df[value_column], marker='o', linewidth=2, markersize=4)
152
+ ax.set_xlabel(date_column, fontsize=12)
153
+ ax.set_ylabel(f"{value_column} ({aggregation})", fontsize=12)
154
+ ax.set_title(title, fontsize=14, fontweight='bold')
155
+ ax.grid(True, alpha=0.3)
156
+
157
+ # Rotate x-axis labels
158
+ plt.xticks(rotation=45, ha='right')
159
+ plt.tight_layout()
160
+
161
+ logger.info(f"Created matplotlib time series plot: {title}")
162
+ return fig
163
+
164
+ def _create_plotly(self, df: pd.DataFrame, date_column: str,
165
+ value_column: str, title: str, aggregation: str):
166
+ """Create plotly time series plot."""
167
+ fig = px.line(df, x=date_column, y=value_column,
168
+ title=title,
169
+ labels={value_column: f"{value_column} ({aggregation})"})
170
+
171
+ fig.update_traces(mode='lines+markers')
172
+ fig.update_layout(
173
+ xaxis_title=date_column,
174
+ yaxis_title=f"{value_column} ({aggregation})",
175
+ hovermode='x unified',
176
+ template='plotly_white'
177
+ )
178
+
179
+ logger.info(f"Created plotly time series plot: {title}")
180
+ return fig
181
+
182
+
183
+ # ============================================================================
184
+ # DISTRIBUTION VISUALIZATIONS
185
+ # ============================================================================
186
+
187
+ class DistributionPlot(VisualizationStrategy):
188
+ """
189
+ Create distribution plots (histogram, box plot, violin plot).
190
+ Follows Single Responsibility Principle - only handles distribution plots.
191
+ """
192
+
193
+ def get_required_params(self) -> List[str]:
194
+ """Required parameters for distribution plot."""
195
+ return ['column']
196
+
197
+ def create(self, df: pd.DataFrame, column: str,
198
+ plot_type: str = 'histogram',
199
+ title: str = "Distribution Plot",
200
+ bins: int = 30,
201
+ backend: str = 'matplotlib',
202
+ **kwargs) -> Any:
203
+ """
204
+ Create distribution plot.
205
+
206
+ Args:
207
+ df: DataFrame with data
208
+ column: Column to visualize
209
+ plot_type: Type of plot ('histogram', 'box', 'violin')
210
+ title: Plot title
211
+ bins: Number of bins for histogram
212
+ backend: Visualization backend ('matplotlib' or 'plotly')
213
+ **kwargs: Additional plotting parameters
214
+
215
+ Returns:
216
+ matplotlib Figure or plotly Figure
217
+ """
218
+ # Validate inputs
219
+ DataFrameValidator().validate(df)
220
+ ColumnValidator().validate(df, column)
221
+
222
+ # Remove NaN values
223
+ df_plot = df[column].dropna()
224
+
225
+ if len(df_plot) == 0:
226
+ raise ValueError(f"No valid data in column '{column}'")
227
+
228
+ # Create visualization based on backend
229
+ if backend == 'matplotlib':
230
+ return self._create_matplotlib(df_plot, column, plot_type, title, bins)
231
+ elif backend == 'plotly':
232
+ return self._create_plotly(df_plot, column, plot_type, title, bins)
233
+ else:
234
+ raise ValueError(f"Unsupported backend: {backend}")
235
+
236
+ def _create_matplotlib(self, data: pd.Series, column: str,
237
+ plot_type: str, title: str, bins: int):
238
+ """Create matplotlib distribution plot."""
239
+ fig, ax = plt.subplots(figsize=(10, 6))
240
+
241
+ if plot_type == 'histogram':
242
+ ax.hist(data, bins=bins, edgecolor='black', alpha=0.7)
243
+ ax.set_ylabel('Frequency', fontsize=12)
244
+
245
+ elif plot_type == 'box':
246
+ ax.boxplot(data, vert=True)
247
+ ax.set_ylabel(column, fontsize=12)
248
+
249
+ elif plot_type == 'violin':
250
+ # Use seaborn for violin plot
251
+ sns.violinplot(y=data, ax=ax)
252
+ ax.set_ylabel(column, fontsize=12)
253
+ else:
254
+ raise ValueError(f"Unsupported plot type: {plot_type}")
255
+
256
+ ax.set_xlabel(column if plot_type == 'histogram' else '', fontsize=12)
257
+ ax.set_title(title, fontsize=14, fontweight='bold')
258
+ ax.grid(True, alpha=0.3, axis='y')
259
+
260
+ plt.tight_layout()
261
+ logger.info(f"Created matplotlib {plot_type} plot: {title}")
262
+ return fig
263
+
264
+ def _create_plotly(self, data: pd.Series, column: str,
265
+ plot_type: str, title: str, bins: int):
266
+ """Create plotly distribution plot."""
267
+ if plot_type == 'histogram':
268
+ fig = px.histogram(data, x=data.values, nbins=bins, title=title,
269
+ labels={'x': column, 'y': 'Frequency'})
270
+
271
+ elif plot_type == 'box':
272
+ fig = px.box(y=data.values, title=title, labels={'y': column})
273
+
274
+ elif plot_type == 'violin':
275
+ fig = px.violin(y=data.values, title=title, labels={'y': column})
276
+ else:
277
+ raise ValueError(f"Unsupported plot type: {plot_type}")
278
+
279
+ fig.update_layout(template='plotly_white')
280
+ logger.info(f"Created plotly {plot_type} plot: {title}")
281
+ return fig
282
+
283
+
284
+ # ============================================================================
285
+ # CATEGORY VISUALIZATIONS
286
+ # ============================================================================
287
+
288
+ class CategoryPlot(VisualizationStrategy):
289
+ """
290
+ Create category plots (bar chart, pie chart).
291
+ Follows Single Responsibility Principle - only handles category plots.
292
+ """
293
+
294
+ def get_required_params(self) -> List[str]:
295
+ """Required parameters for category plot."""
296
+ return ['column']
297
+
298
+ def create(self, df: pd.DataFrame, column: str,
299
+ value_column: Optional[str] = None,
300
+ plot_type: str = 'bar',
301
+ title: str = "Category Analysis",
302
+ aggregation: str = 'count',
303
+ top_n: Optional[int] = None,
304
+ backend: str = 'matplotlib',
305
+ **kwargs) -> Any:
306
+ """
307
+ Create category plot.
308
+
309
+ Args:
310
+ df: DataFrame with data
311
+ column: Categorical column to visualize
312
+ value_column: Optional value column for aggregation
313
+ plot_type: Type of plot ('bar' or 'pie')
314
+ title: Plot title
315
+ aggregation: Aggregation method ('count', 'sum', 'mean', 'median')
316
+ top_n: Show only top N categories
317
+ backend: Visualization backend ('matplotlib' or 'plotly')
318
+ **kwargs: Additional plotting parameters
319
+
320
+ Returns:
321
+ matplotlib Figure or plotly Figure
322
+ """
323
+ # Validate inputs
324
+ DataFrameValidator().validate(df)
325
+ ColumnValidator().validate(df, column)
326
+
327
+ if value_column:
328
+ ColumnValidator().validate(df, value_column)
329
+
330
+ # Prepare data
331
+ if value_column and aggregation != 'count':
332
+ # Aggregate by category
333
+ if aggregation == 'sum':
334
+ data = df.groupby(column)[value_column].sum()
335
+ elif aggregation == 'mean':
336
+ data = df.groupby(column)[value_column].mean()
337
+ elif aggregation == 'median':
338
+ data = df.groupby(column)[value_column].median()
339
+ else:
340
+ data = df[column].value_counts()
341
+ else:
342
+ # Simple count
343
+ data = df[column].value_counts()
344
+
345
+ # Get top N if specified
346
+ if top_n:
347
+ data = data.nlargest(top_n)
348
+
349
+ # Sort for better visualization
350
+ data = data.sort_values(ascending=False)
351
+
352
+ # Create visualization based on backend
353
+ if backend == 'matplotlib':
354
+ return self._create_matplotlib(data, column, plot_type, title, aggregation)
355
+ elif backend == 'plotly':
356
+ return self._create_plotly(data, column, plot_type, title, aggregation)
357
+ else:
358
+ raise ValueError(f"Unsupported backend: {backend}")
359
+
360
+ def _create_matplotlib(self, data: pd.Series, column: str,
361
+ plot_type: str, title: str, aggregation: str):
362
+ """Create matplotlib category plot."""
363
+ fig, ax = plt.subplots(figsize=(10, 6))
364
+
365
+ if plot_type == 'bar':
366
+ bars = ax.bar(range(len(data)), data.values, edgecolor='black', alpha=0.7)
367
+ ax.set_xticks(range(len(data)))
368
+ ax.set_xticklabels(data.index, rotation=45, ha='right')
369
+ ax.set_xlabel(column, fontsize=12)
370
+ ax.set_ylabel(f'Value ({aggregation})', fontsize=12)
371
+
372
+ # Add value labels on bars
373
+ for i, (idx, val) in enumerate(data.items()):
374
+ ax.text(i, val, format_number(val), ha='center', va='bottom')
375
+
376
+ elif plot_type == 'pie':
377
+ wedges, texts, autotexts = ax.pie(data.values, labels=data.index,
378
+ autopct='%1.1f%%', startangle=90)
379
+ # Make percentage text more readable
380
+ for autotext in autotexts:
381
+ autotext.set_color('white')
382
+ autotext.set_fontweight('bold')
383
+ else:
384
+ raise ValueError(f"Unsupported plot type: {plot_type}")
385
+
386
+ ax.set_title(title, fontsize=14, fontweight='bold')
387
+ plt.tight_layout()
388
+
389
+ logger.info(f"Created matplotlib {plot_type} plot: {title}")
390
+ return fig
391
+
392
+ def _create_plotly(self, data: pd.Series, column: str,
393
+ plot_type: str, title: str, aggregation: str):
394
+ """Create plotly category plot."""
395
+ if plot_type == 'bar':
396
+ fig = px.bar(x=data.index, y=data.values, title=title,
397
+ labels={'x': column, 'y': f'Value ({aggregation})'})
398
+ fig.update_traces(text=data.values, textposition='outside')
399
+
400
+ elif plot_type == 'pie':
401
+ fig = px.pie(values=data.values, names=data.index, title=title)
402
+ else:
403
+ raise ValueError(f"Unsupported plot type: {plot_type}")
404
+
405
+ fig.update_layout(template='plotly_white')
406
+ logger.info(f"Created plotly {plot_type} plot: {title}")
407
+ return fig
408
+
409
+
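+ # Usage sketch (illustrative only, not wired into the dashboard): assuming a frame
+ # with a hypothetical categorical 'country' column and numeric 'revenue' column,
+ # a top-10 revenue bar chart could be produced with:
+ #
+ #   fig = CategoryPlot().create(df, column='country', value_column='revenue',
+ #                               aggregation='sum', top_n=10,
+ #                               plot_type='bar', backend='plotly')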
410
+ # ============================================================================
411
+ # RELATIONSHIP VISUALIZATIONS
412
+ # ============================================================================
413
+
414
+ class ScatterPlot(VisualizationStrategy):
415
+ """
416
+ Create scatter plots to show relationships between variables.
417
+ Follows Single Responsibility Principle - only handles scatter plots.
418
+ """
419
+
420
+ def get_required_params(self) -> List[str]:
421
+ """Required parameters for scatter plot."""
422
+ return ['x_column', 'y_column']
423
+
424
+ def create(self, df: pd.DataFrame, x_column: str, y_column: str,
425
+ title: str = "Scatter Plot",
426
+ color_column: Optional[str] = None,
427
+ size_column: Optional[str] = None,
428
+ show_trend: bool = False,
429
+ backend: str = 'matplotlib',
430
+ **kwargs) -> Any:
431
+ """
432
+ Create scatter plot.
433
+
434
+ Args:
435
+ df: DataFrame with data
436
+ x_column: Column for x-axis
437
+ y_column: Column for y-axis
438
+ title: Plot title
439
+ color_column: Optional column for color coding
440
+ size_column: Optional column for point sizes
441
+ show_trend: Whether to show trend line
442
+ backend: Visualization backend ('matplotlib' or 'plotly')
443
+ **kwargs: Additional plotting parameters
444
+
445
+ Returns:
446
+ matplotlib Figure or plotly Figure
447
+ """
448
+ # Validate inputs
449
+ DataFrameValidator().validate(df)
450
+ ColumnValidator().validate(df, [x_column, y_column])
451
+
452
+ if color_column:
453
+ ColumnValidator().validate(df, color_column)
454
+ if size_column:
455
+ ColumnValidator().validate(df, size_column)
456
+
457
+ # Remove rows with NaN in required columns
458
+ required_cols = [x_column, y_column]
459
+ if color_column:
460
+ required_cols.append(color_column)
461
+ if size_column:
462
+ required_cols.append(size_column)
463
+
464
+ df_plot = df[required_cols].dropna()
465
+
466
+ if len(df_plot) == 0:
467
+ raise ValueError("No valid data after removing NaN values")
468
+
469
+ # Create visualization based on backend
470
+ if backend == 'matplotlib':
471
+ return self._create_matplotlib(df_plot, x_column, y_column, title,
472
+ color_column, size_column, show_trend)
473
+ elif backend == 'plotly':
474
+ return self._create_plotly(df_plot, x_column, y_column, title,
475
+ color_column, size_column, show_trend)
476
+ else:
477
+ raise ValueError(f"Unsupported backend: {backend}")
478
+
479
+ def _create_matplotlib(self, df: pd.DataFrame, x_column: str, y_column: str,
480
+ title: str, color_column: Optional[str],
481
+ size_column: Optional[str], show_trend: bool):
482
+ """Create matplotlib scatter plot."""
483
+ fig, ax = plt.subplots(figsize=(10, 6))
484
+
485
+ # Prepare scatter parameters
486
+ scatter_kwargs = {'alpha': 0.6, 'edgecolors': 'black', 'linewidth': 0.5}
487
+
488
+ if size_column:
489
+ scatter_kwargs['s'] = df[size_column]
490
+ else:
491
+ scatter_kwargs['s'] = 50
492
+
493
+ if color_column:
494
+ # Check if color column is categorical (string type)
495
+ if df[color_column].dtype == 'object' or isinstance(df[color_column].dtype, pd.CategoricalDtype):
496
+ # Convert categorical to numerical codes for matplotlib
497
+ categories = df[color_column].astype('category')
498
+ color_codes = categories.cat.codes
499
+ scatter = ax.scatter(df[x_column], df[y_column], c=color_codes,
500
+ cmap='viridis', **scatter_kwargs)
501
+ # Create custom legend
502
+ handles = []
503
+ for i, cat in enumerate(categories.cat.categories):
504
+ handles.append(plt.Line2D([0], [0], marker='o', color='w',
505
+ markerfacecolor=plt.cm.viridis(i / max(len(categories.cat.categories) - 1, 1)),  # match scatter's min-max normalization of category codes
506
+ markersize=8, label=cat))
507
+ ax.legend(handles=handles, title=color_column)
508
+ else:
509
+ # Numerical color column
510
+ scatter = ax.scatter(df[x_column], df[y_column], c=df[color_column],
511
+ cmap='viridis', **scatter_kwargs)
512
+ plt.colorbar(scatter, ax=ax, label=color_column)
513
+ else:
514
+ ax.scatter(df[x_column], df[y_column], **scatter_kwargs)
515
+
516
+ # Add trend line if requested
517
+ if show_trend:
518
+ z = np.polyfit(df[x_column], df[y_column], 1)
519
+ p = np.poly1d(z)
520
+ ax.plot(df[x_column], p(df[x_column]), "r--", alpha=0.8, label='Trend')
521
+ ax.legend()
522
+
523
+ ax.set_xlabel(x_column, fontsize=12)
524
+ ax.set_ylabel(y_column, fontsize=12)
525
+ ax.set_title(title, fontsize=14, fontweight='bold')
526
+ ax.grid(True, alpha=0.3)
527
+
528
+ plt.tight_layout()
529
+ logger.info(f"Created matplotlib scatter plot: {title}")
530
+ return fig
531
+
532
+ def _create_plotly(self, df: pd.DataFrame, x_column: str, y_column: str,
533
+ title: str, color_column: Optional[str],
534
+ size_column: Optional[str], show_trend: bool):
535
+ """Create plotly scatter plot."""
536
+ fig = px.scatter(df, x=x_column, y=y_column,
537
+ color=color_column, size=size_column,
538
+ title=title,
539
+ trendline='ols' if show_trend else None)  # OLS trendline requires the statsmodels package
540
+
541
+ fig.update_layout(template='plotly_white')
542
+ logger.info(f"Created plotly scatter plot: {title}")
543
+ return fig
544
+
545
+
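+ # Usage sketch (illustrative): with hypothetical 'price', 'rating' and 'city'
+ # columns, a colour-coded scatter with an OLS trend line could be requested with:
+ #
+ #   fig = ScatterPlot().create(df, x_column='price', y_column='rating',
+ #                              color_column='city', show_trend=True,
+ #                              backend='plotly')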
546
+ class CorrelationHeatmap(VisualizationStrategy):
547
+ """
548
+ Create correlation heatmap for numerical variables.
549
+ Follows Single Responsibility Principle - only handles correlation heatmaps.
550
+ """
551
+
552
+ def get_required_params(self) -> List[str]:
553
+ """Required parameters for correlation heatmap."""
554
+ return [] # Uses all numerical columns by default
555
+
556
+ def create(self, df: pd.DataFrame,
557
+ columns: Optional[List[str]] = None,
558
+ title: str = "Correlation Heatmap",
559
+ method: str = 'pearson',
560
+ backend: str = 'matplotlib',
561
+ **kwargs) -> Any:
562
+ """
563
+ Create correlation heatmap.
564
+
565
+ Args:
566
+ df: DataFrame with data
567
+ columns: Optional list of columns to include
568
+ title: Plot title
569
+ method: Correlation method ('pearson', 'spearman', 'kendall')
570
+ backend: Visualization backend ('matplotlib' or 'plotly')
571
+ **kwargs: Additional plotting parameters
572
+
573
+ Returns:
574
+ matplotlib Figure or plotly Figure
575
+ """
576
+ # Validate inputs
577
+ DataFrameValidator().validate(df)
578
+
579
+ # Select numerical columns
580
+ if columns:
581
+ ColumnValidator().validate(df, columns)
582
+ df_corr = df[columns].select_dtypes(include=[np.number])
583
+ else:
584
+ df_corr = df.select_dtypes(include=[np.number])
585
+
586
+ if df_corr.shape[1] < 2:
587
+ raise ValueError("Need at least 2 numerical columns for correlation heatmap")
588
+
589
+ # Calculate correlation
590
+ corr_matrix = df_corr.corr(method=method)
591
+
592
+ # Create visualization based on backend
593
+ if backend == 'matplotlib':
594
+ return self._create_matplotlib(corr_matrix, title)
595
+ elif backend == 'plotly':
596
+ return self._create_plotly(corr_matrix, title)
597
+ else:
598
+ raise ValueError(f"Unsupported backend: {backend}")
599
+
600
+ def _create_matplotlib(self, corr_matrix: pd.DataFrame, title: str):
601
+ """Create matplotlib correlation heatmap."""
602
+ fig, ax = plt.subplots(figsize=(10, 8))
603
+
604
+ # Create heatmap
605
+ sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm',
606
+ center=0, square=True, linewidths=1,
607
+ cbar_kws={"shrink": 0.8}, ax=ax)
608
+
609
+ ax.set_title(title, fontsize=14, fontweight='bold')
610
+ plt.tight_layout()
611
+
612
+ logger.info(f"Created matplotlib correlation heatmap: {title}")
613
+ return fig
614
+
615
+ def _create_plotly(self, corr_matrix: pd.DataFrame, title: str):
616
+ """Create plotly correlation heatmap."""
617
+ fig = px.imshow(corr_matrix,
618
+ text_auto='.2f',
619
+ color_continuous_scale='RdBu_r',
620
+ title=title,
621
+ aspect='auto')
622
+
623
+ fig.update_layout(template='plotly_white')
624
+ logger.info(f"Created plotly correlation heatmap: {title}")
625
+ return fig
626
+
627
+
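+ # Usage sketch (illustrative): a rank-based correlation view over all numeric
+ # columns, or over a chosen subset (column names below are hypothetical):
+ #
+ #   fig = CorrelationHeatmap().create(df, method='spearman', backend='matplotlib')
+ #   fig = CorrelationHeatmap().create(df, columns=['price', 'quantity', 'revenue'],
+ #                                     method='pearson', backend='plotly')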
628
+ # ============================================================================
629
+ # VISUALIZATION MANAGER
630
+ # Uses Strategy Pattern to manage different visualization types
631
+ # ============================================================================
632
+
633
+ class VisualizationManager:
634
+ """
635
+ Manager class for visualizations using Strategy Pattern.
636
+ Follows Open/Closed Principle - open for extension, closed for modification.
637
+ """
638
+
639
+ def __init__(self):
640
+ """Initialize VisualizationManager with all available strategies."""
641
+ self.strategies: Dict[str, VisualizationStrategy] = {
642
+ 'time_series': TimeSeriesPlot(),
643
+ 'distribution': DistributionPlot(),
644
+ 'category': CategoryPlot(),
645
+ 'scatter': ScatterPlot(),
646
+ 'correlation': CorrelationHeatmap()
647
+ }
648
+
649
+ def create_visualization(self, viz_type: str, df: pd.DataFrame, **kwargs) -> Any:
650
+ """
651
+ Create visualization using specified strategy.
652
+
653
+ Args:
654
+ viz_type: Type of visualization ('time_series', 'distribution', etc.)
655
+ df: DataFrame to visualize
656
+ **kwargs: Parameters specific to visualization type
657
+
658
+ Returns:
659
+ Visualization object
660
+
661
+ Raises:
662
+ ValueError: If visualization type is not supported
663
+ """
664
+ if viz_type not in self.strategies:
665
+ raise ValueError(
666
+ f"Unsupported visualization type: {viz_type}. "
667
+ f"Available types: {list(self.strategies.keys())}"
668
+ )
669
+
670
+ strategy = self.strategies[viz_type]
671
+ return strategy.create(df, **kwargs)
672
+
673
+ def add_strategy(self, name: str, strategy: VisualizationStrategy) -> None:
674
+ """
675
+ Add new visualization strategy.
676
+ Follows Open/Closed Principle - extend functionality without modifying existing code.
677
+
678
+ Args:
679
+ name: Name for the strategy
680
+ strategy: Visualization strategy instance
681
+ """
682
+ self.strategies[name] = strategy
683
+ logger.info(f"Added new visualization strategy: {name}")
684
+
685
+ def get_available_visualizations(self) -> List[str]:
686
+ """
687
+ Get list of available visualization types.
688
+
689
+ Returns:
690
+ List of visualization type names
691
+ """
692
+ return list(self.strategies.keys())
693
+
694
+ def get_required_params(self, viz_type: str) -> List[str]:
695
+ """
696
+ Get required parameters for a visualization type.
697
+
698
+ Args:
699
+ viz_type: Type of visualization
700
+
701
+ Returns:
702
+ List of required parameter names
703
+ """
704
+ if viz_type not in self.strategies:
705
+ raise ValueError(f"Unsupported visualization type: {viz_type}")
706
+
707
+ return self.strategies[viz_type].get_required_params()
708
+
709
+
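+ # Extension sketch (illustrative): a new chart type can be registered without
+ # modifying the manager itself. 'FunnelPlot' below is a hypothetical strategy,
+ # shown only to demonstrate add_strategy() and the Open/Closed Principle.
+ #
+ #   class FunnelPlot(VisualizationStrategy):
+ #       def get_required_params(self):
+ #           return ['stage_column', 'value_column']
+ #       def create(self, df, stage_column, value_column, **kwargs):
+ #           return px.funnel(df, x=value_column, y=stage_column)
+ #
+ #   manager = VisualizationManager()
+ #   manager.add_strategy('funnel', FunnelPlot())
+ #   fig = manager.create_visualization('funnel', df,
+ #                                      stage_column='stage', value_column='users')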
710
+ # ============================================================================
711
+ # UTILITY FUNCTIONS FOR SAVING VISUALIZATIONS
712
+ # ============================================================================
713
+
714
+ def save_visualization(fig: Any, filepath: Union[str, Path],
715
+ dpi: int = 300, format: str = 'png') -> bool:
716
+ """
717
+ Save visualization to file.
718
+
719
+ Args:
720
+ fig: Matplotlib or Plotly figure
721
+ filepath: Path to save file
722
+ dpi: DPI for raster formats
723
+ format: File format ('png', 'jpg', 'pdf', 'svg', 'html')
724
+
725
+ Returns:
726
+ bool: True if saved successfully
727
+ """
728
+ try:
729
+ filepath = Path(filepath)
730
+
731
+ # Handle matplotlib figures
732
+ if hasattr(fig, 'savefig'):
733
+ fig.savefig(filepath, dpi=dpi, bbox_inches='tight', format=format)
734
+ logger.info(f"Saved matplotlib figure to {filepath}")
735
+
736
+ # Handle plotly figures
737
+ elif hasattr(fig, 'write_image') or hasattr(fig, 'write_html'):
738
+ if format in ['png', 'jpg', 'pdf', 'svg']:
739
+ fig.write_image(filepath, format=format)
740
+ elif format == 'html':
741
+ fig.write_html(filepath)
742
+ else:
+ raise ValueError(f"Unsupported format for plotly figure: {format}")
+ logger.info(f"Saved plotly figure to {filepath}")
743
+
744
+ else:
745
+ raise ValueError("Unknown figure type")
746
+
747
+ return True
748
+
749
+ except Exception as e:
750
+ logger.error(f"Error saving visualization: {e}")
751
+ return False
752
+
753
+
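+ # Usage sketch (illustrative): matplotlib figures go to raster/vector formats,
+ # while Plotly figures can also be kept interactive as HTML. Static Plotly
+ # export additionally relies on the kaleido package.
+ #
+ #   save_visualization(fig, 'reports/revenue_by_country.png', dpi=150)
+ #   save_visualization(plotly_fig, 'reports/revenue_by_country.html', format='html')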
754
+ if __name__ == "__main__":
755
+ # Example usage
756
+ print("Visualizations module loaded successfully")
757
+
758
+ # Demonstrate available visualizations
759
+ manager = VisualizationManager()
760
+ print(f"Available visualizations: {manager.get_available_visualizations()}")