API Reference

Complete API documentation for the ARFF Format Converter v2.0 Python package.

Python API

ARFFConverter Class

Constructor

from arff_format_converter import ARFFConverter

converter = ARFFConverter(
    fast_mode=True,      # Skip validation for speed
    parallel=True,       # Use multiple cores
    use_polars=True,     # Use Polars for max performance
    memory_map=True,     # Enable memory mapping
    chunk_size=50000     # Chunk size for large files
)

Parameters

  • fast_mode (bool): Skip validation for maximum speed, default: False
  • parallel (bool): Enable multi-core processing, default: True
  • use_polars (bool): Use Polars for ultra-fast processing, default: True
  • memory_map (bool): Enable memory mapping for large files, default: False
  • chunk_size (int): Chunk size for processing large datasets, default: 10000
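
Constructed with no arguments, the converter simply uses the defaults listed above:

# Equivalent to the documented defaults:
# fast_mode=False, parallel=True, use_polars=True, memory_map=False, chunk_size=10000
converter = ARFFConverter()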

Methods

convert(input_file, output_dir, output_format)

Converts a single ARFF file to the specified output format.

  • input_file (Path|str): Path to ARFF file
  • output_dir (Path|str): Output directory path
  • output_format (str): Target format (csv, json, parquet, xlsx, xml, orc)
  • Returns: ConversionResult with timing and file info

from pathlib import Path

# Basic conversion
result = converter.convert(
    input_file=Path("data.arff"),
    output_dir=Path("output"),
    output_format="parquet"
)

print(f"Conversion completed in {result.duration:.2f}s")
print(f"Output file: {result.output_file}")
print(f"File size: {result.file_size_mb:.1f} MB")

batch_convert(input_files, output_dir, output_format, parallel=True)

Converts multiple ARFF files, optionally using parallel processing.

  • input_files (List[Path]): List of ARFF file paths
  • output_dir (Path|str): Output directory path
  • output_format (str): Target format
  • parallel (bool): Enable parallel batch processing
  • Returns: List[ConversionResult]

# Process entire directory
input_files = list(Path("data").glob("*.arff"))
results = converter.batch_convert(
    input_files=input_files,
    output_dir=Path("output"),
    output_format="parquet",
    parallel=True
)

print(f"Converted {len(results)} files successfully!")
for result in results:
    print(f"{result.input_file.name}: {result.duration:.2f}s")

benchmark(input_file, formats=None, iterations=3)

Benchmarks conversion performance across different output formats.

  • input_file (Path|str): Path to ARFF file for benchmarking
  • formats (List[str], optional): Formats to test, defaults to all
  • iterations (int): Number of benchmark iterations
  • Returns: Dict[str, BenchmarkResult]

# Benchmark all formats
results = converter.benchmark(
    input_file=Path("large_dataset.arff"),
    formats=["csv", "json", "parquet", "xlsx"],
    iterations=3
)

# View detailed results
for format_name, metrics in results.items():
    print(f"{format_name}: {metrics['duration']:.1f}ms, "
          f"{metrics['file_size_mb']:.1f}MB, "
          f"{metrics['speed_rating']}")

# Find fastest format
fastest = min(results.items(), key=lambda x: x[1].duration)
print(f"Fastest format: {fastest[0]} ({fastest[1].duration:.1f}ms)")

CLI Reference

Basic Usage

arff-format-converter [OPTIONS]

Options

  • --file, -f: Input ARFF file path (required)
  • --output, -o: Output directory path (required)
  • --format: Output format (csv, json, parquet, xlsx, xml, orc)
  • --fast: Enable fast mode (skip validation)
  • --parallel: Enable parallel processing
  • --chunk-size: Chunk size for large files (default: 10000)
  • --benchmark: Run performance benchmark
  • --info: Show supported formats and performance tips
  • --verbose, -v: Enable verbose output

Examples

# Basic conversion
arff-format-converter --file data.arff --output ./output --format csv

# High-performance mode
arff-format-converter --file data.arff --output ./output --format parquet --fast --parallel

# Large file processing
arff-format-converter --file large_data.arff --output ./output --format parquet --chunk-size 100000 --verbose

# Benchmark different formats
arff-format-converter --file data.arff --output ./benchmarks --benchmark

# Get format information
arff-format-converter --info

Performance Configuration

Ultra-Performance Mode

# Maximum speed configuration
converter = ARFFConverter(
    fast_mode=True,           # Skip validation
    parallel=True,            # Multi-core processing
    use_polars=True,          # Use Polars optimization
    memory_map=True,          # Enable memory mapping
    chunk_size=100000         # Large chunks
)

# For production workloads
result = converter.convert(
    input_file="large_dataset.arff",
    output_dir="./output",
    output_format="parquet"
)

Memory-Constrained Environments

# For limited memory systems
converter = ARFFConverter(
    fast_mode=False,          # Enable validation
    parallel=False,           # Single-threaded
    use_polars=False,         # Use pandas only
    chunk_size=5000          # Smaller chunks
)

# Gentle processing
result = converter.convert(
    input_file="data.arff",
    output_dir="./output",
    output_format="csv"
)

Data Types & Classes

ConversionResult

@dataclass
class ConversionResult:
    input_file: Path
    output_file: Path
    output_format: str
    duration: float          # Conversion time in seconds
    file_size_mb: float     # Output file size in MB
    rows_processed: int     # Number of data rows
    success: bool           # Conversion success status
    error_message: str      # Error details if failed
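
The result can be inspected before any downstream use of the output file. A minimal sketch, continuing from the constructor example above and assuming convert() returns a ConversionResult with success=False and a populated error_message on failure rather than raising:

result = converter.convert(
    input_file=Path("data.arff"),
    output_dir=Path("output"),
    output_format="csv"
)

# Check the documented fields before trusting the output file
if result.success:
    print(f"Wrote {result.rows_processed} rows to {result.output_file} "
          f"({result.file_size_mb:.1f} MB in {result.duration:.2f}s)")
else:
    print(f"Conversion of {result.input_file} failed: {result.error_message}")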

BenchmarkResult

@dataclass
class BenchmarkResult:
    format_name: str
    duration: float         # Average duration in milliseconds
    file_size_mb: float    # Output file size in MB
    compression_ratio: float # Compression vs original
    speed_rating: str      # Performance rating (Blazing, Ultra Fast, etc.)
    iterations: int        # Number of benchmark runs
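
Because benchmark() returns a mapping of format name to BenchmarkResult, a summary can be built directly from these fields. A minimal sketch, continuing from the earlier examples and using only the attributes documented above:

results = converter.benchmark(input_file=Path("data.arff"), iterations=3)

# Sort formats by average duration; show size and compression alongside speed
for name, r in sorted(results.items(), key=lambda kv: kv[1].duration):
    print(f"{name:>8}: {r.duration:8.1f} ms over {r.iterations} runs, "
          f"{r.file_size_mb:6.1f} MB, compression {r.compression_ratio:.2f}x, "
          f"rated {r.speed_rating}")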

⚡ Performance Tips

  • Use fast_mode=True for production workloads (20-30% speed gain)
  • Enable parallel=True for multi-core systems (2-4x speed gain)
  • Choose Parquet format for best overall performance and compression
  • Use memory_map=True for files larger than 1GB
  • Increase chunk_size for better performance with large datasets

⚠️ Important Notes

  • fast_mode=True skips data validation - use only with trusted inputs
  • Large chunk_size values require more memory
  • Some optimizations may not be available on all systems
  • Always benchmark your specific use case for optimal settings