API Reference
Complete API documentation for the ARFF Format Converter v2.0 Python package.
Python API
ARFFConverter Class
Constructor
from arff_format_converter import ARFFConverter
converter = ARFFConverter(
    fast_mode=True,       # Skip validation for speed
    parallel=True,        # Use multiple cores
    use_polars=True,      # Use Polars for max performance
    memory_map=True,      # Enable memory mapping
    chunk_size=50000      # Chunk size for large files
)
Parameters
- fast_mode (bool): Skip validation for maximum speed. Default: False
- parallel (bool): Enable multi-core processing. Default: True
- use_polars (bool): Use Polars for ultra-fast processing. Default: True
- memory_map (bool): Enable memory mapping for large files. Default: False
- chunk_size (int): Chunk size for processing large datasets. Default: 10000
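Since every parameter has a default, a converter can be created with no arguments and individual options overridden via keyword arguments; a minimal sketch based on the defaults listed above:
from arff_format_converter import ARFFConverter

# All defaults: validation on, parallel Polars processing, 10000-row chunks
converter = ARFFConverter()

# Override only what you need; unspecified options keep their defaults
big_file_converter = ARFFConverter(memory_map=True, chunk_size=100000)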
Methods
convert(input_file, output_dir, output_format)
Converts an ARFF file to the specified format with optimal performance.
- input_file (Path | str): Path to the ARFF file
- output_dir (Path | str): Output directory path
- output_format (str): Target format (csv, json, parquet, xlsx, xml, orc)
- Returns: ConversionResult with timing and file info
from pathlib import Path
# Basic conversion
result = converter.convert(
    input_file=Path("data.arff"),
    output_dir=Path("output"),
    output_format="parquet"
)

print(f"Conversion completed in {result.duration:.2f}s")
print(f"Output file: {result.output_file}")
print(f"File size: {result.file_size_mb:.1f} MB")
batch_convert(input_files, output_dir, output_format, parallel=True)
Converts multiple ARFF files efficiently with parallel processing.
- input_files (List[Path]): List of ARFF file paths
- output_dir (Path | str): Output directory path
- output_format (str): Target format
- parallel (bool): Enable parallel batch processing
- Returns: List[ConversionResult]
# Process entire directory
input_files = list(Path("data").glob("*.arff"))
results = converter.batch_convert(
    input_files=input_files,
    output_dir=Path("output"),
    output_format="parquet",
    parallel=True
)

print(f"Converted {len(results)} files successfully!")
for result in results:
    print(f"{result.input_file.name}: {result.duration:.2f}s")
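Because batch_convert returns a list of ConversionResult objects (documented under Data Types & Classes below), failed files can be picked out afterward; a minimal sketch assuming the success and error_message fields shown there:
# Separate failed conversions from successful ones using the result fields
failed = [r for r in results if not r.success]
for r in failed:
    print(f"{r.input_file.name} failed: {r.error_message}")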
benchmark(input_file, formats=None, iterations=3)
Benchmarks conversion performance across different formats.
- input_file (Path | str): Path to the ARFF file used for benchmarking
- formats (List[str], optional): Formats to test; defaults to all
- iterations (int): Number of benchmark iterations
- Returns: Dict[str, BenchmarkResult]
# Benchmark all formats
results = converter.benchmark(
    input_file=Path("large_dataset.arff"),
    formats=["csv", "json", "parquet", "xlsx"],
    iterations=3
)

# View detailed results
for format_name, metrics in results.items():
    print(f"{format_name}: {metrics.duration:.1f}ms, "
          f"{metrics.file_size_mb:.1f}MB, "
          f"{metrics.speed_rating}")

# Find fastest format
fastest = min(results.items(), key=lambda x: x[1].duration)
print(f"Fastest format: {fastest[0]} ({fastest[1].duration:.1f}ms)")
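The benchmark output can feed straight back into convert(); for example, converting with whichever format came out fastest (a sketch reusing the fastest tuple from the example above):
# Convert using the fastest format found by the benchmark above
result = converter.convert(
    input_file=Path("large_dataset.arff"),
    output_dir=Path("output"),
    output_format=fastest[0]
)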
CLI Reference
Basic Usage
arff-format-converter [OPTIONS]
Options
- --file, -f: Input ARFF file path (required)
- --output, -o: Output directory path (required)
- --format: Output format (csv, json, parquet, xlsx, xml, orc)
- --fast: Enable fast mode (skip validation)
- --parallel: Enable parallel processing
- --chunk-size: Chunk size for large files (default: 10000)
- --benchmark: Run performance benchmark
- --info: Show supported formats and performance tips
- --verbose, -v: Enable verbose output
Examples
# Basic conversion
arff-format-converter --file data.arff --output ./output --format csv
# High-performance mode
arff-format-converter --file data.arff --output ./output --format parquet --fast --parallel
# Large file processing
arff-format-converter --file large_data.arff --output ./output --format parquet --chunk-size 100000 --verbose
# Benchmark different formats
arff-format-converter --file data.arff --output ./benchmarks --benchmark
# Get format information
arff-format-converter --info
Performance Configuration
Ultra-Performance Mode
# Maximum speed configuration
converter = ARFFConverter(
    fast_mode=True,       # Skip validation
    parallel=True,        # Multi-core processing
    use_polars=True,      # Use Polars optimization
    memory_map=True,      # Enable memory mapping
    chunk_size=100000     # Large chunks
)

# For production workloads
result = converter.convert(
    input_file="large_dataset.arff",
    output_dir="./output",
    output_format="parquet"
)
Memory-Constrained Environments
# For limited memory systems
converter = ARFFConverter(
    fast_mode=False,      # Enable validation
    parallel=False,       # Single-threaded
    use_polars=False,     # Use pandas only
    chunk_size=5000       # Smaller chunks
)

# Gentle processing
result = converter.convert(
    input_file="data.arff",
    output_dir="./output",
    output_format="csv"
)
Data Types & Classes
ConversionResult
@dataclass
class ConversionResult:
    input_file: Path
    output_file: Path
    output_format: str
    duration: float          # Conversion time in seconds
    file_size_mb: float      # Output file size in MB
    rows_processed: int      # Number of data rows
    success: bool            # Conversion success status
    error_message: str       # Error details if failed
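A typical way to consume these fields after a single conversion; a minimal sketch whose field access follows the dataclass above:
result = converter.convert("data.arff", "./output", "csv")
if result.success:
    print(f"Processed {result.rows_processed} rows into {result.output_file} "
          f"({result.file_size_mb:.1f} MB)")
else:
    print(f"Conversion failed: {result.error_message}")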
BenchmarkResult
@dataclass
class BenchmarkResult:
    format_name: str
    duration: float           # Average duration in milliseconds
    file_size_mb: float       # Output file size in MB
    compression_ratio: float  # Compression vs original
    speed_rating: str         # Performance rating (Blazing, Ultra Fast, etc.)
    iterations: int           # Number of benchmark runs
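Because benchmark() returns these objects keyed by format name, results can be ranked on any field; for instance, ordering by output size (a sketch reusing the results dict from the benchmark example above):
# Rank formats by output size, smallest first
by_size = sorted(results.values(), key=lambda r: r.file_size_mb)
for r in by_size:
    print(f"{r.format_name}: {r.file_size_mb:.1f} MB, "
          f"compression ratio {r.compression_ratio:.2f}")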
⚡ Performance Tips
- Use fast_mode=True for production workloads (20-30% speed gain)
- Enable parallel=True for multi-core systems (2-4x speed gain)
- Choose the Parquet format for the best overall performance and compression
- Use memory_map=True for files larger than 1 GB
- Increase chunk_size for better performance with large datasets
⚠️ Important Notes
- fast_mode=True skips data validation; use only with trusted inputs
- Large chunk_size values require more memory
- Some optimizations may not be available on all systems
- Always benchmark your specific use case for optimal settings