✅ Title: Adopting Apache Iceberg for Our Data Lakehouse - Part 1: Why We Chose Iceberg
1. Introduction
In the cybersecurity domain, effective threat analysis requires access to diverse forms of data. S2W has long been collecting and analyzing various data types from multiple sources, including dark web forums and social media platforms. Since most of this data is acquired in unstructured formats, it must be systematically processed and structured before analysts can derive value from it.
This article presents the design and implementation process of a system built to support real-time and parallel analysis of data contained within large compressed archives.
2. Existing Structure and Emerging Requirements
2-1. Characteristics of Multi-Gigabyte Archive Files
Data shared via dark web and deep web sources frequently comes in the form of large compressed archive files. These files often contain leaked credentials, system breach logs, or re-packaged data dumps and exhibit the following characteristics:
- Complex nested folder structures
- Files with mixed extensions (e.g., .jpg, .txt, .docx) in a single archive
- Specific folder or file naming patterns
- Encrypted or DRM-protected content requiring contextual metadata for access
2-2. Initial Data Lake Design for Raw Data Preservation
(ETL vs ELT, https://aws.amazon.com/compare/the-difference-between-etl-and-elt/?nc1=h_ls)
Given the high likelihood that distributed datasets contain sensitive or leaked cybersecurity-relevant information, preserving the original state of the data is critical. However, the structure and format of these large compressed archives pose challenges. If we attempted real-time transformation and selection at the point of ingestion, there would be a significant risk of data loss.
To mitigate this, we adopted an ELT (Extract, Load, Transform) model rather than the traditional ETL approach—prioritizing preservation of raw data over immediate transformation.
2-3. Data Structuring Process
(Data Structuring Process for Large-Scale Compressed Files)
Structured data derived from large compressed files is indexed into our engine using the following three-step procedure:
1. Raw Data Ingestion: Archive files are stored in their original form along with contextual metadata (e.g., file type, password, source path).
2. Data Extraction: Selected information is parsed and structured using predefined modules.
3. Indexing & Access: Structured outputs are indexed and made accessible to applications and analysts.
This pipeline was initially designed to extract relevant data at ingestion time using a predefined transformation module.
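The raw ingestion step (step 1) can be illustrated with a small sketch. This is not the pipeline described in the article: the `ArchiveMetadata` schema, the `ingest_raw` function, the content-hash file naming, and the JSON sidecar layout are all hypothetical choices made for illustration. It only shows the idea of storing the archive byte-for-byte and keeping its contextual metadata beside it.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ArchiveMetadata:
    """Contextual metadata kept next to a raw archive (hypothetical schema)."""
    file_type: str            # e.g. "zip" or "7z"
    source_path: str          # where the archive was collected from
    password: Optional[str]   # password needed to open the archive, if any
    sha256: str               # content hash (an addition of this sketch)


def ingest_raw(archive: Path, lake_dir: Path, source_path: str,
               password: Optional[str] = None) -> Path:
    """Copy the archive unchanged and write a JSON sidecar beside it."""
    data = archive.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    lake_dir.mkdir(parents=True, exist_ok=True)

    # The raw bytes are stored untouched; only the file name changes.
    target = lake_dir / f"{digest}{archive.suffix}"
    target.write_bytes(data)

    meta = ArchiveMetadata(
        file_type=archive.suffix.lstrip("."),
        source_path=source_path,
        password=password,
        sha256=digest,
    )
    sidecar = lake_dir / f"{digest}.meta.json"
    sidecar.write_text(json.dumps(asdict(meta), indent=2))
    return target
```

Because nothing is transformed at this point, later extraction modules can always be re-run against the same untouched input, which is the property the ELT model is meant to preserve.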
2-4. Evolving System Requirements
Selective extraction of only analysis-relevant files proved highly efficient in terms of both processing speed and storage usage. It minimized dependency on high-cost indexing systems while still supporting relevant analytics workflows.
However, as internal service sophistication progressed, new demands on our data pipeline emerged:
- Scalability of Analysis Modules: As our analysis engines evolved, so did their ability to detect patterns, parse new file types, and improve data extraction. This evolution rendered earlier structured datasets obsolete, requiring re-analysis using updated modules. Since our existing architecture required decompressing the entire archive each time, performance was significantly degraded due to I/O overhead.
- Access to Original Files: Structured datasets often omit contextual details that analysts need. To retrieve original content, we had to manually extract and deliver the full compressed files, resulting in poor accessibility and response time.
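To make the cost of full decompression concrete, the sketch below shows the alternative both demands point toward: listing an archive's contents and reading one member without unpacking the rest. It uses Python's standard `zipfile` module purely as an illustration. ZIP keeps a central directory that makes this possible, whereas formats such as solid 7z or tar.gz generally do not allow cheap per-member access. The helper names and the demo archive are hypothetical.

```python
import io
import zipfile
from typing import List


def list_members(archive_bytes: bytes, suffix: str) -> List[str]:
    """Return member names ending with `suffix`, using only the ZIP index."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return [name for name in zf.namelist() if name.endswith(suffix)]


def read_member(archive_bytes: bytes, name: str) -> bytes:
    """Decompress a single member; the other members are left untouched."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.read(name)


def build_demo_archive() -> bytes:
    """Build a small in-memory ZIP with mixed file types for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("logs/a.log", "login ok\n")
        zf.writestr("docs/readme.txt", "hello\n")
        zf.writestr("logs/nested/b.log", "login failed\n")
    return buf.getvalue()


demo = build_demo_archive()
log_names = list_members(demo, ".log")          # only .log members
one_log = read_member(demo, "logs/nested/b.log")  # one file, no full unpack
```

When the archive format does not support this kind of access, every re-analysis or file request pays for a full decompression, which is the I/O overhead described above.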
2-5. Small File Problem
To improve accessibility and support use cases requiring downloads of individual files, we considered extracting all content and storing it as separate files.
This approach has benefits:
1. Improved File-Level Accessibility: Analysts can quickly retrieve specific file types (e.g., PDFs only).
2. Simplified Analysis Pipelines: Eliminates the need for archive decompression logic in downstream applications.
However, simply extracting and storing all files posed technical challenges, particularly the well-known Small File Problem. This issue affects Hadoop, S3 object storage, ext4 filesystems, and other environments where large volumes of small files overwhelm metadata capacity.
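For example, in erasure-coded object storage systems like MinIO, each shard of an object is stored together with metadata identifying the object it belongs to. Because this metadata is a roughly fixed cost per shard, its total volume grows in proportion to the file count, and for very small files it can exceed the size of the data itself.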
In a 4:2 EC configuration, every object is split into four data and two parity shards—each with its own metadata. Storing 100,000 files as 1KB objects results in roughly 100MB of actual data but ~153MB of metadata.
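The figures above can be reproduced with simple arithmetic. The sketch below assumes about 256 bytes of metadata per shard and decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes). The 256-byte value is back-calculated from the numbers in this article and is not a measured or documented MinIO figure.

```python
from typing import Tuple


def metadata_overhead(num_objects: int, object_size_bytes: int,
                      data_shards: int, parity_shards: int,
                      meta_bytes_per_shard: int) -> Tuple[int, int]:
    """Return (total data bytes, total metadata bytes) for small objects."""
    shards_per_object = data_shards + parity_shards
    data_bytes = num_objects * object_size_bytes
    # Every shard of every object carries its own metadata record.
    meta_bytes = num_objects * shards_per_object * meta_bytes_per_shard
    return data_bytes, meta_bytes


# 100,000 objects of 1 KB each in a 4:2 configuration.
# meta_bytes_per_shard=256 is an assumption inferred from the article's totals.
data_bytes, meta_bytes = metadata_overhead(100_000, 1_000, 4, 2, 256)
print(f"data: {data_bytes / 1e6:.1f} MB, metadata: {meta_bytes / 1e6:.1f} MB")
# prints "data: 100.0 MB, metadata: 153.6 MB"
```

The metadata term does not depend on object size at all, so the ratio gets worse as files get smaller and improves only when many small files are combined into fewer, larger objects.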
Such metadata overhead leads to inode exhaustion, degraded IOPS, and increased cache usage—ultimately harming scalability.
We therefore determined that simply extracting all files into storage would not provide a scalable solution.
3. Architecture Redesign and Technology Stack Evaluation
To address the emerging requirements, our redesigned architecture had to ensure distributed scalability, real-time query support, and flexible indexing—all without sacrificing access to raw data. We summarized our architectural goals as follows:
📌 Key Architectural Requirements
| Requirement | Details | Example |
|---|---|---|
| Real-time access to raw files | Support file listing and partial reads (seek) without decompressing archives | Extract specific .log files from a .7z archive containing thousands of files |
| SQL queryability | Run structured queries without preprocessing or manual transformation | Query CSV dump: SELECT * FROM dump WHERE email LIKE '%@google.com' |
| Distributed processing integration | Compatible with Spark, Flink, and other distributed engines | Parallel extraction of compressed files across hundreds of workers |
| Partial indexing support | Index only necessary attributes to minimize overhead | Index email subjects only; full scan for other fields |
| High availability (HA) | Guarantee persistence and recoverability of raw + structured data | Replicated storage and metadata resilience |
| Small file problem mitigation | Optimize storage for large volumes of small files | Block-level storage of extracted archive contents |
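The last requirement, block-level storage of extracted contents, can be illustrated with a minimal sketch: many small files are packed into one large blob, and a small index records where each file lives. This is a generic illustration of the idea, not the storage layout used in this article or by any particular table format. The `pack` and `read_packed` helpers are hypothetical.

```python
import io
from typing import Dict, Tuple

# File name -> (byte offset inside the blob, length in bytes)
Index = Dict[str, Tuple[int, int]]


def pack(files: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate small files into one blob and record each file's range."""
    blob = io.BytesIO()
    index: Index = {}
    for name, content in files.items():
        index[name] = (blob.tell(), len(content))
        blob.write(content)
    return blob.getvalue(), index


def read_packed(blob: bytes, index: Index, name: str) -> bytes:
    """Fetch one file from the blob by slicing its recorded byte range."""
    offset, length = index[name]
    return blob[offset:offset + length]


blob, index = pack({"a.txt": b"alpha", "b.log": b"beta!"})
single = read_packed(blob, index, "b.log")  # reads one file's bytes only
```

The storage system now tracks one object instead of one per file, while individual files remain addressable through the index. Columnar table formats achieve a comparable effect by storing many rows in each data file.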
Based on these requirements, we reviewed various technologies:
📌 Technology Stack Comparison
| Stack | Raw File Access | SQL Support | Distributed Processing | Small File Handling | Partial Indexing | Data HA | Notes |
|---|---|---|---|---|---|---|---|
| HDFS + MapReduce | ✖ (Batch only) | ✔ (via Hive) | ✔ | ✖ | ✖ | ✔ | Legacy, not real-time optimized |
| HBase | ✔ (Row access) | ✔ (via Phoenix) | ✔ | ✔ | Only RowKey | ✔ | Complex RBAC, cluster maintenance overhead |
| Apache Iceberg | ✔ (Parquet-based) | ✔ | ✔ | ✔ | ✔ (Flexible) | ✔ | Meets all requirements |
| Parquet + S3 | Partial | ✖ (Needs engine) | ✔ | ✔ | ✖ | ✔ | Metadata management complexity |
After extensive review, we selected Apache Iceberg based on its operational simplicity, scalability, and feature richness. Although HBase was also considered due to our existing Hadoop infrastructure, its limitations in column-level queries, steep learning curve for RBAC, and operational overhead led to its exclusion.
Iceberg, while powerful, has its own limitations:
- Storage performance varies based on backend (e.g., S3, HDFS)
- Performance highly dependent on partitioning design
- Extra I/O cost for single-file access via Parquet format
- Requires separate metadata DB management
- No daemon—external control app must be implemented
- Lacks native indexing; relies on Parquet scan optimization
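The last point, and the dependence on partitioning design, can be illustrated with a simplified sketch of scan pruning. Instead of consulting a secondary index, a query engine can compare a filter against per-file minimum and maximum column values and skip files that cannot match. The `DataFileStats` class, the `ts` column, and the file paths below are hypothetical and do not reflect Iceberg's actual metadata structures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataFileStats:
    path: str
    min_ts: int  # lowest value of the hypothetical `ts` column in this file
    max_ts: int  # highest value of the hypothetical `ts` column in this file


def prune(files: List[DataFileStats], lo: int, hi: int) -> List[str]:
    """Keep only files whose [min_ts, max_ts] range overlaps [lo, hi]."""
    return [f.path for f in files if f.max_ts >= lo and f.min_ts <= hi]


files = [
    DataFileStats("part-0.parquet", 0, 99),
    DataFileStats("part-1.parquet", 100, 199),
    DataFileStats("part-2.parquet", 150, 300),
]
to_scan = prune(files, 120, 160)  # part-0 is skipped without being opened
```

Pruning of this kind is only effective when data is laid out so that value ranges rarely overlap across files, which is why partitioning and sort order have such a large effect on query performance.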
Of these trade-offs, we knowingly accepted two as compromises: 1) increased operational cost, and 2) no real-time indexing. We determined that flexible pipeline configuration at the application level would offset these constraints.
Based on this assessment, we concluded that Iceberg is the optimal solution for our evolving architecture.
📌 Upcoming Article <Adopting Apache Iceberg for Our Data Lakehouse - Part 2>: In Part 2, we will explore the technical challenges encountered during actual implementation—and how we overcame them.
🧑‍💻 Author: S2W KE Team
👉 Contact Us: https://s2w.inc/en/contact