Adopting Apache Iceberg for Our Data Lakehouse - Part 1: Why We Chose Iceberg

2025.07.09

1. Introduction

In the cybersecurity domain, effective threat analysis requires access to diverse forms of data. S2W has long been collecting and analyzing various data types from multiple sources, including dark web forums and social media platforms. Since most of this data is acquired in unstructured formats, it must be systematically processed and structured before analysts can derive value from it.

This article presents the design and implementation process of a system built to support real-time and parallel analysis of data contained within large compressed archives.

2. Existing Structure and Emerging Requirements

2-1. Characteristics of Multi-Gigabyte Archive Files

Data shared via dark web and deep web sources frequently comes in the form of large compressed archive files. These files often contain leaked credentials, system breach logs, or re-packaged data dumps and exhibit the following characteristics:

Complex nested folder structures
Files with mixed extensions (e.g., .jpg, .txt, .docx) in a single archive
Specific folder or file naming patterns
Encrypted or DRM-protected content requiring contextual metadata for access

2-2. Initial Data Lake Design for Raw Data Preservation

(ETL vs ELT, https://aws.amazon.com/compare/the-difference-between-etl-and-elt/?nc1=h_ls)

Given the high likelihood that datasets circulated through these channels contain sensitive or leaked cybersecurity-relevant information, preserving the original state of the data is critical. However, the structure and format of these large compressed archives pose challenges: transforming and filtering data in real time at the point of ingestion would carry a significant risk of data loss.

To mitigate this, we adopted an ELT (Extract, Load, Transform) model rather than the traditional ETL approach—prioritizing preservation of raw data over immediate transformation.

2-3. Data Structuring Process

(Data Structuring Process for Large-Scale Compressed Files)

Structured data derived from large compressed files is indexed into our engine using the following three-step procedure:

1. Raw Data Ingestion: Archive files are stored in their original form along with contextual metadata (e.g., file type, password, source path).

2. Data Extraction: Selected information is parsed and structured using predefined modules.

3. Indexing & Access: Structured outputs are indexed and made accessible to applications and analysts.

This pipeline was initially designed to extract relevant data at ingestion time using a predefined transformation module.
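The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production pipeline: the names `RawArchive`, `extract_records`, and `build_index` are hypothetical, a ZIP archive built in memory stands in for a real dump, and the "index" is a plain dictionary rather than a search engine.

```python
import hashlib
import io
import zipfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawArchive:
    """Step 1: the archive is kept byte-for-byte, plus contextual metadata."""
    data: bytes
    source_path: str
    file_type: str = "zip"
    password: Optional[str] = None

    @property
    def sha256(self) -> str:
        return hashlib.sha256(self.data).hexdigest()


def extract_records(raw: RawArchive, wanted_suffixes=(".txt",)) -> list:
    """Step 2: parse only the members a (hypothetical) module cares about."""
    records = []
    with zipfile.ZipFile(io.BytesIO(raw.data)) as zf:
        for info in zf.infolist():
            if info.is_dir() or not info.filename.endswith(wanted_suffixes):
                continue
            text = zf.read(info).decode("utf-8", errors="replace")
            records.append({
                "archive_sha256": raw.sha256,
                "member": info.filename,
                "lines": text.splitlines(),
            })
    return records


def build_index(records: list) -> dict:
    """Step 3: a toy inverted index from line content to member paths."""
    index = {}
    for rec in records:
        for line in rec["lines"]:
            index.setdefault(line.strip(), []).append(rec["member"])
    return index


def make_sample_archive() -> bytes:
    """Build a tiny archive in memory so the sketch is self-contained."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("dump/creds.txt", "alice@example.com\nbob@example.com\n")
        zf.writestr("dump/photo.jpg", b"\xff\xd8\xff")
    return buf.getvalue()


raw = RawArchive(data=make_sample_archive(), source_path="forum/thread-1/dump.zip")
records = extract_records(raw)
index = build_index(records)
print(index["alice@example.com"])
```

Because the raw bytes are retained in step 1, step 2 can be rerun later with a different `wanted_suffixes` value or an improved parser. The cost, as the next section describes, is that every rerun has to open the archive again.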

2-4. Evolving System Requirements

Selective extraction of only analysis-relevant files proved highly efficient in terms of both processing speed and storage usage. It minimized dependency on high-cost indexing systems while still supporting relevant analytics workflows.

However, as our internal services grew more sophisticated, new demands on our data pipeline emerged:

Scalability of Analysis Modules: As our analysis engines evolved, so did their ability to detect patterns, parse new file types, and improve data extraction. This evolution rendered earlier structured datasets obsolete, requiring re-analysis using updated modules. Since our existing architecture required decompressing the entire archive each time, performance was significantly degraded due to I/O overhead.

Access to Original Files: Structured datasets often omit contextual details that analysts need. To retrieve original content, we had to manually extract and deliver the full compressed files, resulting in poor accessibility and response time.
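To make the access pattern analysts wanted more concrete, the sketch below lists the members of an archive and reads the first bytes of a single file without extracting everything. It uses Python's standard `zipfile` module purely as a stand-in; the helper names are hypothetical, and the archive is built in memory. ZIP keeps a central directory that makes per-member access cheap, whereas other formats (for example, solid 7z archives) may need to decompress preceding data first, so this should be read as an illustration of the goal rather than of any particular archive format.

```python
import io
import zipfile


def make_sample_archive() -> bytes:
    """Build a small in-memory archive so the example is self-contained."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("logs/app.log", "2025-07-01 login ok\n" * 3)
        zf.writestr("logs/db.log", "2025-07-01 query slow\n")
        zf.writestr("docs/readme.txt", "not a log\n")
    return buf.getvalue()


def list_members(archive: bytes, suffix: str) -> list:
    """List matching members using only the archive's central directory."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return [name for name in zf.namelist() if name.endswith(suffix)]


def read_member_head(archive: bytes, member: str, size: int) -> bytes:
    """Decompress only the first `size` bytes of one member."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        with zf.open(member) as fh:
            return fh.read(size)


archive = make_sample_archive()
log_names = list_members(archive, ".log")
head = read_member_head(archive, "logs/app.log", 10)
print(log_names, head)
```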

2-5. Small File Problem

To improve accessibility and support use cases requiring downloads of individual files, we considered extracting all content and storing it as separate files.

This approach has benefits:

1. Improved File-Level Accessibility: Analysts can quickly retrieve specific file types (e.g., PDFs only).

2. Simplified Analysis Pipelines: Eliminates the need for archive decompression logic in downstream applications.

However, simply extracting and storing all files posed technical challenges, particularly the well-known Small File Problem. This issue affects Hadoop, S3 object storage, ext4 filesystems, and other environments where large volumes of small files overwhelm metadata capacity.

For example, in erasure-coded object storage systems like MinIO, each data block must be accompanied by metadata identifying its associated object. Metadata volume therefore grows in direct proportion to the file count, multiplied by the number of shards per object.

In a 4:2 EC configuration, every object is split into four data and two parity shards—each with its own metadata. Storing 100,000 files as 1KB objects results in roughly 100MB of actual data but ~153MB of metadata.
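The arithmetic behind these figures can be reproduced as follows. The per-shard metadata size is not stated in this article; the 256 bytes used below is an assumption, chosen because it is the value that yields roughly 153MB for 600,000 shards. Actual overhead depends on the storage system and its version.

```python
DATA_SHARDS = 4
PARITY_SHARDS = 2
OBJECT_COUNT = 100_000
OBJECT_SIZE_BYTES = 1_000  # "1KB" read as 1,000 bytes

# Assumption: about 256 bytes of metadata per shard. This figure is not
# given in the article; it is simply what reproduces the ~153MB estimate.
METADATA_BYTES_PER_SHARD = 256

shards_per_object = DATA_SHARDS + PARITY_SHARDS
total_shards = OBJECT_COUNT * shards_per_object

data_mb = OBJECT_COUNT * OBJECT_SIZE_BYTES / 1_000_000
metadata_mb = total_shards * METADATA_BYTES_PER_SHARD / 1_000_000

print(f"shards: {total_shards}, data: {data_mb} MB, metadata: {metadata_mb} MB")
```

Under this assumption, the metadata is about 1.5 times the size of the payload, and the ratio worsens as the files get smaller because the per-shard cost is fixed.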

Such metadata overhead leads to inode exhaustion, degraded IOPS, and increased cache usage—ultimately harming scalability.

We therefore determined that simply extracting all files into storage would not provide a scalable solution.
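One general way to keep file-level access while avoiding one object per file is to pack many small files into a single larger blob and keep an offset index beside it. The sketch below illustrates that idea only; the `pack` and `read_one` helpers are hypothetical and are not the storage layout described in this article.

```python
import io


def pack(files: dict) -> tuple:
    """Concatenate files into one blob; return it with a name -> (offset, length) index."""
    blob = io.BytesIO()
    index = {}
    for name, content in files.items():
        index[name] = (blob.tell(), len(content))
        blob.write(content)
    return blob.getvalue(), index


def read_one(blob: bytes, index: dict, name: str) -> bytes:
    """Fetch a single file by slicing the blob; no other file is touched."""
    offset, length = index[name]
    return blob[offset:offset + length]


small_files = {
    "a/creds.txt": b"alice:secret\n",
    "a/notes.txt": b"hello\n",
    "b/app.log": b"2025-07-01 login ok\n",
}
blob, index = pack(small_files)
print(read_one(blob, index, "a/notes.txt"))
```

With this layout the object store sees one object instead of three, so per-object metadata is paid once, while the index still allows any single file to be retrieved by byte range.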

3. Architecture Redesign and Technology Stack Evaluation

To address the emerging requirements, our redesigned architecture had to ensure distributed scalability, real-time query support, and flexible indexing—all without sacrificing access to raw data. We summarized our architectural goals as follows:

📌 Key Architectural Requirements

Requirement	Details	Example
Real-time access to raw files	Support file listing and partial reads (seek) without decompressing archives	Extract specific .log files from a .7z archive containing thousands of files
SQL queryability	Run structured queries without preprocessing or manual transformation	Query CSV dump: SELECT * FROM dump WHERE email LIKE '%@google.com'
Distributed processing integration	Compatible with Spark, Flink, and other distributed engines	Parallel extraction of compressed files across hundreds of workers
Partial indexing support	Index only necessary attributes to minimize overhead	Index email subjects only; full scan for other fields
High availability (HA)	Guarantee persistence and recoverability of raw + structured data	Replicated storage and metadata resilience
Small file problem mitigation	Optimize storage for large volumes of small files	Block-level storage of extracted archive contents
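As a small illustration of the SQL queryability row, the sketch below runs the example query from the table against a CSV dump. It uses Python's built-in `sqlite3` module as a stand-in query engine, and the sample rows and the `load_csv` helper are invented for the example. In the target architecture the same kind of query would presumably be issued through a distributed engine against table storage, which is not shown here.

```python
import csv
import io
import sqlite3

# Invented sample data standing in for a leaked CSV dump.
CSV_DUMP = """email,password_hash
alice@google.com,5f4dcc3b
bob@example.org,e99a18c4
carol@google.com,d8578edf
"""


def load_csv(conn: sqlite3.Connection, table: str, csv_text: str) -> None:
    """Create a table from the CSV header and insert every data row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


conn = sqlite3.connect(":memory:")
load_csv(conn, "dump", CSV_DUMP)

# The pattern is passed as a parameter rather than interpolated into the SQL.
rows = conn.execute(
    "SELECT email FROM dump WHERE email LIKE ? ORDER BY email",
    ("%@google.com",),
).fetchall()
print(rows)
```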

Based on these requirements, we reviewed various technologies:

📌 Technology Stack Comparison

Stack	Raw File Access	SQL Support	Distributed Processing	Small File Handling	Partial Indexing	Data HA	Notes
HDFS + MapReduce	✖ (Batch only)	✔ (via Hive)	✔	✖	✖	✔	Legacy, not real-time optimized
HBase	✔ (Row access)	✔ (via Phoenix)	✔	✔	Only RowKey	✔	Complex RBAC, cluster maintenance overhead
Apache Iceberg	✔ (Parquet-based)	✔	✔	✔	✔ (Flexible)	✔	Meets all requirements
Parquet + S3	Partial	✖ (Needs engine)	✔	✔	✖	✔	Metadata management complexity

After extensive review, we selected Apache Iceberg based on its operational simplicity, scalability, and feature richness. Although HBase was also considered due to our existing Hadoop infrastructure, its limitations in column-level queries, steep learning curve for RBAC, and operational overhead led to its exclusion.

Iceberg, while powerful, has its own limitations:

Storage performance varies based on backend (e.g., S3, HDFS)
Performance highly dependent on partitioning design
Extra I/O cost for single-file access via Parquet format
Requires a separately managed metadata database for the catalog
No daemon—external control app must be implemented
Lacks native indexing; relies on Parquet scan optimization

Weighing these trade-offs, we accepted two compromises: 1) increased operational cost, and 2) no real-time indexing. We determined that flexible pipeline configuration at the application level would offset these constraints.

Based on this assessment, we concluded that Iceberg is the best fit for our evolving architecture.

📌 Upcoming Article <Adopting Apache Iceberg for Our Data Lakehouse - Part 2>: In Part 2, we will explore the technical challenges encountered during actual implementation—and how we overcame them.

🧑‍💻 Author: S2W KE Team
