Every analyst has experienced the moment: you download a dataset from the data room, double-click the CSV, and watch Excel freeze. The file has a million rows, maybe more. The columns are dense with transaction-level detail — timestamps, SKUs, customer IDs, pricing data. Excel's hard limit of 1,048,576 rows per sheet is not just a technical constraint; it is a ceiling on the analytical depth your diligence team can achieve. When a deal depends on understanding granular customer behavior or transaction patterns, you need a different approach.
The simplest step up from Excel is Python with the pandas library. A basic pandas script can load, filter, and aggregate a multi-million-row dataset in seconds on a standard laptop. For analysts who are comfortable with Excel formulas, the transition is surprisingly gentle — VLOOKUP maps to `merge`, pivot tables to `pivot_table`, and conditional aggregation to `groupby`. The real advantage is not speed but reproducibility: a Python script documents exactly what you did to the data, making it trivial to rerun the analysis when the dataset is updated or when a partner asks you to cut the numbers a different way.
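The Excel-to-pandas mapping can be sketched in a few lines. This is an illustrative example, not a recipe from a real engagement: the table shapes and column names (`customer_id`, `sku`, `amount`, `segment`) are hypothetical stand-ins for a transaction extract and a customer master.

```python
import pandas as pd

# Hypothetical transaction-level extract; in practice this would come
# from pd.read_csv("transactions.csv") on the full data-room file.
transactions = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 103, 103],
    "sku": ["A", "B", "A", "C", "A", "B"],
    "amount": [120.0, 80.0, 200.0, 50.0, 75.0, 60.0],
})
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "segment": ["enterprise", "smb", "smb"],
})

# VLOOKUP equivalent: left-join the customer segment onto each transaction.
enriched = transactions.merge(customers, on="customer_id", how="left")

# Pivot-table equivalent: total revenue by segment and SKU.
revenue = enriched.pivot_table(
    index="segment", columns="sku", values="amount",
    aggfunc="sum", fill_value=0.0,
)
print(revenue)
```

The same two calls scale unchanged from six rows to six million; only the `read_csv` step gets slower.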
For datasets that exceed even pandas' comfortable range — say, tens of millions of rows or multiple gigabytes — SQL databases and cloud tools become necessary. Setting up a local PostgreSQL database takes minutes and lets you query massive datasets with the same SELECT statements that power enterprise analytics. Cloud platforms like BigQuery or Snowflake go further, enabling analysis of terabyte-scale datasets without any local infrastructure. The cost for a diligence-length engagement is typically under a hundred dollars.
The broader point is that data size should never be a bottleneck in diligence. The tools to handle large datasets are free or nearly free, well-documented, and learnable in days. When a million-row dataset lands in the data room, the correct response is not to sample it down to something Excel can handle — it is to bring the right tool to the full dataset. Sampling introduces bias. Aggregating too early hides the outliers that often matter most. In diligence, the signal is frequently in the tails of the distribution, and you will only find it if you look at all the data.
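The danger of sampling and premature aggregation can be made concrete with a toy example. The figures below are synthetic — the skew is the point, not the numbers: when revenue is concentrated in one account, the mean looks unremarkable and a sample that misses the whale tells an entirely different story.

```python
# 99 typical accounts plus one whale, a common shape for B2B revenue.
revenues = [100.0] * 99 + [10_000.0]

mean_revenue = sum(revenues) / len(revenues)   # modest-looking average
top_share = max(revenues) / sum(revenues)      # one account, ~half the book

# A 50-row "Excel-sized" sample that happens to miss the whale.
sample = revenues[:50]
sample_mean = sum(sample) / len(sample)

print(mean_revenue, round(top_share, 3), sample_mean)
```

The full dataset shows a mean of 199 and a single account carrying roughly half of total revenue; the sample reports a mean of 100 and no concentration at all. That concentration risk is exactly the kind of tail signal diligence exists to find.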