Batch Processing Bank Statements: A Production Guide
Process hundreds of Indian bank statements reliably in production — concurrency, retries, error handling, and progress tracking with Lekha API. Full TypeScript code.
Parsing a single bank statement is easy. Parsing five hundred of them reliably — while handling password-protected PDFs, network blips, and provider rate limits — is a different problem entirely. This guide walks through a production-grade batch pipeline for Indian financial documents using the Lekha API.
Why Batch Processing Matters in Indian Fintech
Loan processors, wealth managers, and credit bureaus routinely need to process large document volumes in tight windows. A mid-size NBFC onboarding borrowers might receive 300–500 bank statements overnight. A lending platform running a portfolio refresh might need to re-extract 10,000 statements across a quarter's worth of data. These use cases share three hard constraints:
A naive implementation that fires all requests simultaneously and crashes on the first error satisfies none of these.
The Naive Approach and Why It Breaks
// Don't do this in production
const results = await Promise.all(files.map((file) => lekha.extract(file)));
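This pattern hits three problems at scale: every request fires at once, so a large batch runs straight into provider rate limits; Promise.all rejects on the first failure, so one bad document discards the results of every other extraction; and nothing is retried, so a transient network error fails a document permanently.
Building a Production Batch Pipeline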
Step 1: Install and Configure
bun add lekha-node p-limit p-retry
import Lekha from "lekha-node";
import pLimit from "p-limit";
import pRetry from "p-retry";
const lekha = new Lekha({ apiKey: process.env.LEKHA_API_KEY! });
// Tune concurrency to stay comfortably under rate limits.
// Lekha's paid plans support up to 20 concurrent requests.
const limit = pLimit(10);
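To make the effect of pLimit(10) concrete, here is a minimal sketch of what a concurrency limiter does internally. It is not p-limit's actual implementation; `createLimiter` and `demo` are hypothetical names used only for illustration, and the 5 ms timer stands in for a real extraction call.

```typescript
// Illustrative only: a tiny concurrency limiter in the spirit of p-limit.
// `createLimiter` is a hypothetical helper, not part of p-limit or Lekha.
type Task<T> = () => Promise<T>;

function createLimiter(maxConcurrent: number) {
  let active = 0; // slots currently in use
  const waiting: Array<() => void> = []; // parked callers, in FIFO order

  const release = () => {
    const next = waiting.shift();
    if (next) {
      next(); // hand the slot directly to the next waiter
    } else {
      active--; // nobody waiting, so free the slot
    }
  };

  return async function run<T>(task: Task<T>): Promise<T> {
    if (active >= maxConcurrent) {
      // Park until a finishing task hands over its slot.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      release();
    }
  };
}

// Runs 10 fake tasks through a limiter of 3 and reports the peak parallelism.
async function demo(): Promise<number> {
  const limit = createLimiter(3);
  let inFlight = 0;
  let peak = 0;
  const work = Array.from({ length: 10 }, () =>
    limit(async () => {
      inFlight++;
      peak = Math.max(peak, inFlight);
      await new Promise((resolve) => setTimeout(resolve, 5));
      inFlight--;
    }),
  );
  await Promise.all(work);
  return peak;
}
```

In the pipeline itself, keep using p-limit; the sketch only shows why at most N extractions are ever in flight, however many jobs are queued.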
Step 2: A Retryable Extract Function
Wrap each extraction in retry logic. Transient errors — network timeouts, 503s, occasional provider blips — should not fail a job permanently.
interface ExtractionResult {
id: string;
status: "success" | "failed";
data?: Record<string, unknown>;
error?: string;
attempts: number;
}
async function extractWithRetry(
id: string,
fileBuffer: Buffer,
fileName: string,
): Promise<ExtractionResult> {
let attempts = 0;
try {
const data = await pRetry(
async () => {
attempts++;
const result = await lekha.extract({
file: fileBuffer,
fileName,
documentType: "bank_statement",
});
return result;
},
{
retries: 3,
factor: 2,
minTimeout: 1000,
maxTimeout: 8000,
shouldRetry: (err: Error) => {
// Retry on transient errors, not on 4xx client errors
const status = (err as { status?: number }).status;
if (status && status >= 400 && status < 500 && status !== 429) {
return false;
}
return true;
},
},
);
return { id, status: "success", data, attempts };
} catch (err) {
return {
id,
status: "failed",
error: err instanceof Error ? err.message : String(err),
attempts,
};
}
}
The shouldRetry guard is important: a 400 Bad Request (e.g., an unsupported document format) will never succeed on retry — fail it fast and move on. A 429 Too Many Requests or 503 is worth retrying with backoff.
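The retry policy above can be restated as two small pure functions, which makes it easy to unit-test. This is a sketch: `isRetryableStatus` and `backoffDelays` are hypothetical helper names, and the delay calculation assumes plain exponential backoff with no jitter (p-retry's `randomize` option left off).

```typescript
// Hypothetical helpers mirroring the retry policy configured above.

// Decide whether an HTTP status (if any) is worth retrying.
function isRetryableStatus(status?: number): boolean {
  // No status usually means a network-level failure (timeout, reset): retry.
  if (status === undefined) return true;
  // 429 is in the 4xx range, but it signals throttling, so retry it.
  if (status === 429) return true;
  // Any other 4xx will fail again with the same input; everything else retries.
  return !(status >= 400 && status < 500);
}

// Delay before each retry, assuming exponential backoff without jitter.
function backoffDelays(
  retries: number,
  factor: number,
  minTimeout: number,
  maxTimeout: number,
): number[] {
  return Array.from({ length: retries }, (_, attempt) =>
    Math.min(minTimeout * factor ** attempt, maxTimeout),
  );
}

// With the settings used above: waits of 1000, 2000 and 4000 ms.
const delays = backoffDelays(3, 2, 1000, 8000);
```

With three retries the 8-second cap is never reached; it only matters if you raise `retries` to four or more.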
Step 3: Chunked Batch Runner with Progress Tracking
interface BatchJob {
id: string;
buffer: Buffer;
fileName: string;
}
interface BatchReport {
total: number;
succeeded: number;
failed: number;
results: ExtractionResult[];
durationMs: number;
}
async function runBatch(
jobs: BatchJob[],
onProgress?: (done: number, total: number) => void,
): Promise<BatchReport> {
const startTime = Date.now();
let completed = 0;
const tasks = jobs.map((job) =>
limit(async () => {
const result = await extractWithRetry(job.id, job.buffer, job.fileName);
completed++;
onProgress?.(completed, jobs.length);
return result;
}),
);
const results = await Promise.allSettled(tasks);
// allSettled preserves input order, so index i maps back to jobs[i].
const settled = results.map((r, i) =>
r.status === "fulfilled"
? r.value
: ({
id: jobs[i].id,
status: "failed",
error: String(r.reason),
attempts: 0,
} as ExtractionResult),
);
const succeeded = settled.filter((r) => r.status === "success").length;
return {
total: jobs.length,
succeeded,
failed: jobs.length - succeeded,
results: settled,
durationMs: Date.now() - startTime,
};
}
Promise.allSettled is the key difference from Promise.all — it collects every result regardless of whether individual tasks throw.
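The difference is easy to see in isolation. The following self-contained sketch uses stand-in tasks (`ok` and `boom` are hypothetical helpers, and no Lekha calls are made) to compare how the two combinators treat one failure among three tasks.

```typescript
// Stand-in tasks: one resolves, one rejects. No network calls involved.
const ok = (value: string) => Promise.resolve(value);
const boom = (message: string) => Promise.reject(new Error(message));

async function compare() {
  // Promise.all rejects as soon as one task rejects; the other values are lost.
  const allOutcome = await Promise.all([ok("a"), boom("bad pdf"), ok("c")])
    .then(() => "resolved")
    .catch((err: Error) => `rejected: ${err.message}`);

  // Promise.allSettled waits for every task and reports each outcome.
  const settled = await Promise.allSettled([ok("a"), boom("bad pdf"), ok("c")]);
  const statuses = settled.map((r) => r.status);

  return { allOutcome, statuses };
}
```

Here `allOutcome` ends up as "rejected: bad pdf", while `statuses` lists fulfilled, rejected, fulfilled: the two good results survive the one bad document.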
Handling Real-World Edge Cases
Password-Protected PDFs
Many Indian bank statements (HDFC, ICICI, Kotak) are password-protected using the account holder's date of birth. Lekha handles decryption automatically when you pass the password:
const result = await lekha.extract({
file: fileBuffer,
fileName: "hdfc_statement.pdf",
documentType: "bank_statement",
password: "01011990", // DOB in DDMMYYYY
});
In a batch context, store passwords alongside job metadata:
interface BatchJob {
id: string;
buffer: Buffer;
fileName: string;
password?: string; // optional, only for encrypted files
}
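One way to thread the optional password through to the extract call is a small mapping function. This is a sketch under assumptions: `ExtractRequest` only lists the fields used in this guide and is not the SDK's published type, `toExtractRequest` is a hypothetical helper, and `Uint8Array` is used so the snippet type-checks without Node typings (a Node `Buffer` is a `Uint8Array`).

```typescript
// Assumed request shape, based only on the fields used in this guide.
interface ExtractRequest {
  file: Uint8Array;
  fileName: string;
  documentType: string;
  password?: string;
}

// Same shape as BatchJob above, renamed to keep this snippet standalone.
interface PasswordAwareJob {
  id: string;
  buffer: Uint8Array;
  fileName: string;
  password?: string;
}

// Hypothetical helper: builds the extract() payload for one job.
function toExtractRequest(job: PasswordAwareJob): ExtractRequest {
  return {
    file: job.buffer,
    fileName: job.fileName,
    documentType: "bank_statement",
    // Only add the key when a password exists, so unencrypted files
    // are sent exactly as before.
    ...(job.password ? { password: job.password } : {}),
  };
}
```

Inside the retry wrapper you would then pass the result of `toExtractRequest(job)` to `lekha.extract` instead of building the object inline.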
Multi-Page Statements
Lekha's extraction engine handles multi-page PDFs natively — a 24-page ICICI statement gets parsed the same way as a 2-page one. You don't need to split pages before sending. What changes is response time: allow for longer timeouts on large documents.
const lekha = new Lekha({
apiKey: process.env.LEKHA_API_KEY!,
timeout: 120_000, // 2 minutes for large statements
});
Mixed Document Types in a Single Batch
If your batch contains a mix of bank statements, salary slips, and ITR documents, let the classifier decide rather than forcing a document type:
// Omit documentType to use automatic classification
const result = await lekha.extract({
file: fileBuffer,
fileName: job.fileName,
// no documentType — Lekha classifies automatically
});
// The response will include the detected type
console.log(result.documentType); // "bank_statement" | "salary_slip" | "itr" | ...
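Once every document in a mixed batch has been classified, you will usually want to route each type to its own downstream handler. A minimal sketch of that grouping step follows; `ClassifiedResult` and `groupByDocumentType` are hypothetical names, and only the `documentType` field is taken from the response shown above.

```typescript
// Assumed result shape: only `documentType` comes from this guide.
interface ClassifiedResult {
  id: string;
  documentType: string;
}

// Hypothetical helper: bucket document ids by the type the classifier reported.
function groupByDocumentType(results: ClassifiedResult[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const { id, documentType } of results) {
    const ids = groups.get(documentType) ?? [];
    ids.push(id);
    groups.set(documentType, ids);
  }
  return groups;
}

const grouped = groupByDocumentType([
  { id: "doc-1", documentType: "bank_statement" },
  { id: "doc-2", documentType: "salary_slip" },
  { id: "doc-3", documentType: "bank_statement" },
]);
// grouped.get("bank_statement") holds doc-1 and doc-3
```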
Observability: Know What's Happening
A batch that runs silently for 30 minutes and then crashes is useless. Wire in structured logging from the start.
import { createLogger } from "your-logger"; // winston, pino, etc.
const logger = createLogger({ service: "batch-extractor" });
async function extractWithRetry(
id: string,
fileBuffer: Buffer,
fileName: string,
): Promise<ExtractionResult> {
logger.info("extraction.start", {
id,
fileName,
sizeBytes: fileBuffer.length,
});
// ... retry logic ...
if (result.status === "success") {
logger.info("extraction.success", { id, attempts });
} else {
logger.warn("extraction.failed", { id, error: result.error, attempts });
}
return result;
}
Emit a summary at the end of every batch run:
const report = await runBatch(jobs, (done, total) => {
if (done % 50 === 0 || done === total) {
logger.info("batch.progress", {
done,
total,
pct: Math.round((done / total) * 100),
});
}
});
logger.info("batch.complete", {
total: report.total,
succeeded: report.succeeded,
failed: report.failed,
successRate: `${((report.succeeded / report.total) * 100).toFixed(1)}%`,
durationMs: report.durationMs,
throughput: `${(report.total / (report.durationMs / 1000)).toFixed(1)} docs/sec`,
});
Full Working Example
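Here is a complete script you can drop into a Bun project. It relies on the extractWithRetry and runBatch functions defined above; on Node.js, replace Bun.write with writeFile from fs/promises.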
import Lekha from "lekha-node";
import pLimit from "p-limit";
import pRetry from "p-retry";
import { readdir, readFile } from "fs/promises";
import path from "path";
const lekha = new Lekha({ apiKey: process.env.LEKHA_API_KEY! });
const limit = pLimit(10);
async function main() {
// Load all PDFs from a local directory
const dir = "./statements";
const files = (await readdir(dir)).filter((f) => f.endsWith(".pdf"));
const jobs = await Promise.all(
files.map(async (fileName) => ({
id: fileName,
buffer: await readFile(path.join(dir, fileName)),
fileName,
})),
);
console.log(`Starting batch: ${jobs.length} documents`);
const report = await runBatch(jobs, (done, total) => {
process.stdout.write(`\r${done}/${total} processed...`);
});
console.log("\n--- Batch Complete ---");
console.log(`Success: ${report.succeeded}/${report.total}`);
console.log(`Failed: ${report.failed}/${report.total}`);
console.log(`Time: ${(report.durationMs / 1000).toFixed(1)}s`);
// Write failures to a CSV for manual review
const failed = report.results.filter((r) => r.status === "failed");
if (failed.length > 0) {
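// Quote each field so commas or quotes in error messages don't break the CSV
const quote = (s: string) => `"${s.replace(/"/g, '""')}"`;
const csv = failed.map((r) => `${quote(r.id)},${quote(r.error ?? "")}`).join("\n");
await Bun.write("failed.csv", `id,error\n${csv}`);
console.log("Failures written to failed.csv");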
}
}
main().catch(console.error);
Swap the readdir section for your database query, S3 list, or queue consumer — the rest of the pipeline stays the same.
Performance Tuning
| Concurrency | 100 docs | 500 docs | 1000 docs |
| ----------- | -------- | -------- | --------- |
| 5           | ~4 min   | ~20 min  | ~40 min   |
| 10          | ~2 min   | ~10 min  | ~20 min   |
| 20          | ~60 sec  | ~5 min   | ~10 min   |
Times are approximate and depend on document complexity and page count. Start at 10 and increase only after verifying you're not hitting rate limit errors. Check your plan limits in the Lekha dashboard.
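The figures in the table are consistent with a simple model: documents are processed in waves of `concurrency`, and each wave takes roughly one document's processing time. The sketch below encodes that arithmetic. `estimateBatchSeconds` is a hypothetical helper, and the 12 seconds per document is a value inferred from the table (100 docs at concurrency 5 in about 4 minutes), not a published Lekha figure.

```typescript
// Back-of-the-envelope estimate; ignores retries, throttling and slow outliers.
function estimateBatchSeconds(
  docs: number,
  concurrency: number,
  avgSecondsPerDoc: number,
): number {
  // Each wave processes up to `concurrency` documents in parallel.
  const waves = Math.ceil(docs / concurrency);
  return waves * avgSecondsPerDoc;
}

// 500 docs at concurrency 10, ~12 s each: 50 waves * 12 s = 600 s (about 10 min).
const estimate = estimateBatchSeconds(500, 10, 12);
```

Measure your own average on a small sample first; long multi-page statements can shift it considerably.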
For very large batches (5,000+ documents), consider splitting into nightly sub-batches of 500–1,000 documents each rather than running one massive job. Smaller batches are easier to monitor, resume on failure, and don't risk breaching daily quota limits.
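Splitting a large job list into sub-batches needs only a small chunking helper. The sketch below is generic and makes no Lekha calls; `chunk` is a hypothetical helper name, and each resulting sub-batch could be handed to a batch runner on its own schedule.

```typescript
// Hypothetical helper: split a list into sub-batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 5,200 jobs in sub-batches of 1,000: five full batches and one of 200.
const jobIds = Array.from({ length: 5200 }, (_, i) => i);
const sizes = chunk(jobIds, 1000).map((batch) => batch.length);
```

Persisting which sub-batches have finished gives you a natural resume point if a nightly run is interrupted.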
FAQ
How many documents can Lekha process per day?
It depends on your plan. The Growth plan supports up to 5,000 extractions/month; the Scale plan is unlimited. See lekhadev.com/docs for current plan details. For burst-heavy use cases (overnight batch runs), contact the team for dedicated rate limit configuration.
What happens if a document fails after all retries?
The result object returns status: "failed" with the error message. The rest of the batch continues unaffected. Log the failure, write it to a dead-letter queue or CSV, and schedule a manual review or re-run.
Can I process non-PDF formats (images, Excel exports)?
Yes. Lekha accepts JPEG, PNG, and PDF. Pass the correct fileName extension so the API selects the right processing path. Excel bank statement exports (some co-operative banks provide .xlsx) are not currently supported — convert to PDF first.
Should I pre-split multi-page PDFs before sending?
No. Lekha handles multi-page documents natively. Splitting adds complexity and can break statements that span multiple pages (transactions continuing across a page boundary). Send the whole file.
Ready to run your first batch? Grab an API key and try the Lekha playground, or head to the docs to see all supported document types. For high-volume use cases, sign up for a Scale plan or reach out for a custom integration walkthrough.