AI · Technology · RAG · Python

High Performance Python for RAG Data Pipelines

30 October 2025

Modern AI needs clean, well-prepared data. Retrieval-augmented generation and search assistants depend on good chunking and reliable embeddings. A slow or careless pipeline will hold the whole system back. This article upgrades a single sample application from plain Python to a fast, robust pipeline. Each step explains what it gives you, where it helps, and where it does not.

The sample application

We build a RAG prep engine. It ingests PDFs, images, HTML, and JSON. It runs OCR where needed. It cleans and normalises text. It detects language and removes boilerplate. It segments into chunks that respect headings and sentences. It deduplicates near copies. It computes embeddings and writes records with metadata to a vector store. The starting point is a simple Python script with loops, regexes, and ad hoc batching. It is correct, but slow on large corpora.

How to measure fairly

We define small, medium, and large corpora. We pin Python and library versions. We run on one machine with a stable power plan. We warm up models so caches and JITs settle. We record wall time, CPU time, memory, and p50, p95, and p99 latencies. We keep a baseline and a results table. We measure end to end and by stage. Chunk quality and embedding recall are part of the score, not only speed.

Start with algorithms and data

We treat algorithm choice as our biggest lever. We look past habit and search with our workload in mind. Regular expressions work, yet with many patterns they become costly, so we switch to Aho-Corasick to scan once and find them all (sketched below). We also replace row-wise pandas code with a columnar engine such as Polars or DuckDB, which plans vectorised, parallel steps and turns slow apply loops into fast queries. For similarity search at scale, we avoid brute force cosine checks and use approximate nearest neighbour methods such as HNSW via hnswlib or FAISS. We accept a tiny loss in recall for a large gain in speed, and we keep the change portable and easy to roll back.

We also remove work we do not need. Chunking uses sentence and heading boundaries, and we avoid splitting in the middle of tables and code. We keep target sizes in tokens, not characters; a range of 400 to 800 tokens is a good start. We deduplicate with content hashes and near-neighbour signatures such as MinHash. We precompile regexes, collapse duplicate patterns, and replace repeated substring checks with a trie or Aho-Corasick where patterns are many. These changes cut complexity and reduce Python overhead. They matter on long pages and large rule sets. They give little gain on tiny files.
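To make the multi-pattern switch concrete, here is a minimal sketch of an Aho-Corasick scan using the pyahocorasick package. The boilerplate phrases are illustrative, not part of the sample application.

```python
# Sketch: find every occurrence of many known phrases in one pass over the text.
# Assumes `pip install pyahocorasick`; the phrase list is illustrative only.
import ahocorasick

PHRASES = ["all rights reserved", "cookie policy", "subscribe to our newsletter"]

automaton = ahocorasick.Automaton()
for idx, phrase in enumerate(PHRASES):
    automaton.add_word(phrase, (idx, phrase))
automaton.make_automaton()  # build the failure links once, at start-up

def find_phrases(text: str) -> list[tuple[int, str]]:
    """Return (end_offset, phrase) for every match, found in a single scan."""
    return [(end, value[1]) for end, value in automaton.iter(text.lower())]

print(find_phrases("See our Cookie Policy. All rights reserved."))
```

The win comes from scanning the text once regardless of how many patterns there are, instead of once per regex.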
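The chunking targets reduce to a greedy packer over sentences. In the sketch below, split_sentences and count_tokens are crude placeholders; a real pipeline would use a proper segmenter and the tokeniser of its embedding model.

```python
# Sketch: pack sentences into chunks of roughly 400-800 tokens without ever
# splitting a sentence. The two helpers are simplistic placeholders.
import re

def split_sentences(text: str) -> list[str]:
    # Naive sentence splitter, for illustration only.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def count_tokens(s: str) -> int:
    # Rough stand-in: about one token per whitespace-separated word.
    return len(s.split())

def chunk(text: str, min_tokens: int = 400, max_tokens: int = 800) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and size + n > max_tokens:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += n
    if current:
        tail = " ".join(current)
        if chunks and count_tokens(tail) < min_tokens:
            chunks[-1] += " " + tail  # fold a short tail into the previous chunk
        else:
            chunks.append(tail)
    return chunks
```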
Make I/O efficient

The pipeline touches storage a lot. We use buffered reads and memory maps for large files. We batch writes to the vector store. We avoid chatty calls. We compress only when network or disk dominates time. These changes shine on slow disks or remote object stores. They add little when the job is already CPU bound.

Use fast Python, not clever Python

We move invariant work out of loops. We keep references to hot attributes in locals. We prefer built-in operations for joins and counts. We avoid creating short-lived objects in tight loops. These edits add small to medium gains with low risk. They do not fix bad algorithms. We keep the code readable so future engineers can maintain it.

Threads for I/O

Many steps wait on I/O. Network fetches, object storage, and database calls all block. Threads help because waiting releases the interpreter lock. We wrap I/O with a thread pool and cap the pool at a sensible size. This lowers wall time on high latency storage. It does not help for pure CPU work. Too many threads can cause context-switch overhead and subtle bugs. We keep it simple and measured.

Async for high concurrency

If the pipeline must fetch thousands of small objects, an async event loop helps. One process can handle many sockets with low memory use. It shines for HTTP APIs and object storage with small blobs. It does not help for heavy compute, and blocking calls in the loop will ruin it. It is a bigger change than a thread pool, so we use it only when concurrency is very high.

Processes for CPU work

OCR, sentence segmentation, language detection, tokenisation, and embedding pre- and post-processing burn CPU. Threads will not scale here because of the interpreter lock. A process pool spreads the work across cores. We pass small messages and share large read-only data where we can. This scales on multicore machines and keeps the UI or driver responsive. It adds pickling cost and start-up time, and it needs careful shutdown and retries. Sketches of the pool and async approaches follow.
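First, a minimal sketch of the two pool types, using only the standard library. fetch_blob and ocr_page are hypothetical stand-ins for an I/O-bound fetch and a CPU-bound stage.

```python
# Sketch: a thread pool for I/O-bound fetches, a process pool for CPU-bound work.
# `fetch_blob` and `ocr_page` are hypothetical stand-ins for real pipeline steps.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import urllib.request

def fetch_blob(url: str) -> bytes:
    # I/O bound: the thread releases the GIL while it waits on the network.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def ocr_page(blob: bytes) -> str:
    # CPU-bound placeholder: a real pipeline calls an OCR engine here.
    return blob[:100].hex()

def prepare(urls: list[str]) -> list[str]:
    # Cap the thread pool: beyond the storage concurrency limit, more threads
    # only add context-switch overhead.
    with ThreadPoolExecutor(max_workers=16) as tp:
        blobs = list(tp.map(fetch_blob, urls))
    # One worker per core by default; arguments and results are pickled,
    # so keep messages small.
    with ProcessPoolExecutor() as pp:
        return list(pp.map(ocr_page, blobs))

if __name__ == "__main__":
    # The guard matters: process pool workers re-import this module.
    print(prepare(["https://example.com"]))
```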
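For the very high concurrency case, a sketch with asyncio. We assume aiohttp as the HTTP client, and the cap of 100 in-flight requests is an illustrative choice, not a recommendation.

```python
# Sketch: fetch thousands of small objects with bounded concurrency.
# Assumes `pip install aiohttp`; the semaphore caps in-flight requests.
import asyncio
import aiohttp

async def fetch_all(urls: list[str], limit: int = 100) -> list[bytes]:
    sem = asyncio.Semaphore(limit)

    async def fetch(session: aiohttp.ClientSession, url: str) -> bytes:
        async with sem:  # wait here rather than open too many sockets
            async with session.get(url) as resp:
                return await resp.read()

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# bodies = asyncio.run(fetch_all(urls))
```

Note that one slow blocking call inside fetch would stall every other request on the loop, which is why we keep compute out of async code.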
Share memory wisely

Large lookup tables and dedupe signatures should not be copied for every worker. We keep them in shared memory or memory-mapped files. Workers read the same pages of memory. This reduces pressure on caches and the kernel. It adds lifecycle complexity, so we use it only for big, hot structures that do not change often.

Vectorise text math on CPU

Many normalisation and scoring steps are numeric. We convert to arrays and use NumPy. We use broadcasting for token counts, lengths, and simple transforms. We use numexpr for compound expressions to avoid temporary arrays. This gives real gains on medium and large arrays. It can be slower on tiny inputs due to set-up cost. Mixed dtypes can force copies, so we watch memory closely.

JIT compile hot loops

Some loops remain in Python form. We add Numba to compile them. We keep data in typed arrays and avoid Python objects inside the hot path. We can parallelise with prange when iterations are independent. The first call pays a compile cost, so we warm up at start. Numba brings strong speedups on numeric loops. It will not help loops that manipulate Python strings or complex objects.

Use SIMD through the right libraries

Modern CPUs execute wide operations on vectors. Good libraries already use these instructions. NumPy linked to a fast BLAS, fast tokenisers, and RapidFuzz-style distance functions can all use SIMD. This is ideal for dense math and distance calculations. It helps less on branch-heavy logic. Results may differ by tiny amounts due to evaluation order, so tests should allow small tolerances.

GPU for embeddings and vision

Transformer encoders run very well on a GPU. If embeddings dominate time, offload them. Batch texts to use the device well. Keep data on the device between steps if possible. A GPU does not help for small batches that fit in CPU cache, and copy time over the bus can erase gains. GPU memory is limited, so we stream batches and check for out-of-memory errors. We keep a CPU fallback to stay portable.

Move the last hot spot to native code

If a small part is still slow and resists JIT and vectorisation, write a tiny native kernel. Cython with typed memoryviews or Rust with pyo3 can handle a tight loop over bytes. This gives stable speed and no start-up cost. It adds toolchains and wheels per platform, so we only do this when the gain is clear and we have tests in place.

Scale out when a node is full

Dask or Ray can shard a large corpus by file or by day. They add a scheduler and network overhead. Scale out only after a single node has been pushed with all the methods above. Keep the same harness and the same datasets. Measure end-to-end time and cost. Be sure the cluster shortens the wall clock, not only the kernel time.

Cache and reuse work

Reprocessing costs money and time. We cache OCR output, content hashes, and dedupe signatures. We record embedding model versions and only re-embed when the model changes. We use content checksums to skip files that have not changed. Caching is powerful when corpora update incrementally. It gives little benefit when every document is new each run.

Reduce allocation and pause time

We prefer arrays of primitives and compact structures. We avoid building large lists of small objects. We preallocate buffers for streaming. We measure garbage collector activity before tuning; freezing long-lived objects at start can reduce pauses, but tuning without evidence risks making things worse.

Get free speed from the stack

We install NumPy linked to a fast BLAS. We set OpenMP threads to match cores. We pin process affinity during tests. We keep the CPU governor on performance mode. We document these choices so results are repeatable. We avoid exotic flags that reduce portability.

Keep results correct

Quality is part of performance. We keep golden documents. We test chunk boundaries against rules. We evaluate embedding recall on a fixed query set. We check numeric tolerances where SIMD and parallel order change results slightly. We log versions, seeds, and hardware. We keep a deterministic mode for audits.

When each option helps

Threads help when the pipeline waits on storage or the network. Async helps when the concurrency count is very high. Processes help for CPU-bound work such as OCR, tokenisation, and language detection. Vectorisation helps when data is already numeric and large enough. Numba helps for numeric loops that still need Python syntax. SIMD appears when the right libraries are in place. A GPU helps when embeddings dominate and batches are large. Native code helps for a tiny, stubborn inner loop. Distribution helps only after single-node work is done.

Compare apples with apples

We run the baseline and every variant on the same machine. We fix inputs, batch sizes, and seeds. We record mean, median, and tail latencies. We add memory and throughput. We capture cost when cloud time is part of the picture. We annotate every result with the change we made. We show wins and losses. We roll back when a change is fast but brittle. A sketch of a small timing harness follows.
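A minimal sketch of such a harness, standard library only. process_batch is a hypothetical stand-in for whatever stage or variant is under test.

```python
# Sketch: time a pipeline stage repeatedly and report mean and p50/p95/p99.
# `process_batch` is a hypothetical stand-in for the stage under test.
import time
import statistics

def process_batch(batch):
    time.sleep(0.01)  # placeholder workload

def benchmark(fn, batches, warmup: int = 3) -> dict:
    for b in batches[:warmup]:
        fn(b)  # warm caches, JITs, and model state before measuring
    samples = []
    for b in batches:
        start = time.perf_counter()
        fn(b)
        samples.append(time.perf_counter() - start)
    q = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {
        "mean": statistics.fmean(samples),
        "p50": q[49],
        "p95": q[94],
        "p99": q[98],
    }

print(benchmark(process_batch, batches=[None] * 50))
```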
This article was created by people. We have used artificial intelligence (AI) to help articulate our message and refine the text. AI was employed as a tool to assist with structuring, identifying grammatical and spelling errors, and improving readability. The final document has been carefully reviewed and approved by our team.