7. Chapter review
File handling stops being a special case once you have the shape: open in 'rb', validate before the wire, stream when the size is unbounded, surface progress, handle each error explicitly. You walked from a one-line basic_upload.py to a ProfilePictureUploader with magic-byte validation, then to a ProgressFileReader generator that keeps memory flat regardless of file size, then to safe downloads, concurrent batches, and a Receipt Scanner that turns a phone-camera image into a structured Python dict. That stack is the same one that powers production expense apps, invoice systems, and document-management platforms.
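As a reminder of what "validate before the wire" means in practice, here is a minimal sketch of a magic-byte check. The function name sniff_image_type and the signature table are illustrative assumptions, not the chapter's ProfilePictureUploader code; the byte signatures themselves are the standard ones for PNG, JPEG, and GIF.

```python
# Leading "magic bytes" for a few common image formats.
IMAGE_SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "gif": b"GIF8",
}


def sniff_image_type(path):
    """Return the detected image type, or None if no signature matches."""
    with open(path, "rb") as f:
        header = f.read(16)  # the longest signature above is 8 bytes
    for name, signature in IMAGE_SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None
```

Checking the first few bytes costs one small read and rejects a renamed .exe or an HTML error page before any network traffic happens.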
Key skills
- Binary data handling. Open non-text files in 'rb' mode without corruption; recognise why multipart/form-data encoding exists and when it is the right wire format.
- File upload patterns. Use files= on requests.post() to send a file, combine with data= for form metadata, and avoid the json=/files= mutual exclusion.
- Streaming and memory management. Build generator-based readers that yield fixed-size chunks so memory stays flat through arbitrarily large uploads. Pick a chunk size (8KB is a fine default) and own the trade-off.
- Progress feedback. Wrap file readers to track bytes-sent and surface percent-complete so the user knows the upload is alive during long transfers.
- Binary downloads. Use stream=True + iter_content() to download safely in chunks. Validate Content-Type before writing to disk so a server that lies about its response gets caught early.
- Concurrent batch processing. Use ThreadPoolExecutor with a bounded max_workers for upload-bound batches; capture per-file errors so one bad upload does not abort the run.
- Document-processing integration. Send a file to an OCR API, handle the success/error envelope, and apply regex-based parsing to lift structured fields out of the text.
Chapter review quiz
Seven questions across the patterns you built. They pressure-test architectural calls (when does streaming matter, why does the demo key throttle, what makes the multipart envelope tick) rather than pure recall. If you can answer confidently, you have the shape:
Why does the chapter use 'rb' for every open(), even when the file is plain text?
'rb' reads raw bytes with no text decoding and no newline translation. Text mode ('r') is unsafe for non-text files because it tries to interpret arbitrary bytes as characters in some encoding, which either raises UnicodeDecodeError or silently mangles the data; by default it also translates \r\n line endings to \n on read, which corrupts binary files. Using binary mode universally means the same code works for images, PDFs, and text without a per-file decision. The only cost is that you read bytes instead of str, which requests wants anyway for the multipart body.
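The difference is easy to see on a throwaway file. The sketch below writes a few bytes (a PNG signature plus two bytes that are not valid UTF-8), then reads them back three ways; the payload and file name are made up for the demonstration.

```python
import os
import tempfile

payload = b"\x89PNG\r\n\x1a\n\xff\xfe"  # PNG signature plus two non-UTF-8 bytes

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    with open(path, "wb") as f:
        f.write(payload)

    # Binary mode: the bytes come back exactly as written.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode with a permissive encoding: decoding succeeds, but
    # newline translation turns \r\n into \n, so the data has changed.
    with open(path, "r", encoding="latin-1") as f:
        text = f.read()

    # Text mode with UTF-8: the first byte is not valid UTF-8 at all.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

binary_round_trip_ok = raw == payload                    # True
text_round_trip_ok = text.encode("latin-1") == payload   # False
```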
You try requests.post(url, files=files, json=metadata) and the server returns a confusing 400. What is wrong, and what should you change?
A single HTTP request can have only one Content-Type header. files= switches the request to multipart/form-data; json= wants application/json. The two are mutually exclusive, so the server sees something other than what you intended (typically the multipart body without the metadata you expected), and you get a confusing 400 back. Fix: move the metadata into data=metadata. The fields then ride along as multipart form parts inside the same request.
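To see the fix without touching a real server, you can prepare the request and inspect it instead of sending it. This is a sketch: the URL, the "file" field name, and the user_id metadata are hypothetical placeholders, and build_upload_request is an illustrative helper rather than code from the chapter.

```python
import io

import requests

# Hypothetical endpoint, used only for illustration.
UPLOAD_URL = "https://api.example.com/upload"


def build_upload_request(file_obj, filename, metadata):
    """Prepare (but do not send) a multipart request with a file and form fields."""
    files = {"file": (filename, file_obj, "application/octet-stream")}
    # data= (not json=) so the metadata travels as multipart form parts.
    request = requests.Request("POST", UPLOAD_URL, files=files, data=metadata)
    return request.prepare()


prepared = build_upload_request(
    io.BytesIO(b"fake image bytes"), "avatar.png", {"user_id": "42"}
)
content_type = prepared.headers["Content-Type"]
```

The prepared Content-Type is multipart/form-data with a generated boundary, and prepared.body contains both the user_id part and the file part in one request.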
A user uploads a 600MB video. Your code calls f.read() and the upload silently disappears with no useful error. What is the failure mode, and why is streaming the fix?
f.read() loads the entire file into memory before requests sends a single byte. On a Hobby-plan or free-tier deployment with around 512MB of RAM, the kernel's OOM killer terminates the Python process before the upload finishes -- which is why "the upload disappears": the process is gone, so there is no error to return. Streaming via a generator (or any iterable that yields chunks) keeps memory pinned at the chunk size (8KB by default). The same code path handles a 1KB receipt and a 5GB video without changing behaviour.
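A minimal sketch of the streaming shape, assuming an 8KB chunk and an optional progress callback. The name read_in_chunks and the on_progress parameter are illustrative; the chapter's ProgressFileReader may be structured differently.

```python
import os

CHUNK_SIZE = 8 * 1024  # 8KB, the default the chapter uses


def read_in_chunks(path, chunk_size=CHUNK_SIZE, on_progress=None):
    """Yield the file in fixed-size chunks, reporting bytes read so far."""
    total = os.path.getsize(path)
    read_so_far = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file
            read_so_far += len(chunk)
            if on_progress is not None:
                on_progress(read_so_far, total)
            yield chunk
```

Passing the generator as data= to requests.post() sends the body with chunked transfer encoding, so only one chunk is held in memory at a time. Whether the receiving server accepts chunked uploads is something to confirm per API.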
What does stream=True on requests.get() actually defer, and why does that matter for downloads from URLs you do not control?
With stream=True, requests fetches the response headers and leaves the body unread until you call iter_content(). That gap lets you inspect status_code, Content-Type, and Content-Length and decide whether to keep going. For untrusted URLs, this is the difference between catching a content-type mismatch (server says JPEG, sends HTML error page) at the header stage and writing several gigabytes of garbage to disk before noticing.
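The header-first check can be sketched as a small helper that refuses to write anything until the status and Content-Type look right. The function name save_if_expected_type and the "image/" prefix are assumptions for illustration; it works on any object with status_code, headers, and iter_content(), which is what requests returns.

```python
def save_if_expected_type(response, dest_path, expected_prefix="image/", chunk_size=8192):
    """Write a streamed response to disk only if status and Content-Type look right."""
    if response.status_code != 200:
        raise ValueError(f"unexpected status {response.status_code}")
    content_type = response.headers.get("Content-Type", "")
    if not content_type.startswith(expected_prefix):
        raise ValueError(f"unexpected Content-Type {content_type!r}")

    written = 0
    with open(dest_path, "wb") as f:  # opened only after the checks pass
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            written += len(chunk)
    return written


# Typical use with requests (the URL is a placeholder):
#   with requests.get(url, stream=True, timeout=30) as response:
#       save_if_expected_type(response, "photo.jpg")
```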
You set max_workers=50 on a ThreadPoolExecutor for batch uploads to a third-party API and start hitting 429 Too Many Requests. What is the right way to size the pool, and why is "more threads = faster" wrong here?
The bottleneck for upload-bound work is not your CPU; it is the upstream API's rate limit. Many APIs allow only a limited number of concurrent requests per client, often somewhere between 3 and 20; once you exceed that, the server starts returning 429 and you get slower, not faster, because of the retries you now have to do. The right size is one or two below the documented concurrent limit or, if no limit is documented, the largest max_workers that does not produce 429s under load. The sequential floor (max_workers=1) is also fine for background scripts where total time does not matter; the win from concurrency is real but bounded.
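A bounded pool with per-file error capture can be sketched as follows. upload_batch and its upload_one parameter are illustrative names; upload_one stands for whatever single-file upload function you already have, and max_workers=4 is a placeholder to replace with a value below the API's documented limit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_batch(paths, upload_one, max_workers=4):
    """Run upload_one(path) for each path; return (results, errors) keyed by path."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[path] = exc
    return results, errors
```

Because each exception is caught where the future resolves, one failed upload lands in errors while the rest of the batch still completes.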
The Receipt Scanner falls back to OCRSPACE_API_KEY = "helloworld". What failure mode does a first-time reader most likely hit, and what is the fix?
"helloworld" is OCR.space's shared demo key, throttled across every developer trying the API for the first time. The most likely failure on a reader's first run is an "Auth failure" or "request throttled" response from the API. The chapter's code catches it in the IsErroredOnProcessing branch and surfaces the message. The fix is one form away: register for a free OCR.space key (around 25,000 requests per month at time of writing) and set OCRSPACE_API_KEY in your .env. The class picks it up via os.environ.get() with no code change.
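The key fallback and the error branch can be sketched together. This is not the chapter's Receipt Scanner class: extract_text is an illustrative helper, and apart from IsErroredOnProcessing the field names used here (ErrorMessage, ParsedResults, ParsedText) are assumptions about the shape of the OCR.space JSON envelope that you should confirm against the API documentation.

```python
import os

# Falls back to the shared demo key when no personal key is configured.
OCRSPACE_API_KEY = os.environ.get("OCRSPACE_API_KEY", "helloworld")


def extract_text(payload):
    """Return parsed text from an OCR.space-style JSON payload, or raise on error."""
    if payload.get("IsErroredOnProcessing"):
        # Assumed field: the message may arrive as a string or a list of strings.
        message = payload.get("ErrorMessage") or "unknown OCR error"
        if isinstance(message, list):
            message = "; ".join(str(m) for m in message)
        raise RuntimeError(f"OCR failed: {message}")
    # Assumed fields: one entry per page, each carrying its recognised text.
    results = payload.get("ParsedResults") or []
    return "\n".join(r.get("ParsedText", "") for r in results)
```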
The Receipt Scanner regex is r"(Total|Amount)[:\s]*\$?([\d,]+\.\d{2})". Name two real receipt formats that break it and explain why a production system would not rely on this regex alone.
The regex assumes a clean line with the literal word Total or Amount, an optional dollar sign, and exactly two decimal places. Real receipts break it in many ways: a German receipt with Gesamt or Summe instead of Total; a UK receipt where the same field is labelled Amount Due with a leading pound sign; a receipt with multiple Total lines (subtotal, tax, total) where the regex catches the first match instead of the right one; a handwritten or low-resolution receipt where OCR returns "Tota1" or "T0tal." Production receipt parsing leans on layout-aware specialist APIs (Mindee, Veryfi, AWS Textract) that combine OCR with field detection rather than free-text regex.
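Running the pattern against a few sample strings shows the failure modes directly. The sample receipts below are invented for illustration; the pattern is the one quoted in the question, compiled without any flags.

```python
import re

TOTAL_PATTERN = re.compile(r"(Total|Amount)[:\s]*\$?([\d,]+\.\d{2})")

# Made-up receipt snippets, one per failure mode discussed above.
samples = {
    "clean": "Total: $12.34",
    "german": "Summe 12,34 EUR",
    "uk": "Amount Due: £45.00",
    "multiple": "Sub Total: $10.00\nTax: $0.80\nTotal: $10.80",
    "ocr_noise": "Tota1: $9.99",
}

extracted = {}
for name, text in samples.items():
    match = TOTAL_PATTERN.search(text)  # search() returns the first match only
    extracted[name] = match.group(2) if match else None

# extracted == {"clean": "12.34", "german": None, "uk": None,
#               "multiple": "10.00", "ocr_noise": None}
```

Only the clean sample yields the right answer. The "multiple" case is the most dangerous one, because it returns a plausible but wrong value (the subtotal) instead of failing.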
Looking forward
Chapter 21 covered the outbound side of file work: your code pushes data out to an API and waits for a response. Chapter 22, Webhooks and Real-Time APIs, flips the direction. Instead of polling "is the OCR job done yet" every few seconds, you will build receive-only endpoints that let APIs push events to you the moment they happen -- the asynchronous-processing pattern referenced at the end of the receipt-scanner page.
Webhooks are how Stripe tells your application a payment succeeded, how GitHub notifies you of a new pull request, and how Slack delivers message events. The chapter covers receive-only endpoint design, signature verification (so you trust the sender is who they claim to be), and idempotent event handling. By the end, you will be able to both consume APIs (everything in the book so far) and accept their callbacks.