I'm designing a system for a client where the requirements are:
- they upload a JSON file (one object per line; sample below)
- make a call to an API with the JSON object as the payload
- record the state (success/failure) of each API call in a database
- make one retry if there's a failure.
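For reference, each line of the uploaded file is a standalone JSON object, something like this (field names made up, just to show the format):

```
{"customer_id": 17, "amount": 12.50}
{"customer_id": 18, "amount": 7.25}
```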
I decided to build it out using celery and a sqlite database as the backend. The number of JSON lines is not large — perhaps a couple million at most — which will fit in memory. I have all the individual components working fine (can upload file, can read file, can call API, can write to db, etc.), but I am not sure about the overall architecture of dispatching tasks using celery.
Assuming there are N lines in the file, should I:
Option A:
- Create N objects in the database with a `result` column (initially null).
- Create N celery tasks, passing each one an object id and its payload as parameters.
- Make each task call the API and update the object's `result` field to success/failure.
- Let celery's retry feature attempt the API call again in case of failure (rough sketch below).
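To make option A concrete, here is roughly what I have in mind. This is only a sketch: `DB_PATH`, `API_URL`, the `records(id, payload, result)` table, and the helper names are placeholders I made up, and error handling is trimmed.

```python
import json
import sqlite3

import requests
from celery import Celery

app = Celery("ingest", broker="amqp://localhost//")

DB_PATH = "records.db"               # placeholder
API_URL = "https://example.com/api"  # placeholder for the real endpoint


def update_result(record_id, result):
    # each task opens its own connection; assumes a records(id, payload, result) table exists
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("UPDATE records SET result = ? WHERE id = ?", (result, record_id))


@app.task(bind=True, max_retries=1)
def process_line(self, record_id, payload):
    try:
        resp = requests.post(API_URL, json=payload, timeout=30)
        resp.raise_for_status()
    except Exception as exc:
        update_result(record_id, "failure")
        # on the first failure, let celery re-queue this task exactly once
        if self.request.retries < self.max_retries:
            raise self.retry(exc=exc, countdown=10)
        return
    update_result(record_id, "success")


def dispatch(path):
    # one row and one task per line of the uploaded file
    conn = sqlite3.connect(DB_PATH)
    with open(path) as fh:
        for line in fh:
            payload = json.loads(line)
            cur = conn.execute(
                "INSERT INTO records (payload, result) VALUES (?, NULL)", (line,)
            )
            conn.commit()  # row must exist before a worker can pick up the task
            process_line.delay(cur.lastrowid, payload)
```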
Option B:
- Create N objects in the database with a `result` column (initially null).
- Create 1 celery task and pass it the entire list of N object ids and N payloads.
- Loop through all N objects, calling the API and updating the database with the result at each step.
- When that task finishes, have it fire a one-time follow-up celery task that reads the database for all objects with a failure result and retries them (rough sketch below).
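Option B would look more or less like this, reusing `app`, `API_URL`, `DB_PATH`, and `update_result` from the sketch above (again, placeholder names, not real code):

```python
@app.task
def process_all(items):
    # items is the full list of (record_id, payload) pairs for the whole file
    for record_id, payload in items:
        update_result(record_id, "success" if call_api(payload) else "failure")
    # once the first pass is done, fire the one-time retry pass
    retry_failures.delay()


@app.task
def retry_failures():
    # re-read the failures from the database and try each of them once more
    with sqlite3.connect(DB_PATH) as conn:
        rows = conn.execute(
            "SELECT id, payload FROM records WHERE result = 'failure'"
        ).fetchall()
    for record_id, raw in rows:
        if call_api(json.loads(raw)):
            update_result(record_id, "success")


def call_api(payload):
    try:
        resp = requests.post(API_URL, json=payload, timeout=30)
        resp.raise_for_status()
        return True
    except Exception:
        return False
```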
I'm favoring option A because of its simplicity, but I don't know what the limits are on the number of celery tasks that can be scheduled, or whether the broker (RabbitMQ) can handle that many. With option B, the big risk is that if the celery task gets terminated for any reason at some line M, then none of the following objects will ever be attempted.
Any thoughts on these two or if there's a third better alternative?