
I'm designing a system for a client where the requirements are:

  • they upload a JSON file (one object/line)
  • make a call to an API with the JSON object as the payload
  • record the state (success/failure) of each API call in a database
  • make one retry if there's a failure.
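
Since the file is one JSON object per line, each payload can be read and dispatched independently. A minimal reader sketch (the function name and file handling here are just illustrative):

    import json

    def iter_payloads(path):
        """Yield one parsed JSON object per non-empty line of the uploaded file."""
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)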

I decided to build it out using celery and a sqlite database as the backend. The number of JSON lines is not large — perhaps a couple million at most — which will fit in memory. I have all the individual components working fine (can upload file, can read file, can call API, can write to db, etc.), but I am not sure about the overall architecture of dispatching tasks using celery.

Assuming there are N lines in the file, should I:

Option A:

  1. Create N objects in the database with a result column (initially null).
  2. Create N celery tasks, passing each object's id and payload as arguments.
  3. Have each task call the API and update the object's result field to success/failure.
  4. Let celery's retry feature attempt to call the API again in case of failure.
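
A rough sketch of what Option A could look like, using celery's built-in retry; the API URL, table/column names, and helper functions below are placeholders I've made up, not the real system:

    import sqlite3

    import requests
    from celery import Celery

    app = Celery("ingest", broker="amqp://guest:guest@localhost//")

    API_URL = "https://api.example.com/endpoint"  # placeholder endpoint
    DB_PATH = "state.db"                          # placeholder sqlite file

    def update_result(object_id, result):
        # Minimal sqlite update; the `payloads` table and columns are assumptions.
        with sqlite3.connect(DB_PATH) as conn:
            conn.execute("UPDATE payloads SET result = ? WHERE id = ?", (result, object_id))

    @app.task(bind=True, max_retries=1)
    def call_api(self, object_id, payload):
        # Step 3: call the API and record the outcome; step 4: celery retries once.
        try:
            resp = requests.post(API_URL, json=payload, timeout=30)
            resp.raise_for_status()
        except Exception as exc:
            if self.request.retries < self.max_retries:
                raise self.retry(exc=exc, countdown=10)  # re-queue this task once
            update_result(object_id, "failure")
        else:
            update_result(object_id, "success")

    def dispatch(rows):
        # Steps 1-2: `rows` is an iterable of (object_id, payload) pairs whose DB
        # records were already created with result = NULL; queue one task per row.
        for object_id, payload in rows:
            call_api.delay(object_id, payload)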

Option B:

  1. Create N objects in the database with a result column (initially null).
  2. Create one celery task and pass it the entire list of N object ids and payloads.
  3. Loop through all N objects and update the database with the result at each step.
  4. When that task finishes, it fires a one-time follow-up celery task that reads the database for all objects with a failure result and retries them.
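
And a rough sketch of Option B along the same lines, reusing the placeholder `app`, `API_URL`, `DB_PATH`, and `update_result` from the Option A sketch; it also assumes the raw JSON payload is stored in each row so the follow-up task can re-read it:

    @app.task
    def process_all(items):
        # Steps 2-3: one task owns the whole list of (object_id, payload) pairs.
        for object_id, payload in items:
            try:
                resp = requests.post(API_URL, json=payload, timeout=30)
                resp.raise_for_status()
                update_result(object_id, "success")
            except Exception:
                update_result(object_id, "failure")
        # Step 4: fire a one-time follow-up task that retries everything that failed.
        retry_failures.delay()

    @app.task
    def retry_failures():
        # Re-read failed rows from the database and try each of them once more.
        with sqlite3.connect(DB_PATH) as conn:
            failed = conn.execute(
                "SELECT id, payload FROM payloads WHERE result = 'failure'"
            ).fetchall()
        for object_id, raw in failed:
            try:
                resp = requests.post(API_URL, data=raw, timeout=30,
                                     headers={"Content-Type": "application/json"})
                resp.raise_for_status()
                update_result(object_id, "success")
            except Exception:
                update_result(object_id, "failure")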

I'm favoring option A because of its simplicity, but I don't know what the limits are on the number of celery tasks that can be scheduled, or whether the broker (RabbitMQ) will handle it. With option B, the big risk is that if the single celery task were terminated for any reason at some line M, then all of the following objects would never be tried.

Any thoughts on these two or if there's a third better alternative?

    1 Answer

    Option A sounds like the way to go, since you can scale the workers hitting the API independently, instead of having one huge task that only a single worker can manage.

    I have a very similar scenario using kombu and Celery:

    • Some integration posts a message to an RMQ queue
    • We have a kombu consumer draining events
    • When an event is received, we execute the callback (which posts to a local Python queue)
    • Celery gets the message sent over the Python queue and processes it
    • Once finished, it returns the result and we relay the message to a kombu producer
    • The producer posts the result back to RMQ

    As you can see, we use basically the same approach as you. We have this in production handling around 2000 messages per hour with no problem.
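
    Roughly, the consuming side looks like the sketch below; the exchange/queue names and the `process_event` task are illustrative, not our actual setup:

        from kombu import Connection, Exchange, Queue

        from tasks import process_event  # hypothetical module holding the Celery task

        exchange = Exchange("integration", type="direct")
        incoming = Queue("incoming", exchange, routing_key="incoming")

        def on_message(body, message):
            # Callback: hand the event over to Celery, then ack it on the RMQ side.
            process_event.delay(body)
            message.ack()

        with Connection("amqp://guest:guest@localhost//") as conn:
            with conn.Consumer(incoming, callbacks=[on_message]):
                while True:
                    conn.drain_events()  # block until the next event arrives

        # A separate kombu Producer later publishes the Celery result back to RMQ.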
