
Here's the scenario: I have an item x (item_id, customer_number, cost) that can be submitted to another system multiple times over a period of months. The other system may reject item x any number of times before finally accepting it at some point.

Reporting requirement: Given an arbitrary date range (start date and end date), return the number of unique items that were rejected in that range. Note that if item x was rejected three times in the range, we would only want to count it once.

Realistic Example data:

Item X Rejected 01/02/2014
Item X Rejected 01/03/2014
Item X Rejected 02/15/2014

If I run the report for 01/01/2014 to 02/01/2014, item x should contribute a count of only 1 for the range. It gets weird when I run the report for January and then also run it for February: the same item should show up as a count of 1 in each month, but if I run it for the first 3 months of the year, it should still show up as a count of only 1.
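To make the requirement concrete, here is a minimal sketch in Python rather than SQL. The `rejections` list and the `unique_rejected` helper are hypothetical stand-ins for the real table and report query, and the dates are read as MM/DD/YYYY.

```python
from datetime import date

# Hypothetical in-memory stand-in for the rejection log: (item_id, rejection date).
rejections = [
    ("X", date(2014, 1, 2)),
    ("X", date(2014, 1, 3)),
    ("X", date(2014, 2, 15)),
]


def unique_rejected(rows, start, end):
    """Count distinct items with at least one rejection in [start, end]."""
    return len({item_id for item_id, rejected_on in rows if start <= rejected_on <= end})


# The same item counts once per report, however many rejections fall in the range.
example_range = unique_rejected(rejections, date(2014, 1, 1), date(2014, 2, 1))
january = unique_rejected(rejections, date(2014, 1, 1), date(2014, 1, 31))
february = unique_rejected(rejections, date(2014, 2, 1), date(2014, 2, 28))
first_quarter = unique_rejected(rejections, date(2014, 1, 1), date(2014, 3, 31))
```

Each of the four reports yields 1 for item x, which is the behavior the report has to reproduce at scale.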

The problem: I am dealing with billions of records in the database. Normally we would just pre-calculate totals for the data and bucket them by month. We can't do that in this case because, for arbitrary date ranges, the monthly buckets can't simply be added together: in the prior example, totaling January and February would give a count of 2 for item x instead of the unique total of 1.
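The double counting can be shown with a small Python sketch. This is only an illustration under assumed names (`items_by_month`, `monthly_counts`), not a proposed solution: it shows that per-month counts lose the information needed to combine months, while per-month sets of item ids keep it.

```python
from collections import defaultdict
from datetime import date

# Hypothetical stand-in for the rejection log: (item_id, rejection date).
rejections = [
    ("X", date(2014, 1, 2)),
    ("X", date(2014, 1, 3)),
    ("X", date(2014, 2, 15)),
]

# Bucket by month, keeping the set of item ids seen in each month.
items_by_month = defaultdict(set)
for item_id, rejected_on in rejections:
    items_by_month[(rejected_on.year, rejected_on.month)].add(item_id)

# The usual pre-calculated total: one unique count per month.
monthly_counts = {month: len(items) for month, items in items_by_month.items()}

# Adding the monthly totals counts item X once for January and once for February.
summed = monthly_counts[(2014, 1)] + monthly_counts[(2014, 2)]

# Taking the union of the underlying sets counts item X only once.
true_unique = len(items_by_month[(2014, 1)] | items_by_month[(2014, 2)])
```

Here `summed` is 2 while `true_unique` is 1. Unique counts are not additive across buckets, so a plain monthly total table cannot answer range queries that span more than one bucket.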

Question: Is there a way to pre-calculate data for unique counts like this for arbitrary date ranges? Does anyone have any suggestions for optimization of reporting here (not involving throwing more hardware at it)?

We are using Oracle Database 11g.

  • If the only thing that matters is whether the item exists in the time range, why do you even count? (Maybe use "select count(*) ... group by item_id, customer_id" instead?) Also, how arbitrary are your date ranges? Are they always months? Or could it also be something like 15.1.2014 to 30.1.2014, or even calendar weeks? You could have buckets (for example, months) and a unique key on bucket_id (the month) and item_id (and maybe customer_id if that is relevant in the context) so duplicates are rejected at the db level. Commented Feb 23, 2015 at 19:08
  • This is about putting the data into a table for faster lookup. This is a table with 1.5 billion records. You can't just go group by and count(*), it would take tremendous hardware to make a query like that execute in the desired time range.
    – Reimius
    Commented Feb 23, 2015 at 22:31
  • CREATE OR REPLACE MATERIALIZED VIEW will be your friend here. Materialized View Concepts here. Share and enjoy. Commented Jun 2, 2015 at 21:48
  • A MATERIALIZED VIEW is just a method to hold the data, which is not what I'm looking for. I'm interested in a general algorithm on how to calculate the bucketed data, not the method to store it.
    – Reimius
    Commented Jun 3, 2015 at 16:38
