
I have multiple data sources that I need to search across, returning the results to the client (a web app).

For example, the sources are:

  1. an elastic search index
  2. a sql database

Is there an efficient way to perform paging across two sources? At the moment I search one source, use its results to narrow the searchable items in the second, and only then page.

Alternative options:

  • Ideally, I would like to move one source into the other, but for various reasons (e.g. space constraints, pricing, etc.) this does not seem to be a viable option.
  • Disabling the search until more refined criteria are entered, so the returned result set is guaranteed to be smaller and paging matters less.

Without the paging, the performance of this part of the application is not great when the search criteria are more open.

Are there any approaches for this kind of searching?

  • I would say that the design is fundamentally flawed when you use two different sources. Instead, I would suggest indexing both sources in a separate index.
    – superhero
    Commented Dec 18, 2017 at 11:26
  • Can you elaborate on what exactly is in both sources? At first it looks like a contradiction, because pagination over multiple data sources implies that both sources contain the same data types (you can't paginate apples from one source and pears from the other). And if both sources hold the same data, why store it in two places? Something is definitely missing from this picture, so please explain the data model in more detail.
    – catta
    Commented Dec 17, 2018 at 13:54
  • It's over a year now since I asked this question. A postmortem: the conclusion I came to was that it's fundamentally not possible to paginate over multiple data sources while applying some sort of filtering. The solution I settled on was unioning the results of each data query on a common property (a shared id in my case) in memory, then applying pagination. Once I settled on this approach, I focused my efforts on speeding up the queries. I avoided scenarios that require holding large pieces of data in memory by making overly wide queries invalid at the UI & API level.
    – Prabu
    Commented Dec 17, 2018 at 14:47
  • In an ideal world, I would combine the data sources into a new data source, with some sort of event/messaging system to keep the computed data source up to date when the originals change. This would require larger changes and access to modify the way the original data sources are managed/accessed.
    – Prabu
    Commented Dec 17, 2018 at 14:50
  • Duplicate: softwareengineering.stackexchange.com/questions/279115/…
    – Mr. TA
    Commented Feb 26, 2023 at 17:56

5 Answers


Get the data into a single index

The simplest, no muss, no fuss solution.

But then again, why would anything Enterprise be simple?

Supply two result sets

The best you can do is to provide the first page of each source's results. If either source runs dry, simply return its set as empty. Don't be tempted to backfill with more results from the other source: if the dry source suddenly fills up again, users will be confused as to why some results are repeated.

Merge at the client

Alternatively, if you have some measure of control over the client, you can page through the API for both sources and use the quality metric to sort the returned data into pages for the user. You will need to ensure that you have the next item (or end of data) from both sources before emitting a page, to guarantee a correct merge. This places some burden on the user's computer, so make sure their systems are up to the intended load.
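
To make the merge concrete, here is a minimal sketch in Python, assuming two paged APIs that each return items pre-sorted by score, highest first, and an empty page when exhausted; fetch_page and the "score" field are placeholders, not a real client library. heapq.merge only emits an item after seeing the next candidate (or end of data) from both iterators, which is exactly the guarantee described above.

    import heapq

    def fetch_page(source, page_number):
        """Hypothetical transport call; swap in the real API client."""
        raise NotImplementedError

    def paged_items(source):
        """Yield items from one source, fetching pages lazily."""
        page_number = 1
        while True:
            page = fetch_page(source, page_number)
            if not page:
                return                      # end of data for this source
            yield from page
            page_number += 1

    def merged_pages(source_a, source_b, page_size):
        """Merge two pre-sorted sources into pages for the UI."""
        merged = heapq.merge(paged_items(source_a), paged_items(source_b),
                             key=lambda item: item["score"], reverse=True)
        page = []
        for item in merged:
            page.append(item)
            if len(page) == page_size:
                yield page
                page = []
        if page:
            yield page                      # final, possibly short, page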

Messy hack (here for completeness; avoid if at all possible)

There is a rather bad hack that you could do. It provides something close to the illusion of a unified data source, but it is woefully inefficient and breaks basic encapsulation. Add a parameter per data source to act as an item offset. To produce a page of N items, run a query against each data source for its offset plus N items, merge these in the API, and return the top N items along with updated offsets for the next page.
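
A hedged sketch of the hack, with placeholder query functions standing in for the real Elasticsearch and SQL calls; query_source_a, query_source_b, and the better comparator are illustrative names. Each request carries one offset per source, and the response returns the advanced offsets that the caller must echo back to get the next page.

    def query_source_a(offset, limit):
        raise NotImplementedError   # e.g. the Elasticsearch index

    def query_source_b(offset, limit):
        raise NotImplementedError   # e.g. the SQL database

    def get_page(n, offset_a, offset_b, better):
        # Over-fetch: up to n candidates from each source, both already
        # sorted by the comparison that `better` implements.
        a = query_source_a(offset_a, n)
        b = query_source_b(offset_b, n)
        result, i, j = [], 0, 0
        while len(result) < n and (i < len(a) or j < len(b)):
            if j >= len(b) or (i < len(a) and better(a[i], b[j])):
                result.append(a[i])
                i += 1
            else:
                result.append(b[j])
                j += 1
        # The new offsets record how far each source was consumed.
        return result, offset_a + i, offset_b + j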

Choices

Fight for a single index, and offer the two result sets as the fallback. Seriously, hide the fact that you could merge the data at the client or in the API. You don't want to have to undo that later: the business team will come to expect that you can do this for every data source, and will complain bitterly when it is no longer responsive and about the man-hours it costs to fix. It is simply best to deny them now and get the work done to support a single index going forward.

I don't think that you can come up with anything better here; basically, you're trying to solve an unsolvable math problem: joining two sets... without joining them.

Technologies like ElasticSearch were built to approach this problem by having a single data set to work on.

So the way I see it, you either have to join your data sources by feeding the data (at least partially) into some third cache, or live with where you are...

It's often very difficult or impossible to combine multiple data sources into a single index. For example, if the underlying sources change often, you end up having to maintain what is essentially a replication process, which brings a number of challenges of its own. In my case, I also have to query from 10+ database tables, and the sheer amount of work to combine all that data into a single table is cost-prohibitive. There is also the challenge of the "unified" index becoming too large.

I solved it like this (page size is 20 for conversation's sake):

• To get page 1, I get items 0-20 from each source, filtering and sorting during retrieval; I combine the results, applying sorting; the temporary result set might be something like 70 rows; I return the first 20 to the client.
• To get page 2, I get items 0-40 from each source, again filtering and sorting during retrieval, then combine with sorting; the temporary result is something like 130 rows; I return rows 20-40 from the temporary result set to the client.

This can be inefficient if the user pages very far ahead, both in retrieval load on the databases and in memory consumption, but thankfully this is a UI concern: users are unlikely to go beyond the first few pages, so it's not a problem in practice.
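
A minimal sketch of this scheme, with a hypothetical fetch_sorted standing in for the real per-source queries: every source is asked for the first page * PAGE_SIZE rows, filtered and sorted at the source, the partial results are merged and re-sorted, and the requested slice is returned.

    PAGE_SIZE = 20

    def fetch_sorted(source, limit):
        """Placeholder: run the filtered, sorted query, capped at limit."""
        raise NotImplementedError

    def get_page(page, sources, sort_key):
        limit = page * PAGE_SIZE        # page 1 -> 20 rows, page 2 -> 40 ...
        combined = []
        for source in sources:
            combined.extend(fetch_sorted(source, limit))
        combined.sort(key=sort_key)     # re-sort the merged temporary set
        start = (page - 1) * PAGE_SIZE
        return combined[start:start + PAGE_SIZE]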

• This can be optimized if you sort by time of entries: take the time of the last entry on a page, and when querying for the next page, query only for entries after that time. That way you always ask for the same number of rows.
  Commented Mar 20, 2023 at 9:02
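
For what it's worth, a sketch of that keyset idea: remember the timestamp of the last row on the current page and ask each source only for rows older than it, so every request fetches at most one page's worth per source. fetch_before and created_at are illustrative names, not a real API.

    def next_page(sources, last_seen_ts, page_size):
        candidates = []
        for source in sources:
            # e.g. SQL: WHERE created_at < :last_seen_ts
            #           ORDER BY created_at DESC LIMIT :page_size
            candidates.extend(source.fetch_before(last_seen_ts, page_size))
        candidates.sort(key=lambda row: row.created_at, reverse=True)
        page = candidates[:page_size]
        # The cursor for the next request is the last timestamp emitted.
        new_cursor = page[-1].created_at if page else last_seen_ts
        return page, new_cursor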

IMHO, doing pagination over multiple data sources, with filtering included, for key features that need to be performant is a design flaw; the data needs to be indexed in a single database to be queried efficiently. But if you have to do it, e.g. for analytics or reports over multiple databases, here are two solutions.

Postgres foreign data wrapper

In a project I faced this exact issue, where combining the data into a single database was forbidden. In the end we settled on using the postgres foreign data wrapper to establish a direct connection between the two databases.

How to use

As indices cannot be used efficiently this way, it is important to reduce the amount of data transferred over the network:

• Make the database with the larger amount of data the server, and the smaller one the client.
• Do NOT filter with join queries, as this will transfer huge amounts of data over the network.
• Use SELECT id ... WHERE statements on the client database to select only the primary keys for each row. Then run the same query on the server database, join the two sets of ids, and keep only the ids which exist in both sets (see the sketch after this list).
• Fetch the remaining attributes for each id and return the data.
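
A sketch of the id-intersection step from the list above; the three functions are placeholders for the real SELECT id ... WHERE <filter> statements and the final fetch, not actual postgres_fdw calls.

    def select_ids_on_client(filters):
        raise NotImplementedError   # SELECT id FROM ... WHERE <filters> (client DB)

    def select_ids_on_server(filters):
        raise NotImplementedError   # the same filter, run on the server DB

    def fetch_rows(ids):
        raise NotImplementedError   # fetch the remaining attributes by id

    def search(filters):
        # Only primary keys cross the network until the final fetch.
        ids = set(select_ids_on_client(filters)) & set(select_ids_on_server(filters))
        return fetch_rows(sorted(ids))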

To improve performance, a cache can be used on the server database's postgres instance. Another possibility is to use materialized views on the client database to precompute values in case the filtering does not change that often, e.g. filtering by customer up front to reduce the total amount of filtering needed per request.

Instead of querying the client database's tables directly, define a view on the client database and map the internal tables to it; the server database then queries only the view. This way the view works as the defined public API of the client database, a kind of service contract, which makes it easier to develop both databases independently as long as the internals are correctly mapped to the view.

GraphQL

Use GraphQL to merge the queries instead. This will be slower and more complex, as the aggregation is not done directly at the database layer, but it makes it possible to include other data sources such as REST APIs. Overall the issue stays the same: each data source needs to be queried separately for its superset of matching ids; then match up the ids from the different data sources, query the remaining data for the matched ids, apply pagination, and cache a couple of pages in the GraphQL layer.
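
A rough sketch of that aggregation in plain Python rather than a specific GraphQL library; the resolver wiring and fetch functions are assumptions, and filters must be hashable (e.g. a tuple) for the page cache to work.

    from functools import lru_cache

    def ids_from_rest(filters):
        raise NotImplementedError   # superset of matching ids from a REST API

    def ids_from_database(filters):
        raise NotImplementedError   # superset of matching ids from the DB

    def load_by_ids(ids):
        raise NotImplementedError   # fetch the remaining data for the matches

    @lru_cache(maxsize=32)          # cache a couple of pages in the API layer
    def resolve_page(filters, page, page_size):
        ids = sorted(set(ids_from_rest(filters)) & set(ids_from_database(filters)))
        start = (page - 1) * page_size
        return load_by_ids(ids[start:start + page_size])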


I'm suggesting a script like this to solve the problem; I'm not sure if it works for you.

    stock = [20, 22, 24, 26, 28, 30]
    fund = [1, 2, 10, 10, 15, 20]
    # merged, descending: 30, 28, 26, 24, 22, 20, 20, 15, 10, 10, 2, 1

    def market_paginated(page, per_page, stock_page, stock_start,
                         fund_page, fund_start, stock_count=0, fund_count=0):
        # Fetch one page's worth of candidates from each source, starting
        # at that source's own offset.
        stock_items = stock_paginated(stock_page, stock_start, per_page)
        fund_items = fund_paginated(fund_page, fund_start, per_page)
        current_stock = 0
        current_fund = 0
        result = []
        # Merge the two descending lists until the page is full or one
        # source is exhausted.
        while (current_stock < len(stock_items) and current_fund < len(fund_items)
               and len(result) < per_page):
            if stock_items[current_stock] > fund_items[current_fund]:
                result.append(stock_items[current_stock])
                current_stock += 1
                stock_count += 1
            else:
                result.append(fund_items[current_fund])
                current_fund += 1
                fund_count += 1
        # Drain whichever source still has items, if the page isn't full yet.
        while current_stock < len(stock_items) and len(result) < per_page:
            result.append(stock_items[current_stock])
            current_stock += 1
            stock_count += 1
        while current_fund < len(fund_items) and len(result) < per_page:
            result.append(fund_items[current_fund])
            current_fund += 1
            fund_count += 1
        # Return the page plus the per-source page numbers and offsets to
        # pass back on the next request.
        return (result,
                stock_count // per_page + 1, fund_count // per_page + 1,
                stock_count % per_page, fund_count % per_page,
                stock_count, fund_count)

    def stock_paginated(page, skip, take):
        stock.sort(reverse=True)
        start = (page - 1) * take + skip
        return stock[start:start + take]

    def fund_paginated(page, skip, take):
        fund.sort(reverse=True)
        start = (page - 1) * take + skip
        return fund[start:start + take]
