11

The General Question

What are the differences between algorithms using data structures and algorithms using databases?

Some Context

This is a question that has been bugging me for some time, and I have not been able to come up with a convincing answer for it.

Currently, I am working on strengthening my understanding of algorithms that, of course, heavily involve data structures. These are basic structures such as Bag, Queue, Stack, Priority Queue, and Heap.

I also use databases on a daily basis to store the data that has been processed and submitted by the end-user or processed by the program. I retrieve and submit the data through a DAL, which has data structures of its own that are generated based on the tables in the database.

My questions come when I have the option to sort the data using the database to send it back to me ordered in either an ascending/descending fashion or retrieve and load the data into my logic, process this data in a priority queue, and heap sort all of it. Or another one would be to search for records using the database rather than loading a subset of the records and using something like binary search to find the record or records I am interested in.

In my mind, I would try to have as many operations take place on the database-end before sending it over because communication is expensive. This also makes me wonder when do you use algorithms and data structures strictly defined within your own logic rather to process data than that of the database's?

So here are the questions...

Questions

  1. What are the differences between data structures and databases?
  2. When do we use algorithms that use data structures defined solely within your own logic and not that of the database's?
  3. @Harvey post: When do the methods in the database become less efficient to use than methods in your own logic?
    • @mirculixx post: What makes a method efficient?
  4. @Harvey post: How is processing data with data structures faster than doing it in the database?

Clarifications

  1. @Grant post: The databases I normally work with are relational, and these questions are coming out of working with them. However, I do think these questions are applicable to any persistence framework (when I say framework, I mean it in the most general sense).

I know answers without a specific context are difficult. Food-for-thought, advice, or discussion points are mainly what I'm looking for and would be most appreciated!

4
  • The datomic.com database is closer to the user than the traditional relational ones. Are you only looking at the traditional databases?
    – Job
    CommentedJan 4, 2013 at 3:30
  • @Job No, relational databases are not the only thing that I'm considering here. It's more about understanding the difference between data structures in logic versus the data structures in the database/persistence unit.CommentedJan 4, 2013 at 3:37
  • As a general rule I would say - use a database if you can, but if it becomes too slow, then resort to using the data structures. Data duplication (e.g. caching) is bad because you have to keep the two in sync, so avoid it unless you cannot.
    – Job
    CommentedJan 4, 2013 at 3:50
  • Send data to a database only to sort it? Like driving around the block to change your mind?
    – user186205
    CommentedApr 6, 2016 at 16:31

4 Answers 4

19

Data Structures are, for the most part:

  1. Memory resident,
  2. Transient,
  3. Limited in size,
  4. Not re-entrant without adding concurrency mechanisms like locks or immutability,
  5. Not ACID compliant,
  6. Fast, if chosen carefully.

Databases are, for the most part:

  1. Disk-bound,
  2. Persistent,
  3. Large,
  4. Safely concurrent,
  5. ACID compliant, with transactional capabilities,
  6. Slower than data structures

Data structures are meant to be passed from one place to another, and used internally within a program. When was the last time you sent data from a web page to a web server using a database, or performed a calculation on a database that was entirely resident in memory?

Database systems use data structures as part of their internal implementation. It's a question of size and scope; you use data structures within your program, but a database system is a program in its own right.

6
  • Regarding the web page-to-web server remark, I agree you would not use the database there, but I do see the possibility of there being a servlet to handle or translate that data to persist to the database. It is between the middle-tier and the data-tier where things become a bit muddled. To simplify the question, when does the methods in the database become less beneficial to use than methods in the logic?CommentedJan 3, 2013 at 23:55
  • 1
    Well, that's the DAL's bread and butter, isn't it? DAL's exist to ease the transition between objects and database records. DAL's are good for about 80 to 90 percent of what you would want to do with a database but, for the remaining 10 to 20 percent, you might want to drop back to raw SQL or stored procedures, because it is more efficient.CommentedJan 3, 2013 at 23:57
  • In your example of sorting/filtering, you are correct that you probably want to do that kind of processing on the database server. But you would most likely still receive the result of that processing as some form of data structure.CommentedJan 3, 2013 at 23:59
  • The points you've given have been really informative. However, there is still something nagging at me about the methods (or algorithms) that work with the database directly or just with the data structures strictly within the logic or both. I am looking at item 6 of both the lists you put down, and the question that comes to mind is, how is one faster than the other? I have always perceived working with the data at the source is the fastest way to go about things. You can update within your post - I'll reread it.CommentedJan 4, 2013 at 0:32
  • 1
    Databases are slower for a number of reasons. Caching notwithstanding, you have to read the data from the disk, using a SQL statement that has to be compiled, having an execution plan frequently involving multiple tables. The process is much more complex. In addition, you generally still have to transfer the result over the wire, where you translate the data into data structures so that you can work with it.CommentedJan 4, 2013 at 1:00
6

What are the differences between data structures and databases?

On an abstract level, there is none - a database is a data structure.

On a specific level, databases typically have the purpose of persisting data, usually in a format that is optimized for either insertions, updates, retrieval, joining or some other purpose (or a combination).

E.g. if you compare a table in an RDBMS to say an array of data, the difference may be in the run-time of the algorithm, the amount of code you have to write, the amount of memory you need to run the algorithm, or the flexibility to work/access the data from outside of your program/algorithm.

When do we use algorithms that use data structures defined solely within your own logic and not that of the database's?

In tendency I would argue

a) to use a database if you need to persist data in way that is accessible beyond the run-time or purpose of the specific algorithm.

b) to use your own (in-memory) data structure if run-time speed matters, or persistence is not required

E.g. if your algorithm processes customer records, you may want to store those customer records (say to find all customer's in a particular area) for later use by some other program/algorithm and for an entirely different purpose (say to find the most valuable customers). In that case using a database to persist the data is probably a good idea.

Note, however, that there is the concept of in-memory databases that do not necessarily persist data, for performance reasons. E.g. Redis or HANA.

When do the methods in the database become less efficient to use than methods in your own logic?

The answer very much depends on the circumstances and the (type of) database in use. I would rephrase the question to "what makes a method efficient?" It then becomes an exercise of assessing the methods (=algorithm) you would use for you own data structure v.s. the methods used by the database. Also see next point.

How is processing data with data structures faster than doing it in the database?

Again, this depends on the specifics. In general, the processing of data that is in-memory, directly accessible to the process that runs your algorithm, is faster than sending a request to another process (in the same computer or across a network) and asking it to send back the results. However if the data already resides within the database, sending it a command - say an SQL statement to join two tables and calculate some aggregate function - and retrieving only a small summary or subset of the data may be much more efficient than first transferring all the data and calculating the results locally (using your own data structures).

    1

    Disk access is primarily what is most expensive in this operation, more often so than network access (http://serverfault.com/questions/238417/are-networks-now-faster-than-disks). Unless your database is not located on at least a 1 Gbps network and the same network as your web\application server, network performance will not matter as much as disk performance for larger datasets. Or if your data happens to reside on very fast solid state disks which will be faster then typical network access. Also, databases usually provide an IPC mechanism like named pipes instead of using TCP/IP if the database resides on the same server as your application server.

    If you can keep most of\the enire data structure in memory between requests then this will generally be your fastest bet. If you can't, then it is hard to beat a good database structure with normalized tables and proper indices for search and update performance on anything other than small sets of records, especially in a system with millions of records.

    Relational Databases typically use a B+ tree or a variant thereof under the hood and have many optimizations like data alignment on disk and buffer pools for frequently accessed records. This makes them excel at processing large datasets quickly, especially if aggregation or filtering is involved.

    2
    • Please tell me whether I got this right. Applying what you said, whenever I think about working with the data, if I can keep the working set cached in memory, that is faster. Otherwise, try to use the database to deliver those results or find some way to involve querying the database more?CommentedJan 4, 2013 at 14:51
    • @hulkmeister yes generally, unless the dataset is very small or the database is remote to your location on a slow network.CommentedJan 5, 2013 at 14:15
    0

    What do you mean by a database? Do you mean a relational database like MySQL, or SQL Server? A relational database is a meta-data structure that supports some subset of the operations defined by the relational model. The theory of the relational model which was mostly worked out by Edgar Codd in the 60s.

    The relational model is very general purpose and flexible, but that means it can't take any advantage of structure in the data or patterns of access. Data structures are useful when you know something about the data and how it will be accessed. For example, if you know the last data you put into a data structure will be the first data you want out you can use a stack.

    I called the relational database a meta-data structure because it is generally pretty big wad of software that uses lots of data structures like stacks, queues, trees, and lists to create the abstract data structure of a relational table.

    2
    • Sorry, just need a clarification on what "pretty bit wad" means in regards to the last paragraph?CommentedJan 4, 2013 at 3:16
    • @hulkmeister, sorry that should have been 'big' not 'bit'. the relational model is very abstract and fairly complex. Providing an implementation that actually performs adequately, particularly one that provides ACID ( (Atomicity, Consistency, Isolation, Durability) takes a lot of pretty sophisticated code running behind the scenes.
      – user4828
      CommentedJan 4, 2013 at 5:31

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.