
I have a software project A which makes API calls to a third-party software B that relies heavily on data stored on the file system. That software and the file systems are distributed across servers in different locations. I have been thinking of ways to avoid the file system and use a database instead, for example for storing BLOBs. Let's suppose the following scenario:

  1. A calls B.
  2. B needs a template file that is on the file system.
  3. If the template file is accessible on the file system, B processes it successfully. If not, it returns an error.

Within this scenario, I thought about two options:

  1. Using a database to store BLOBs of those template files. A would then extract the files from the database and save them on the file system whenever a call to B is made. After the call returns, A would update the template file in the database and delete it from the file system (see the sketch after this list). With this approach, I would pretty much only need to replicate/distribute the database storage.
  2. Using a file system with replication to all locations where A and B run. It would be easier for me to implement, but I would lose all the features a database offers, such as queries, statistics, and so on.
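
For concreteness, here is roughly what I imagine option 1 looking like. This is only a sketch: I am using SQLite and a made-up call_b() function just to keep the example self-contained, while the real setup would use the distributed database.

```python
import os
import sqlite3
import tempfile


def process_with_b(conn: sqlite3.Connection, template_name: str) -> None:
    """Option 1 sketch: stage a template from the database onto the file
    system, call B, then write the (possibly modified) template back and
    remove the local copy."""
    # Fetch the template BLOB from the database.
    row = conn.execute(
        "SELECT content FROM templates WHERE name = ?", (template_name,)
    ).fetchone()
    if row is None:
        raise FileNotFoundError(f"template {template_name!r} not in database")

    # Stage it where B expects to find it.
    staging_dir = tempfile.mkdtemp(prefix="b-templates-")
    path = os.path.join(staging_dir, template_name)
    with open(path, "wb") as f:
        f.write(row[0])

    try:
        call_b(path)  # hypothetical API call into B
        # Write the template back in case B modified it.
        with open(path, "rb") as f:
            conn.execute(
                "UPDATE templates SET content = ? WHERE name = ?",
                (f.read(), template_name),
            )
        conn.commit()
    finally:
        # Clean up the staged copy regardless of the outcome.
        os.remove(path)
        os.rmdir(staging_dir)
```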

I have been trying to find solutions for this problem, and I have come across DBFS on Oracle databases. It seems to create a file system interface for external access while, in reality, the files are stored and administered in a relational database. In theory it would solve my problems with file systems, but I am not able to test/use this product since I do not have an Oracle database. I have tried to look for similar features in other open-source databases, without any success. Maybe I am not searching correctly, or maybe there is a correct term for this that I am not aware of. It seems like a simple problem that many people have probably faced at some point.

  1. Am I going about this the right way?
  2. What are the best practices for dealing with file-system-dependent applications?
  3. Is there any way that I can make use of the advantages of databases while maintaining a file system accessible to this third-party software?
  • Making a driver? You could mount whatever as a file system if you make a driver.
    – Theraot
    Commented Feb 3, 2020 at 13:54
  • You mean making a driver to connect the file system to the database? I am actually looking for a more "out of the box" solution. It does not look like a unique problem to me...
    Commented Feb 3, 2020 at 14:04
  • Having blob storage outside the database is usually better. If you don't have access to blob storage in the cloud, perhaps you could set up an equivalent service with something like min.io. You can make the files accessible where you need them, and while databases can store large blobs, they aren't usually the most efficient at it.
    Commented Feb 3, 2020 at 14:13

1 Answer


There are a lot of trade-offs in dealing with files in a database. Because of some practical constraints, database vendors provide ways of storing files outside the database file structure to speed up transferring the files to the client: Oracle has DBFS; Microsoft has FILESTREAM, FileTable, and Remote Blob Store (RBS). The rough trade-offs are:

  • Files managed in the database are also maintained in the backups
    • Good for ease of backups
    • Bad for the size of backups, particularly if backups need to be transferred across the network within a specific period of time
  • Files rarely have the original filename while they are on disk
    • You can't rely on third parties accessing the file directly
  • Some databases are insanely expensive (looking at you Oracle)

To handle the templating problem, you have to decide whether the template files are data or configuration. The distinction determines how you solve the problem.

Treating template files as data

This assumption essentially relates the templates to specific records in your database. They need to be staged appropriately when your data is being processed, and then cleaned up afterwards so that they don't interfere with other processing. In this case, I would recommend wrapping the access to your 3rd party software with a service that will retrieve the templates from blob storage, and then clean up once the processing is done. That maintains the Single Responsibility Principle (SRP) so that the caller doesn't have to directly know about the templates.

In this case I do recommend external blob storage.

  • If you are hosting in a cloud use the blob storage capability with your cloud provider (AWS S3, Azure Blob Storage, GCP Cloud Storage)
  • If you are hosting on premises, use something like min.io to provide a distributed means of hosting the blob storage with the same interface as cloud blob storage

Typically, managed blob storage is far more robust than your own hosted solution (even in your database) and often much faster to work with. You don't have to worry too much about backups in that scenario. While min.io allows you to work with blob storage as if it were in the cloud, you'll have to manage the deployment yourself. That said, if you use an orchestration service like Kubernetes (k8s), you can still maintain a very robust deployment.
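
To make the wrapper idea concrete, here is a minimal sketch, assuming an S3-compatible store (AWS S3 or min.io, accessed with boto3) and a hypothetical run_third_party_tool() entry point. The bucket, keys, and environment variable names are placeholders, not part of any real API.

```python
import os
import shutil
import tempfile

import boto3  # works against AWS S3 and S3-compatible stores such as min.io


def process_with_templates(bucket: str, template_keys: list[str]) -> None:
    """Stage templates from blob storage, run the 3rd party tool, clean up."""
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ.get("BLOB_ENDPOINT"),  # e.g. a min.io URL; leave unset for AWS
    )

    staging_dir = tempfile.mkdtemp(prefix="templates-")
    try:
        # Pull every template the tool needs onto the local file system.
        for key in template_keys:
            local_path = os.path.join(staging_dir, os.path.basename(key))
            s3.download_file(bucket, key, local_path)

        # The caller never sees the templates; the wrapper owns their lifecycle.
        run_third_party_tool(template_dir=staging_dir)  # hypothetical entry point
    finally:
        # Remove the staged files so they cannot interfere with other runs.
        shutil.rmtree(staging_dir, ignore_errors=True)
```

The caller only asks the wrapper to process something; everything about where the templates live and how they are staged stays behind that one interface.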

Treating Template Files as Configuration

In this scenario, the template files just need to be present on the file system, and they are considered part of a valid deployment. This is where containers and orchestration (k8s) are a really good option for managing that. The container image would be built with the 3rd party tool and all the required template files. Each instance of the container deployed by your orchestration service will be configured identically.

This approach also works for desktop applications. The template files would be part of the installer in this case.
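
One small thing worth doing with this approach is failing fast when a deployment is missing a template. A minimal sketch, assuming a TEMPLATE_DIR environment variable and a made-up list of required file names:

```python
import os
import sys

# Hypothetical names; in practice this list is whatever B requires.
REQUIRED_TEMPLATES = ["invoice.tpl", "report.tpl"]


def verify_templates(template_dir: str) -> None:
    """Fail fast if the deployment is missing any template B depends on."""
    missing = [
        name
        for name in REQUIRED_TEMPLATES
        if not os.path.isfile(os.path.join(template_dir, name))
    ]
    if missing:
        sys.exit(f"invalid deployment, missing templates: {', '.join(missing)}")


if __name__ == "__main__":
    verify_templates(os.environ.get("TEMPLATE_DIR", "/opt/b/templates"))
```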

  • Thank you for the info. So, the reason why having a wrapper is better than DBFS, for example, is not clear to me (besides money, of course). Is there really no way of forcing the files stored on disk to have the same name as the original ones?
    Commented Feb 3, 2020 at 14:59
  • Again, what I was thinking about was an "out of the box" wrapper, fully integrated with a database... like a mask that presents a file system but is database storage under the hood. Maybe developing the wrapper is the only way.
    Commented Feb 3, 2020 at 15:01
  • Also, the second trade-off presented is (apparently) dealt with by Oracle's DBFS: "Files in the database can now be transparently accessed using any operating system (OS) program that acts on files. For example, ETL (Extract, Transform and Load) tools can transparently store staging files in the database."
    Commented Feb 3, 2020 at 15:06
  • @GabrielPimenta: The database manages the files on disk. For example, MS SQL Server stores the files by a GUID which is referenced in the FILESTREAM column. The original filename would have to be an additional attribute you store in the database if you want to restore it. I'm not as well versed in Oracle's DBFS, but I suspect the truth is similar, since all databases have to deal with files that have the same name but different content. It's still not clear to me whether in your case the files are data or configuration. That's why I gave you two options.
    Commented Feb 3, 2020 at 15:37
  • In my case the files are data. I want a third-party application to access database-managed files via a file system path, as if they were on a standard file system. From what you are describing, I see no downsides to using DBFS or FILESTREAM compared to a wrapper. Can you elaborate a little more on why you recommend wrappers?
    Commented Feb 3, 2020 at 16:26
