I'm writing a project for school: a Tinder-like app. I thought of implementing the database entirely in an RDBMS (PostgreSQL), but now I think it's a good opportunity to get to know NoSQL systems a bit. I'm closest to using Redis (it seems fast and friendly), but I'm also thinking about:

  • MongoDB (very popular)
  • ElasticSearch (seems to be interesting, it's said to handle complex searches)
  • Neo4j (just because I like graphs)

The problem is that I don't know how to shift from the queries I know to the NoSQL way of doing things.

What I thought of doing

Keep the "core" database in Postgres: the information about users in multiple tables, and maybe other structured info. The NoSQL part would store each user's current location and their reactions to partner recommendations.

Location stores the last geolocation data the user sent (either updating the user's location or inserting a new row with the location).

Suggestion stores a user's reaction to seeing another user's profile.

So for preparing partner suggestions, the SQL query would go like this (let's assume the query takes input_userid, input_gender_preference and input_location as bind parameters; it's just a prototype).

SELECT l.userid
FROM location AS l
WHERE ST_distance(l.location, input_location) < 1000
  AND l.gender = input_gender_preference
  AND NOT EXISTS (
    SELECT 'x'
    FROM suggestion s
    WHERE (s.userid = l.userid AND s.suggested_userid = input_userid)
       OR (s.userid = input_userid AND s.suggested_userid = l.userid AND s.approved = TRUE)
  )

OK, but how would I do this in any of the systems I listed?

For example

Should I:

  1. split suggestion into approved_by, rejected_by, duplicate entries in reversed key-value: user_approved, user_rejected
  2. get set of users seen by current user, from merging user_approved, user_rejected
  3. get set of users who rejected current user profile from rejected_by
  4. merge the sets above
  5. get a set of users nearby from location
  6. subtract (5) - (4)
  7. filter (6) by gender
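
The steps above can be sketched with plain Python sets, which may help judge how much data each step moves. This is only an illustration under assumptions: the dictionaries `approved`, `rejected`, `rejected_by` and `gender` are hypothetical in-memory stand-ins for whatever the key-value store would hold, and excluding the user themselves is an extra step not in the list.

```python
def suggest(user, nearby, approved, rejected, rejected_by, gender, wanted_gender):
    """Steps 2-7 as set algebra; every argument is a hypothetical stand-in."""
    # Step 2: everyone the current user has already reacted to.
    seen = approved.get(user, set()) | rejected.get(user, set())
    # Step 3: everyone who rejected the current user.
    rejected_me = rejected_by.get(user, set())
    # Step 4: merge the sets (also excluding the user, an added assumption).
    excluded = seen | rejected_me | {user}
    # Steps 5-6: nearby users minus the excluded ones.
    remaining = nearby - excluded
    # Step 7: filter by gender preference.
    return {u for u in remaining if gender.get(u) == wanted_gender}


approved = {"ann": {"bob"}}
rejected = {"ann": {"cid"}}
rejected_by = {"ann": {"dan"}}
nearby = {"ann", "bob", "cid", "dan", "eve", "fay"}
gender = {"ann": "f", "bob": "m", "cid": "m", "dan": "m", "eve": "m", "fay": "f"}

print(suggest("ann", nearby, approved, rejected, rejected_by, gender, "m"))
# prints {'eve'}
```

Redis has set commands such as SUNION and SDIFF, so steps 2 to 4 could probably run on the server rather than in the application, which might reduce the traffic.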

It seems like creating a lot of unnecessary traffic.

What I would ask of you

I ask you for starting points - how to handle such situations. I would like to try some non-SQL approaches to handling data.

  • The best starting point would be to ask yourself what aspects of the data set justify putting part of it into another form of storage.
    – Blrfl
    Commented Nov 2, 2018 at 10:51

1 Answer

I don't understand your data model in sufficient detail to give you concrete advice, but I can provide a few pointers about combining databases. Using multiple databases is perfectly fine because you can play to their individual strengths. Yet there are also a number of drawbacks:

  • By combining multiple databases and other tools, you are introducing extra complexity into your system. Dealing with this costs development time, and makes the system more difficult to set up: another database is one more thing that can fail.

  • You have turned your system into a distributed system. The databases might be inconsistent with each other, e.g. you cannot have foreign-key relationships that span across DBs. Your application has to explicitly implement any consistency requirements.

  • Since your data is spread over multiple databases, no single database can perform joins across it. Instead, you have to manually join the data in your application. This usually involves transferring more data than otherwise required. This isn't necessarily a problem, it just makes your work more complicated: you are re-implementing something that a relational database gives you “for free”.

  • Similarly, RDBMS have an awesome and well-understood feature set. NoSQL databases usually give up some of these features and guarantees to gain something else. It is always a tradeoff.

    E.g. you mention MongoDB, which is a document database. Instead of tables, you have collections of documents with largely unstructured data. The database does not enforce relationships between documents, and it limits which changes can be performed transactionally.

    And you mention Redis: Redis is very fast because the complete data set is always loaded into RAM. Persistence is optional. Unless you know what you are doing, Redis is unsuitable as a primary data store.

None of these reasons mean that you should not add a NoSQL component to your system, merely that you should consider the costs before you do so.

For storing the locations, you can make do with any kind of database as long as it allows efficient spatial queries, i.e. queries based on distance. No class of database inherently supports this, but many individual databases have GIS extensions. E.g. Redis supports spatial queries, but see my note about persistence above.
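
To make "spatial query" concrete, here is a minimal sketch of a radius search in plain Python. It uses the haversine great-circle formula and a linear scan over a hypothetical `locations` dictionary of user id to (latitude, longitude); a database with GIS support would use a spatial index instead of scanning every row, so treat this as an illustration of the question being asked rather than how any particular database answers it.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_nearby(locations, lat, lon, radius_m):
    """Return ids of users within radius_m of (lat, lon), by linear scan."""
    return [
        uid
        for uid, (ulat, ulon) in locations.items()
        if haversine_m(lat, lon, ulat, ulon) <= radius_m
    ]


# Hypothetical sample data: two users close together, one far away.
locations = {
    "a": (52.2297, 21.0122),
    "b": (52.2300, 21.0130),
    "c": (50.0647, 19.9450),
}
print(find_nearby(locations, 52.2297, 21.0122, 1000))  # prints ['a', 'b']
```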

If you decide to store locations separately from other user data, you would e.g. first query for users in a specific radius from your locations database, load that list into the application, then query the second database to filter out users on other criteria.

candidates = []
# Ask the second database about 100 nearby users at a time.
for candidate_batch in batch(100, locationdb.find_nearby_users(user_id)):
    candidates.extend(userdb.find_compatible(user_id, candidate_batch))

For a school project this is fine. For a real product, the queries to the second database might be prohibitively expensive: don't move your data to the code, but your code to the data.
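
The `batch` helper used in the loop above is not defined anywhere, so here is one possible sketch of it, together with a runnable version of the loop. The functions `find_nearby_users` and `find_compatible` are hypothetical in-memory stand-ins for the two databases, and the round-trip counter is added only to show how many queries the second database would receive.

```python
from itertools import islice


def batch(size, iterable):
    """Yield lists of at most `size` items taken from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def find_nearby_users(user_id):
    # Hypothetical stand-in for the location database: 250 nearby users.
    return (f"user{i}" for i in range(250))


def find_compatible(user_id, candidate_ids):
    # Hypothetical stand-in for the user database: keeps even-numbered users.
    return [c for c in candidate_ids if int(c[4:]) % 2 == 0]


def suggestions(user_id, batch_size=100):
    candidates = []
    round_trips = 0
    for candidate_batch in batch(batch_size, find_nearby_users(user_id)):
        candidates.extend(find_compatible(user_id, candidate_batch))
        round_trips += 1  # one query to the second database per batch
    return candidates, round_trips


found, trips = suggestions("me")
print(len(found), trips)  # prints 125 3
```

A larger batch size means fewer round trips but bigger individual queries, so the right value depends on the databases involved.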

Storing rejections/approvals separately sounds weird to me because this is very relational data, but it could feasibly work out. For example, you could first query your primary database for candidates, then query the rejection/approval database for one user's complete list of applicable suggestions and perform a join in your application:

candidates = []
rejections = suggestiondb.find_rejections(user_id)
# Keep only the compatible users that have not been rejected.
for candidate in userdb.find_compatible(user_id):
    if candidate not in rejections:
        candidates.append(candidate)

Whether any of these approaches is sensible is a numbers game. How much data will each sub-query produce? What fraction of the data will remain in the end? Your application will have to transfer and process all the results from the subqueries, which for a production app could be a lot. Again, this is not a problem for a student project, but e.g. Stack Overflow recently had an outage because they managed to saturate the network connection between their web servers and their database. I don't think they are doing joins in code, they simply have a lot of traffic – still, any technology has limitations.
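
To put rough numbers on this, a sketch like the following could report how many rows an application-side join moves compared with how many it keeps. The function name `join_in_app` and the sample sizes are made up for illustration and do not describe any real workload.

```python
def join_in_app(compatible, rejections):
    """Anti-join in application code, reporting how much data was moved."""
    rejected = set(rejections)  # a set gives constant-time membership tests
    kept = [c for c in compatible if c not in rejected]
    transferred = len(compatible) + len(rejected)
    stats = {
        "transferred": transferred,
        "kept": len(kept),
        "kept_fraction": len(kept) / transferred if transferred else 0.0,
    }
    return kept, stats


# Hypothetical sizes: 1000 compatible users, every fourth one rejected.
kept, stats = join_in_app(list(range(1000)), range(0, 1000, 4))
print(stats)
# prints {'transferred': 1250, 'kept': 750, 'kept_fraction': 0.6}
```

If the kept fraction is small, most of the transferred rows were wasted, which would be a hint that the filtering belongs in the database.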
