#database
14 posts

How to choose the right database

Relational (SQL-based) databases - such as MySQL, PostgreSQL, Oracle, and MS SQL Server - can be queried with SQL-like languages. They allow indexing and faster updating, but total querying time can be slow for large tables.
Read more

How to choose the right database

  • Relational (SQL-based) databases - such as MySQL, PostgreSQL, Oracle, and MS SQL Server - can be queried with SQL-like languages. They allow indexing and faster updating, but total querying time can be slow for large tables.
  • Their simple structure matches most data types and they support atomic transactions but not OOP-based objects. They require vertical scaling (10-100TB), which requires downtime when adding resources to machines.
  • NoSQL databases typically use JSON records with no common schema. There are four main types of NoSQL databases.
  • Document-oriented databases make each document into a JSON and allow indexing by field (so all records must have that field); they support big data analytics using parallel computations (examples are MongoDB, CouchDB, and DocumentDB).
  • Columnar databases store data by the column (querying by subsets of a column is fast), allowing better data compression. They are commonly used for data modeling and logging (an example is Cassandra).
  • Key-value databases allow only key-based querying, which is fast, and each record has a ‘time to live’ (TTL) field deciding when it gets deleted. They are great for caching, but because of using RAM storage, are expensive (examples include Redis and Memcached).
  • Graph databases use nodes to represent entities and edges to represent their relationship; they are good for knowledge graphs and social networks (examples include Neo4J or InfiniteGraph).

Full post here, 7 mins read

Modern data practice and the SQL tradition

Create more features at the query level to gain flexibility with different feature vectors, so that model selection and evaluation are quicker.
Read more

Modern data practice and the SQL tradition

  • Beware the schemaless nature of NoSQL systems, which can easily lead to sloppy data modeling at the outset. Start with an RDBMS in the first place, preferably with a JSON data type and indices on expressions, so you can have a single database for both structured and unstructured data and maintain ACID compliance.
  • Bring ETL closer to the data and be wary of decentralized data cleaning transformation. Push data cleaning to the database level wherever possible - use type definitions, set a timestamp with timezone policy to enable ‘fail fast, fairly early’, use modern data types such as date algebra or geo algebra instead of leaving that for Pandas and Lambda functions, employ triggers and stored procedures.
  • Create more features at the query level to gain flexibility with different feature vectors, so that model selection and evaluation are quicker.
  • Distributed systems like MongoDB and ElasticSearch can be money-hungry (both in terms of technology and human resources), and deployment is harder to get right with NoSQL databases. Relational databases are cheaper, especially for transactional and read-heavy data, more stable and perform better out of the box.
  • Be very meticulous as debugging is quite difficult for SQL, given its declarative nature. Also, be mindful of clean code and maintainability.

Full post here, 13 mins read

Eventual vs strong consistency in distributed databases

Ensure you replicate data for storage - in the case of databases, redundancy introduces reliability.
Read more

Eventual vs strong consistency in distributed databases

  • Ensure you replicate data for storage - in the case of databases, redundancy introduces reliability.
  • For consistency across multiple database replicas, a write request to any node should trigger write requests for all replicas.
  • In an ‘eventual consistency’ model, you can achieve low latency for read requests by delaying the updates to replicas, but you will risk returning stale data to read requests from some nodes if the update has not reached them yet.
  • With a ‘strong consistency’ model, write requests to replicas will be triggered immediately. However, they will delay subsequent read/write requests to any of the databases until the consistency is reached.

Full post here, 4 mins read

Databases that play nice with your serverless backend

Modern cloud-native DBs, that expose stateless REST APIs, work best for serverless computing.
Read more

Databases that play nice with your serverless backend

  • Modern cloud-native DBs, that expose stateless REST APIs, work best for serverless computing.
  • Amazon’s Aurora Serverless variant scales to demand, resizes to meet storage capacity and handles routine maintenance as a managed service. Its Data API feature works with SQL DBs.
  • With Amazon DynamoDB, designing your data to work it while supporting access patterns is hard. Complex queries can be tricky too.
  • Microsoft’s Azure Cosmos works with several APIs & supports multiple data models. It lets you choose the exact tradeoffs between performance & consistency.
  • Google’s Cloud Firestore works well for web and mobile apps. It has built-in security without requiring a backend and it syncs across clients in realtime.

Full post here, 7 mins read

The big bad guide on database testing

Check for data mapping, ACID properties and data integrity of your DB, and ensure they implement your business logic accurately.
Read more

The big bad guide on database testing

  • Check for data mapping, ACID properties and data integrity of your DB, and ensure they implement your business logic accurately.
  • The most common test techniques are transactions for the ACID properties, checks of keys and indices, parameters and results of stored procedures, evaluation of triggers and field constraints, etc.
  • Stress testing and performance testing are critical too.
  • SQL queries are the most reliable way to qualitatively test apps with low or medium complexity.
  • Use a database testing tool that considers the business, data access and UI as well as the database itself.

Full post here, 8 mins read