Modern data practice and the SQL tradition
- Beware the schemaless nature of NoSQL systems, which can easily lead to sloppy data modeling at the outset. Start with an RDBMS in the first place, preferably with a JSON data type and indices on expressions, so you can have a single database for both structured and unstructured data and maintain ACID compliance.
- Bring ETL closer to the data and be wary of decentralized data cleaning transformation. Push data cleaning to the database level wherever possible - use type definitions, set a timestamp with timezone policy to enable ‘fail fast, fairly early’, use modern data types such as date algebra or geo algebra instead of leaving that for Pandas and Lambda functions, employ triggers and stored procedures.
- Create more features at the query level to gain flexibility with different feature vectors, so that model selection and evaluation are quicker.
- Distributed systems like MongoDB and ElasticSearch can be money-hungry (both in terms of technology and human resources), and deployment is harder to get right with NoSQL databases. Relational databases are cheaper, especially for transactional and read-heavy data, more stable and perform better out of the box.
- Be very meticulous as debugging is quite difficult for SQL, given its declarative nature. Also, be mindful of clean code and maintainability.
Full post here, 13 mins read