Monthly Archives: February 2014
- Introduction to NoSQL by Martin Fowler
- NoSQL Distilled@martinfowler.com
- NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence
I've been looking around NoSQL resources, and watched/read the ones above. They give a great explanation of the advantages and disadvantages of the NoSQL approach over traditional relational databases. The following are some notes.
Application development productivity
A lot of application development effort is spent on mapping data between in-memory data structures and a relational database. A NoSQL database may provide a data model that better fits the application’s needs, thus simplifying that interaction and resulting in less code to write, debug, and evolve.
For application developers, the biggest frustration has been what’s commonly called the impedance mismatch: the difference between the relational model and the in-memory data structures.
NoSQL adoption is often driven by its ability to scale across clusters, but development productivity is another major factor. Mapping the database to in-memory data structures has long been a pain in the neck during application development. Though ORMs (Hibernate, ActiveRecord, etc.) alleviate some of it, they still require a certain amount of care and effort to gain both productivity and good performance.
Aggregate orientation takes a different approach. It recognizes that often, you want to operate on data in units that have a more complex structure than a set of tuples.
As we’ll see, key-value, document, and column-family databases all make use of this more complex record.
Aggregate is a term that comes from Domain-Driven Design. In Domain-Driven Design, an aggregate is a collection of related objects that we wish to treat as a unit.
In relational databases, data is normalized and split across multiple tables. NoSQL databases instead store the primary data together with its related objects as a single item. This approach focuses on maintaining data integrity within each item, rather than relying on transactions spanning multiple independent tables and rows, as relational databases do. This aggregate-oriented approach makes it easier to distribute data over multiple cluster nodes.
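As a sketch of the idea (in Python, with hypothetical field names), an "order" aggregate keeps the line items and payment inside the order itself, instead of normalizing them into separate tables:

```python
# Hypothetical "order" aggregate: the line items and payment live inside
# the order, rather than in separate normalized orders / line_items /
# payments tables joined by foreign keys.
order_aggregate = {
    "id": 1001,
    "customer_id": 42,
    "line_items": [
        {"product": "NoSQL Distilled", "quantity": 1, "price": 35.00},
        {"product": "Refactoring", "quantity": 2, "price": 50.00},
    ],
    "payment": {"method": "card"},
}

# The whole aggregate is read and written as one unit, which is also the
# natural unit for distributing data across cluster nodes (e.g. by order id).
total = sum(i["quantity"] * i["price"] for i in order_aggregate["line_items"])
print(total)  # 135.0
```

Because everything the application needs for one order sits in one item, no cross-node join or multi-row transaction is needed to read or move it.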
A common statement about NoSQL databases is that since they have no schema, there is no difficulty in changing the structure of data during the life of an application. We disagree; a schemaless database still has an implicit schema that needs change discipline when you implement it.
The claim that NoSQL databases are entirely schemaless is misleading; while they store the data without regard to the schema the data adheres to, that schema has to be defined by the application, because the data stream has to be parsed by the application when reading the data from the database.
NoSQL's flexible schema lets you concentrate on the domain design, but a schemaless database still has an implicit schema that needs change discipline when you implement it. Also, since NoSQL's aggregated data is not normalized, analyzing the data from a perspective other than the primary key requires creating indexes (materialized views) for it. These factors need to be taken care of.
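The "implicit schema" point can be made concrete with a small Python sketch (the record and field names here are hypothetical): the schema is not declared anywhere, but it lives in every piece of application code that parses the stored data.

```python
import json

# A record from a "schemaless" document store. No schema is declared,
# but the reading code below still assumes one.
raw = '{"name": "Alice", "phone": "555-0100"}'

def contact_label(record: dict) -> str:
    # This function *is* part of the implicit schema: it assumes fields
    # named "name" and "phone" exist. Renaming a field in newly written
    # records without migration discipline breaks reads like this one.
    return f"{record['name']} <{record['phone']}>"

print(contact_label(json.loads(raw)))  # Alice <555-0100>
```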
Consistency and Availability
The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency.
NoSQL emphasizes the ability to distribute data. However, there is a trade-off between consistency and availability, and choosing between them is often a business decision about which matters more for the service being provided.
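A toy sketch of the trade-off (not a real database, just two in-memory replicas with a simulated partition) shows the two choices side by side:

```python
# Toy CAP sketch: two replicas that cannot reach each other during a
# network partition. A write must either be refused (consistency) or
# accepted locally and reconciled later (availability).
class Replica:
    def __init__(self):
        self.value = "v1"

a, b = Replica(), Replica()
partitioned = True  # the replicas cannot reach each other

def write_consistent(replica, peer, value):
    # Choose consistency: refuse the write while the peer is unreachable,
    # so the replicas never diverge -- at the cost of availability.
    if partitioned:
        raise RuntimeError("unavailable: cannot reach peer")
    replica.value = peer.value = value

def write_available(replica, value):
    # Choose availability: accept the write locally; reads from the
    # other replica may now return stale data until reconciliation.
    replica.value = value

write_available(a, "v2")
print(a.value, b.value)  # v2 v1 -> the replicas have diverged
```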
The above are nice introductory presentations about Docker from CTO Solomon Hykes.
an open-source project to easily create lightweight, portable, self-sufficient containers from any application. The same container that a developer builds and tests on a laptop can run at scale, in production, on VMs, bare metal, OpenStack clusters, public clouds and more.
according to the official site.
I was looking for a way to get many servers for testing Erlang/Elixir distribution features, so I've been experimenting with Docker lately. Docker provides an easier way to get multiple isolated environments, compared with Vagrant VMs (VirtualBox, or AWS EC2/DigitalOcean).
In the above presentation, the granularity of the "container" used to ship applications is discussed. One approach is to deploy an application together with a new VM; another is to deploy just the application package itself. Docker sits between them, using lightweight Linux containers.
Also, Docker focuses on shipping applications from container to container. The Docker Index provides a way to publish and fetch pre-configured containers. It's an interesting idea to isolate an application in a container and just ship it to another location in a unified form.
I had been trying Chef for a while for deploying applications (Elixir and Dynamo). Chef is a good tool, but it's a little difficult to write good cookbooks. It tries to hide the complexity of different OS environments, but it often involves environment-specific logic. Also, Chef's approach is to unify the application deployment procedure, which involves starting up new VMs many times. The procedure then becomes relatively heavyweight and takes time.
If Docker's approach can remove this pain by unifying the underlying environments, it's very interesting.
The downside of Docker's approach may be some complexity in connecting containers, due to the additional container layer on top of the VM. Chef Versus Docker at RelateIQ describes the port forwarding needed to expose services inside a container to the external world (around 6:00). It involves different ports at multiple layers, and seems a little difficult. Linking containers might be a solution, but I need more study and experimentation to understand it better.
I was playing around with Dynamo and AngularJS, which resulted in a simple scaffold application.
Dynamo itself doesn't have a datastore, so Ecto with PostgreSQL is used as well. Some features are missing compared with Rails' ActiveRecord, but CRUD operations just worked after adding some additional logic (e.g. Ecto doesn't seem to have built-in JSON support; the application above just manually extracts fields from the Ecto.Model).
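The manual field extraction looks roughly like this (a Python sketch, not the actual Elixir code; the `Crew` model and its fields are hypothetical stand-ins for the Ecto.Model):

```python
import json

# Hypothetical model record, standing in for an Ecto.Model row.
class Crew:
    def __init__(self, id, name, rank):
        self.id, self.name, self.rank = id, name, rank

def crew_to_json(crew):
    # No built-in JSON serialization in the framework, so the fields
    # to expose are picked out by hand, one by one.
    return json.dumps({"id": crew.id, "name": crew.name, "rank": crew.rank})

print(crew_to_json(Crew(1, "Luffy", "captain")))
```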
I tried to compare it with the previously created Rails version.
- Macbook Air
- PostgreSQL 9.3.2
- Fetching 10 JSON records
- 1000 requests with 100 concurrency (ab -n 1000 -c 100 http://localhost:xxxx/crews)
Then the [Requests per second] resulted in the following.
- Dynamo version: 481.90 #/sec
- Rails version: 172.42 #/sec
I haven't been able to match the conditions exactly (e.g. the total transferred bytes are larger in the Rails version), but there is a clear difference between them.
Server Software:
Server Hostname:        localhost
Server Port:            4000

Document Path:          /crews
Document Length:        420 bytes

Concurrency Level:      100
Time taken for tests:   2.075 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      565000 bytes
HTML transferred:       420000 bytes
Requests per second:    481.90 [#/sec] (mean)
Time per request:       207.510 [ms] (mean)
Time per request:       2.075 [ms] (mean, across all concurrent requests)
Transfer rate:          265.89 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   1.1      0       5
Processing:     7  197  27.4    201     254
Waiting:        7  197  27.4    201     254
Total:         12  198  26.7    201     257
Server Software:        thin
Server Hostname:        localhost
Server Port:            3000

Document Path:          /crews
Document Length:        412 bytes

Concurrency Level:      100
Time taken for tests:   5.800 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      787000 bytes
HTML transferred:       412000 bytes
Requests per second:    172.42 [#/sec] (mean)
Time per request:       579.982 [ms] (mean)
Time per request:       5.800 [ms] (mean, across all concurrent requests)
Transfer rate:          132.51 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.9      0       5
Processing:   280  557  53.8    559     783
Waiting:      216  502  54.4    506     735
Total:        284  558  53.3    559     785
A very nice keynote about using simple hand tools for productive work. It starts with examples from woodworking (like furniture) to describe the importance of understanding materials and tools well.
- The material in the programming world is the data structure. Persistent data with a solid structure is required.
- Functions and semantics are the gateway for operating on the materials.
Automation with heavyweight tools can be good for productivity, but it also carries certain costs. It changes the way we look at the problem, and we can lose some insight into what happens behind the scenes.
Functional programming languages like Clojure have nicely organized immutable/concurrent data structures with simple interfaces, compared with relatively complex sets of classes and objects. Simple data structures provide great flexibility.
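The same idea can be sketched in Python: a small immutable value plus plain functions, instead of a class hierarchy with mutating methods.

```python
# An immutable point as a plain tuple; "updating" it returns a new value
# instead of modifying the original in place.
point = (1.0, 2.0)

def translate(p, dx, dy):
    # A plain function over the data structure; no mutation, no class needed.
    x, y = p
    return (x + dx, y + dy)

moved = translate(point, 1.0, 0.0)
print(point, moved)  # (1.0, 2.0) (2.0, 2.0) -- the original is untouched
```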
The automation part reminds me of Chef, which I was trying out recently. It's a good automation tool built around a simple idempotency policy, but it involves much complexity behind the scenes. Blindly using such tools might be dangerous without understanding the underlying architecture. Or, applying a simpler type of approach like Sunzi would be another option.
We may have to carefully select the tools and materials, before going deeper into the problem.
A nice talk on Vim and tmux features. I've tried Vim several times, but ended up with Sublime after a certain amount of struggling. But every time I see an expert Vim user at work, I become inclined to try it again.
I hadn't tried tmux before, and played around with it after watching this. Basically it works nicely, but I haven't found a key factor that would make me switch. I don't usually use that many windows, and basic terminal tabs plus the following bashmarks are good enough. Maybe I need to look into it a little more deeply.