Understanding NoSQL

Published: 2020-12-13 13:45:10 | Category: Encyclopedia | Source: compiled from the web

1.What is NoSQL

Agenda

Common Traits

Consistency

Indexing

Queries

MapReduce

Sharding

NoSQL Common Traits

Non-relational

Non-schematized/schema-free

Eventual Consistency

Open source

Distributed

"Web scale"

Developed at big internet companies

Consistency

CAP Theorem

Databases may only excel at two of the following three attributes:

consistency, availability and partition tolerance

NoSQL does not offer 'ACID' guarantees

Atomicity, consistency, isolation and durability

Instead offers 'eventual consistency'

Similar to DNS propagation
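
To make 'eventual consistency' concrete, here is a minimal sketch with two in-memory replicas. The Replica class, the sync function and the last-writer-wins version numbers are all assumptions made for illustration; real NoSQL products reconcile replicas in their own ways.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the data; each key maps to a (version, value) pair."""
    store: dict = field(default_factory=dict)

    def write(self, key, value, version):
        # Last-writer-wins: keep the value carrying the highest version.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions so the replicas converge."""
    for key, (version, value) in list(a.store.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.store.items()):
        a.write(key, value, version)


east, west = Replica(), Replica()
east.write("profile:42", "old address", version=1)
sync(east, west)

east.write("profile:42", "new address", version=2)  # only east has seen this
stale = west.read("profile:42")                     # west still serves the old value

sync(east, west)                                    # propagation catches up
fresh = west.read("profile:42")
```

Between the second write and the second sync, a reader hitting the west replica sees stale data, much like a DNS change that has not propagated yet.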

Indexing

Most NoSQL databases are indexed by key

Some allow so-called 'secondary' indexes

Often the primary key indexes are clustered

HBase uses the Hadoop Distributed File System (HDFS), which is append-only

Writes are logged

Logged writes are batched

File is re-created and sorted
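
The logged, batched, re-sorted write path above can be sketched as follows. AppendOnlyTable and its batch_size parameter are hypothetical names, and the sketch keeps everything in memory; it is not how HBase is actually implemented.

```python
class AppendOnlyTable:
    """Toy write path: log writes, batch them, then rebuild a sorted file."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.log = []          # pending writes, in arrival order
        self.sorted_file = []  # (key, value) pairs, sorted by key

    def put(self, key, value):
        self.log.append((key, value))      # writes are logged, never updated in place
        if len(self.log) >= self.batch_size:
            self.flush()                   # logged writes are applied as a batch

    def flush(self):
        merged = dict(self.sorted_file)
        merged.update(self.log)            # later writes override earlier ones
        self.sorted_file = sorted(merged.items())  # the file is re-created, sorted
        self.log = []

    def get(self, key):
        # Check the newest pending writes first, then the sorted file.
        for logged_key, value in reversed(self.log):
            if logged_key == key:
                return value
        return dict(self.sorted_file).get(key)


table = AppendOnlyTable(batch_size=3)
for key, value in [("row9", "a"), ("row2", "b"), ("row5", "c"), ("row2", "d")]:
    table.put(key, value)
```

After the third put the log is flushed into a file sorted by key; the fourth write sits in the log until the next batch.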

Queries

Typically no query language

Instead, you write a procedural program

Sometimes SQL is supported

Sometimes MapReduce code is used ...

MapReduce

Map step: split the query up

Reduce step: merge the results

Most typical of Hadoop and used with Wide Column Stores, especially HBase

Amazon Web Services' Elastic MapReduce (EMR) can read/write DynamoDB, S3 and Relational Database Service (RDS)

"Hive" offers HiveQL, a SQL-like abstraction over MapReduce

Use with Hive tables

Use with HBase
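
The map and reduce steps described in this section can be sketched in a single process. The function names, the shuffle helper and the sample records are assumptions for illustration; Hadoop distributes the same steps across many machines.

```python
from collections import defaultdict


def map_step(partition):
    """Run on each partition independently: emit (key, 1) per record."""
    return [(record["language"], 1) for record in partition]


def shuffle(mapped_partitions):
    """Group the emitted values by key before reducing."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Merge all values emitted for one key into a single result."""
    return key, sum(values)


# Two partitions, as if the data lived on two different servers.
partitions = [
    [{"language": "C#"}, {"language": "Java"}],
    [{"language": "C#"}, {"language": "Python"}, {"language": "C#"}],
]

mapped = [map_step(partition) for partition in partitions]
counts = dict(reduce_step(key, values) for key, values in shuffle(mapped).items())
```

A Hive query with a GROUP BY clause expresses the same computation declaratively and leaves the generation of the map and reduce steps to the engine.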

Sharding

A partition pattern where separate servers store partitions

Fan-out queries supported

Partitions may be duplicated, so replication is also provided

Good for disaster recovery

Since 'shards' can be geographically distributed, sharding can act like a CDN

Good for keeping data close to processing

Reduces network traffic when MapReduce splitting takes place
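
A minimal sketch of sharding with a fan-out query follows. ShardedStore is a hypothetical class, each shard is a plain dictionary standing in for a separate server, and hashing the key is only one of several possible partitioning schemes.

```python
import hashlib


class ShardedStore:
    """Each dictionary in self.shards stands in for a separate server."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        # A lookup by key touches exactly one shard.
        return self._shard_for(key).get(key)

    def fan_out(self, predicate):
        # A query that is not by key must ask every shard and merge the answers.
        results = []
        for shard in self.shards:
            results.extend(value for value in shard.values() if predicate(value))
        return results


store = ShardedStore(shard_count=3)
people = {
    "user:1": {"name": "Ada", "city": "Oslo"},
    "user:2": {"name": "Grace", "city": "Lima"},
    "user:3": {"name": "Linus", "city": "Oslo"},
    "user:4": {"name": "Alan", "city": "Kyiv"},
}
for key, value in people.items():
    store.put(key, value)

in_oslo = sorted(person["name"] for person in store.fan_out(lambda p: p["city"] == "Oslo"))
```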

2.NoSQL Technology Breakdown

Agenda

Key-Value Stores

Wide-Column Stores

Document Stores

Demo CouchDB

Graph Databases

Key-Value "mechanics" present throughout

Key-Value Stores

The most common; not necessarily the most popular

Has rows, each with something like a big dictionary/associative array

Schema may differ from row to row

Common on Cloud platforms

e.g., Amazon SimpleDB, Azure Table Storage

MemcachedDB, Voldemort

DynamoDB (AWS), Dynomite, Redis and Riak
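
The 'rows that each hold a dictionary' idea can be sketched with plain Python dictionaries. The table variable, the put_row helper and the sample attributes are made up for illustration and do not correspond to any product's API.

```python
# The whole "table" is a mapping from row key to a dictionary of attributes.
table = {}


def put_row(key, **attributes):
    """Store a row; nothing about its attributes is declared in advance."""
    table[key] = dict(attributes)


put_row("customer:1", name="Ada", email="ada@example.com")
put_row("customer:2", name="Grace", phone="555-0100", vip=True)

# Each row carries its own set of attribute names.
columns = {key: sorted(row) for key, row in table.items()}
```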

Document Stores

Have 'databases', which are akin to tables

Have 'documents', akin to rows

Documents are typically JSON objects

Each document has properties and values

Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e., contained JSON objects, which allow for hierarchical storage)

Can have attachments as well

Old versions are retained

So Doc Stores work well for content management

Some view doc stores as specialized KV stores

Most popular with developers, startups, VCs

The biggies:

CouchDB

MongoDB
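
A small sketch of a document with nested values and retained revisions is shown below. DocumentDatabase, save and load are hypothetical names chosen for illustration; they are not the CouchDB or MongoDB API.

```python
import json


class DocumentDatabase:
    """Keeps every saved revision of each JSON document."""

    def __init__(self):
        self.revisions = {}  # document id -> list of JSON strings, oldest first

    def save(self, doc_id, document):
        self.revisions.setdefault(doc_id, []).append(json.dumps(document))
        return len(self.revisions[doc_id])  # revision number, starting at 1

    def load(self, doc_id, revision=None):
        history = self.revisions[doc_id]
        index = len(history) - 1 if revision is None else revision - 1
        return json.loads(history[index])


db = DocumentDatabase()
db.save("article-1", {
    "title": "Draft",
    "tags": ["nosql"],            # array value
    "author": {"name": "Ada"},    # sub-document
})
db.save("article-1", {
    "title": "Understanding NoSQL",
    "tags": ["nosql", "intro"],
    "author": {"name": "Ada"},
})

latest = db.load("article-1")
first = db.load("article-1", revision=1)  # the old version is still available
```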

Document Store Application Orientation

Documents can each be addressed by URIs

CouchDB supports full REST interface

Very geared towards JavaScript and JSON

Documents are JSON objects

CouchDB and MongoDB use JavaScript as their native language

In CouchDB, 'view functions' also have unique URIs, and 'show' functions can return HTML

So you can build applications in the database

Demo CouchDB

http://127.0.0.1:5984/pluralsight/_design/example/_view/dotNet

http://127.0.0.1:5984/pluralsight/_design/example/_view/dataacess

http://127.0.0.1:5984/pluralsight/_design/example/_show/showfunction/_id

Wide Column Stores

Has tables with declared column families

Each column family has "columns", which are KV pairs that can vary from row to row

These are the most foundational for large sites

BigTable(Google)

HBase (originally part of the Yahoo-dominated Hadoop project)

Cassandra (Facebook)

Calls column families "super columns" and tables "super column families"

They are the most "Big Data"-ready

Especially HBase + Hadoop
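
The structure described in this section, declared column families holding free-form columns, can be sketched with nested dictionaries. WideColumnTable and its methods are assumptions for illustration, not the BigTable, HBase or Cassandra API.

```python
class WideColumnTable:
    """Column families are declared up front; columns inside them are not."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        self.rows = {}  # row key -> {family -> {column -> value}}

    def put(self, row_key, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"undeclared column family: {family}")
        row = self.rows.setdefault(row_key, {})
        row.setdefault(family, {})[column] = value

    def get(self, row_key, family):
        return self.rows.get(row_key, {}).get(family, {})


users = WideColumnTable(["profile", "activity"])
users.put("u1", "profile", "name", "Ada")
users.put("u1", "activity", "last_login", "2012-03-01")
users.put("u2", "profile", "name", "Grace")
users.put("u2", "profile", "twitter", "@grace")  # a column only this row has

rejected = False
try:
    users.put("u1", "billing", "card", "1234")   # family was never declared
except KeyError:
    rejected = True
```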

Graph Databases

Great for social network applications and others where relationships are important

Nodes and edges

Edges are like joins

Nodes are like rows in a table

Nodes can also have properties and values

Neo4j is a popular graph db
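
A minimal sketch of nodes, edges and a two-hop traversal follows. The Graph class and the sample 'follows' data are assumptions for illustration; Neo4j has its own query language and API.

```python
class Graph:
    def __init__(self):
        self.nodes = {}  # node id -> dictionary of properties
        self.edges = []  # (source id, relationship, target id) triples

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, relationship, target):
        self.edges.append((source, relationship, target))

    def neighbors(self, node_id, relationship):
        """Follow one kind of edge outward from a node."""
        return [target for source, rel, target in self.edges
                if source == node_id and rel == relationship]


social = Graph()
for name in ("alice", "bob", "carol", "dave"):
    social.add_node(name, kind="person")
social.add_edge("alice", "follows", "bob")
social.add_edge("bob", "follows", "carol")
social.add_edge("bob", "follows", "dave")
social.add_edge("carol", "follows", "alice")

# Two hops: accounts followed by the people alice follows, excluding alice.
suggestions = sorted(
    {second
     for first in social.neighbors("alice", "follows")
     for second in social.neighbors(first, "follows")} - {"alice"}
)
```

In a relational database the same question would need a self-join on a followers table for each hop.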

3.Where is a NoSQL Killer App

Agenda

Content Management

Product Catalogs

Social

Big Data

Miscellaneous

Content Management

Document databases work really well here

Regular KV pairs can store metadata

Can also store text-based content

Attachments can store file-based or binary content

Versioning and URI addressability help as well

CouchDB gets called a 'Web database'

Database for Web apps

Database that can contain Web apps

Think Web sites, not Browser-based LOB applications

Think Evernote

Product Catalogs

Products in a catalog tend to have many attributes in common and then various others that are class-specific

Common

ProductID

Name

Description

Price

Class-Specific

Flavor, Color

Resolution, Clockspeed

Key Value Stores and Wide Column Stores work well here

KV Stores better when schema will change over time

Since nothing is declared

Social

Graph databases work best here

Great for tracking:

Networks

Followers

Group membership

Threaded interactions(comments,likes/favorites)

Great for Membership,Ownership

Avoids the self-joins and many-to-many tables necessary in relational DBs

Big Data

Wide Column and Key-Value stores work best here

MapReduce is designed for this scenario

Hadoop and HBase come up a lot

Sharding and append-only help here

Premise of analytics is reading data,not maintaining it

This is perfect for NoSQL

Aggregation, correlation and regression do not require formal schema or sophisticated query capabilities

Just need to read and perform mathematical operations on data really, really quickly

Miscellaneous

Event-driven data (i.e., logs)

User profiles, preferences

Mail, status message streams

Other Web data

Automobile directions

Info for sites on maps (category, name, description, lat/long, photo)

User reviews

Etc.

4.What Good is Relational

Agenda

Transactional

Formal Schema

Line of Business Applications

Declarative Query

Banded Reporting

Transactional

Business systems require atomic transactions

You can't process an order without decrementing inventory

You can't register a credit without its corresponding debit

No exceptions, no excuses
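
The order-and-inventory rule above can be sketched with Python's built-in sqlite3 module. The table names, the place_order function and the CHECK constraint are assumptions for illustration; any relational database with transactions behaves along the same lines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory ("
    "product TEXT PRIMARY KEY, quantity INTEGER CHECK (quantity >= 0))"
)
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, quantity INTEGER)"
)
conn.execute("INSERT INTO inventory VALUES ('widget', 5)")
conn.commit()


def place_order(product, quantity):
    """Record the order and decrement inventory as one atomic unit."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "INSERT INTO orders (product, quantity) VALUES (?, ?)",
                (product, quantity),
            )
            conn.execute(
                "UPDATE inventory SET quantity = quantity - ? WHERE product = ?",
                (quantity, product),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # stock would go negative, so the order row is undone too


first = place_order("widget", 3)   # enough stock
second = place_order("widget", 4)  # only 2 left, so the whole transaction fails

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
remaining = conn.execute(
    "SELECT quantity FROM inventory WHERE product = 'widget'"
).fetchone()[0]
```

The failed order leaves no trace: the order row inserted before the failing update is rolled back along with it.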

Formal Schema

Regular processes have regular data

Stocks, trades

PO line items

Personnel records

Insurance policies

These need relational databases with declared schema

These don't need MapReduce, document or graph representation

Line of Business Applications

Screen layouts and data binding require consistent schema

Data Transfer Objects have properties defined in code

You can't have strong typing without a schema

Object Relational Mapping

Object models are mapped to database schema

If the schema is not consistent then the mapping can't be either
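
The dependency of typed objects on a consistent schema can be sketched as follows. PurchaseOrderLine and to_dto are hypothetical names, and the strict field check stands in for what an ORM mapping would do.

```python
from dataclasses import dataclass, fields


@dataclass
class PurchaseOrderLine:
    """A Data Transfer Object: its properties are fixed in code."""
    order_id: int
    sku: str
    quantity: int


def to_dto(row):
    """Map a row (a dict) onto the DTO; fail if the row's shape differs."""
    expected = {f.name for f in fields(PurchaseOrderLine)}
    if set(row) != expected:
        raise ValueError(f"row does not match schema: {sorted(row)}")
    return PurchaseOrderLine(**row)


line = to_dto({"order_id": 7, "sku": "A-100", "quantity": 2})

# A row with a different shape cannot be mapped.
try:
    to_dto({"order_id": 8, "sku": "B-200", "colour": "red"})
    mapped = True
except ValueError:
    mapped = False
```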

Declarative Query

It's silly to write imperative code for each routine query

Makes ad hoc queries and reporting difficult

Lose out on engine optimization

Lose out on versatility

Imperative query works best when the range of queries is very small

Relational stored procedures do set precedent for pre-written queries, but they still don't iterate through data sets imperatively
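
To illustrate the contrast, the sketch below answers the same question twice: once with a hand-written loop and once with a declarative SQL statement run through Python's sqlite3 module. The sales table and its sample rows are made up for the example.

```python
import sqlite3

rows = [("east", 120.0), ("west", 80.0), ("east", 30.0), ("west", 45.5)]

# Imperative: the program spells out how to walk the data and accumulate totals.
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount
imperative = sorted(totals.items())

# Declarative: the query states what is wanted; the engine decides how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
declarative = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

Changing the question (say, an average per region) means editing one line of SQL, whereas the imperative version needs new accumulation logic.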

Banded Reporting

Operational reporting is based on detail and group sections with predictable, consistent layout, based on known schema

Very hard to design pixel-perfect reports against indeterminate schema

You can dump all columns/all rows, but that's generic

Forms are formal, by definition

This highlights how operational business processes almost always require relational databases

5.NoSQL and Microsoft

Agenda

Azure Table Storage

SQL Server/Azure XML Columns

SQL Azure Federations

Demo

OData

MongoDB on Azure

Hadoop on Azure/Windows

Demo

SQL Server "Beyond Relational"

SQL Server Parallel Data Warehouse

Azure Table Storage

Cloud-based Key-Value Store

Supports OData interface(more on that later)

Key-Value works nicely for general purpose storage and retrieval

SQL Server Data Services (precursor to SQL Azure) also implemented a Key-Value store

SQL Azure XML Columns

XML columns hold structured data that can differ between rows

Combining scalar and XML columns allows combination of static and dynamic schemas

XML schemas can still be declared

But you can have more than one

And it's not required

If motivation to use NoSQL is loose schema, then consider XML columns

To prove the point: Azure Dev Fabric's Table Storage is implemented with SQL Server Express and XML columns
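
The mix of static and dynamic schema can be sketched with a table that has ordinary scalar columns plus one column of XML text. SQLite has no XML column type, so this sketch parses the text in Python with xml.etree.ElementTree; it only approximates SQL Server's XML columns, and the products table and extras helper are assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# id, name and price form the static schema; extras varies from row to row.
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL, extras TEXT)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "Ice cream", 4.5, "<extras><flavor>vanilla</flavor></extras>"),
        (2, "Monitor", 199.0,
         "<extras><resolution>1920x1080</resolution><color>black</color></extras>"),
    ],
)


def extras(product_id):
    """Return the row-specific attributes stored in the XML text."""
    (xml_text,) = conn.execute(
        "SELECT extras FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    return {child.tag: child.text for child in ET.fromstring(xml_text)}


ice_cream = extras(1)
monitor = extras(2)
cheap = conn.execute("SELECT name FROM products WHERE price < 10").fetchall()
```

The scalar columns stay queryable with ordinary SQL while each row keeps its own class-specific attributes.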

SQL Azure Federations

Federations are the SQL Azure version of sharding

Just for partitioning,not for replication

Replication is automatic, implicit in SQL Azure

Federation Root (physical & logical db, defines F.Key)

F.Member (physical db - contains specific range of F.Key values)

F.Atomic Unit (AU - container for all data with same F.Key value)

F.Table

F.Members can be addressed by absolute name or relative key value

Allow online repartitioning

Offer ACID guarantees within F.Members and adopt Eventual Consistency between them

Multi-tenancy applications

Do not support fan-out query
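
The federation concepts listed above (members that each own a range of federation key values, atomic units, repartitioning) can be sketched in memory. The Federation class and its methods are hypothetical and do not reflect the actual SQL Azure syntax.

```python
from bisect import bisect_right


class Federation:
    """Members each hold a contiguous range of federation key values."""

    def __init__(self, boundaries):
        # Boundaries [100, 200] give three members: <100, 100-199, >=200.
        self.boundaries = sorted(boundaries)
        self.members = [dict() for _ in range(len(self.boundaries) + 1)]

    def member_for(self, key):
        """Address a member by key value rather than by name."""
        return bisect_right(self.boundaries, key)

    def insert(self, key, row):
        # All rows sharing one key value live together as an atomic unit.
        member = self.members[self.member_for(key)]
        member.setdefault(key, []).append(row)

    def atomic_unit(self, key):
        return self.members[self.member_for(key)].get(key, [])

    def split(self, boundary):
        """Repartition: split one member in two at a new boundary value."""
        index = self.member_for(boundary)
        old = self.members[index]
        low = {k: v for k, v in old.items() if k < boundary}
        high = {k: v for k, v in old.items() if k >= boundary}
        self.boundaries.insert(index, boundary)
        self.members[index:index + 1] = [low, high]


fed = Federation([100, 200])
fed.insert(42, "order A")
fed.insert(42, "order B")
fed.insert(120, "order C")
fed.insert(150, "order D")

before = fed.member_for(150)
fed.split(150)
after = fed.member_for(150)
```

Atomic units are never divided by a split, since the boundary falls between key values and all rows for one key move together.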

OData

RESTful API for data access, with rendering in XML or JSON

Clients for JavaScript, mobile platforms, .NET, Java

Works for feeds and updates

The following feature OData interfaces:

Azure Table Storage

SQL Server/Azure (via WCF Data Services)

Azure DataMarket

SQL Server Reporting Services (in 2008 R2, 2012)

SharePoint Lists (2010)

Netflix, eBay catalogs; TwitPic

IBM WebSphere eXtreme Scale REST data service

Pluralsight catalog!

Compare to JavaScript/JSON orientation of Document Stores
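
Because OData is plain HTTP, a query is just a URL with system query options such as $filter, $top and $format. The sketch below only builds such a URL; the service root, the Products entity set and the odata_url helper are assumptions for illustration, and no request is sent.

```python
from urllib.parse import quote, urlencode


def odata_url(service_root, entity_set, **options):
    """Build a query URL; each keyword becomes a $-prefixed query option."""
    query = urlencode(
        {f"${name}": value for name, value in options.items()},
        quote_via=quote,  # encode spaces as %20 rather than +
        safe="$",         # keep the $ prefix readable
    )
    return f"{service_root.rstrip('/')}/{entity_set}?{query}"


url = odata_url(
    "http://example.com/service.svc",  # hypothetical service root
    "Products",                        # hypothetical entity set
    filter="Price lt 20",
    top=5,
    format="json",
)
```

Any HTTP client, in any language, can then fetch the URL and parse the XML or JSON that comes back.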

Run MongoDB, others on Azure

Deploy to worker roles

Put databases in Azure Blob Storage; mount as drives (Azure Drive)

MongoDB Replica Set Azure wrapper supports this directly

Use from on-premise or cloud application code

Similar approach can be used for other NoSQL DBs

Hadoop on Azure/Windows

MS + Hortonworks have developed a Windows version of Hadoop

Currently in Community Technology Preview

Can use installer to create cluster

On-premises

On Azure

Can also use Hadoop On Azure

Provision entire cluster from Portal

Currently has 48-hour lifetime

Browser-based Hive console

Hive ODBC Driver

Use from Excel (with add-in)

Also use from PowerPivot, Analysis Services (2012 Tabular Mode), Reporting Services

SQL Server "Beyond Relational" Features

XML Columns(already discussed)

HierarchyId

Sparse columns (SQL Server-only)

Filestream (SQL Server-only)

Allow schema flexibility while retaining ACID guarantees

SQL Server Parallel Data Warehouse Edition (SQL PDWE)

Makes a cluster of SQL Server instances appear as one logical server

Uses MPP: Massively Parallel Processing

Compare to MapReduce

Supports SQL, so no imperative coding needed

Supports fan-out queries

Supported by most SQL Server clients

Available only as appliance

Has finely tuned processor, storage, networking internals

6.NoSQL, Relational or Both?

Agenda

Type of App

Productivity

Skill Sets and investment

Recommendations

Type of App

Really a question of consistency versus massive scale

Is this an internal system or a public one?

Is it an application for the data or data for a system?

Below a certain threshold of concurrent usage, NoSQL may be slower than relational

Productivity

NoSQL db tooling still immature

Queries require significant work and testing

Programming platforms, frameworks and components may support RDBMSes much more robustly

Especially enterprise platforms

If schema subject to frequent change then NoSQL may be more productive

Skill Sets and investment

Does your staff have RDBMS skills already?

Do you have significant investment in relational database hardware/software?

Lots of apps that use an RDBMS?

Do you want to retool?

Do you want to support both?

Are you a startup?

Employ developers who possess NoSQL skills and prefer NoSQL?

Does availability/scalability make RDBMS investment questions moot?

Recommendations

Large, public, content-centric properties: NoSQL

Internal LOB (line of business) supporting business operations: relational

Investment in RDBMS licenses, infrastructure, skills:

Relational

Use both (application-dependent)

Use Hybrid approaches

Productivity

Do cost-benefit analysis

How much extra dev time/$$?

What is cost of less scalable system?

It will be tempting to use one for the other

And it very well may work,but that doesn't make it right

(Editor: Li Datong)
