This worklog has been replaced with mariadb.org/jira

This site is here for historical purposes only. Do not add or edit tasks here!

 
 
 

WorkLog Frontpage Log in / Register
High-Level Description | Task Dependencies | High-Level Specification | Low-Level Design | File Attachments | User Comments | Time Estimates | Funding and Votes | Progress Reports

 Add support for dynamic columns
Title
Task ID34
Queue
Version
Status
Priority
Copies toSergei

Created byPsergey21 Jul 2009Done
Supervisor   
Lead Architect    
Architecture Review  
Implementor  
Code Review  
QA  
Documentation  
 High-Level Description
Add support for dynamic columns:

- A column that can hold information from many columns 
- One can instantly add or remove column data

This is a useful feature for any store type of application, where you want to
store different type of information for different kind of items.

For example, for shoes you want to store: material, size, colour, maker
For a computer you want to store ram, hard disk size etc...

In a normal 'relational' system you would need to a table for each type.
With dynamic columns you have all common items as fixed fields (like
product_code, manufacturer, price) and the rest stored in a dynamic column.

The proposed idea is to store the dynamic information in a blob in
Google Protocol Buffers (further GPB) format and use SQL constructs to
extract parts of GPB data for use in select list, for filtering, and so forth.

Any support for indexing GPB data is outside of scope of this WL entry.

Future ideas:
- Allow indexing with name instead of numbers. When this is done we can drop the
type as part of column_get()
- Allow indexing on dynamic fields.


Syntax examples:

Creating table (no change from before, 'extra' is used for dynamic column data):
CREATE TABLE item (ID int auto_increment primary_key, Type_id int, Price
decimal(7,2), Country_id int, Manufacturer_id int, extra column_blob);

-------------

Creating a column with many fields:
Syntax: COLUMN_CREATE(column_nr, value, [column_nr,value]...)

INSERT into item (NULL, 1 /* T-shirt */, 10, 1 / * Germany /, 1 /* Nike /,
COLUMN_CREATE(1 /* color /, “Blue”, 2 /* Size */, “M”));

-------------

Updating a dynamic column:
Syntax: COLUMN_ADD(blob,column_nr, value, column_nr,value]...)

UPDATE item SET extra=COLUMN_ADD(extra, 6 /* Memory */, 2048) WHERE id=2;

If the column already exists, it will be overwritten.

-------------

Deleting a dynamic column (if it exists):
Syntax: COLUMN_DELETE(blob, column_nr, column_nr...);

UPDATE item SET extra=COLUMN_DELETE(extra, 6) WHERE id=2;

-----------------

Querying a dynamic column:

Syntax: COLUMN_EXISTS(blob, column_nr);

SELECT * from item where COLUMN_EXISTS(extra,4);

-----------------

Querying which columns exists:

Syntax: COLUMN_LIST(blob, column_nr);

SELECT COLUMN_LIST(extra) FROM item WHERE id=1;
→ “1,2”

------------------
Retrieving a dynamic column:

Syntax: COLUMN_GET(string, column_nr AS type);

SELECT id, COLUMN_GET(extra, 1 /* color*/ AS char) from item;

This will return NULL if the column don't exists
 Task Dependencies
Others waiting for Task 34Task 34 is waiting forGraph
 
 High-Level Specification
1. GPB Encoding overview
2. GPB in an SQL database
3. Encoding to use for dynamic columns
4. How to store and access data in a protocol buffer from SQL
5. Extensions for the future

1. GPB Encoding overview
========================

GBB is a compact encoding for structured and typed data. A unit of GPB data
(it is called message) is only partially self-describing: it's possible to 
iterate over its parts, but, quoting the spec

http://code.google.com/apis/protocolbuffers/docs/encoding.html:
  " the name and declared type for each field can only be determined on the
  decoding end by referencing the message type's definition (i.e. the .proto 
  file). "

2. GPB in an SQL database
=========================

It is possible to store GPB data in MariaDB today - one can declare a binary
blob column and use it to store GPB messages. Storing and retrieving entire
messages will be the only available operations, though, as the server has no
idea about the GPB format. 
It is apparent that ability to peek inside GPB data from SQL layer would be of
great advantage: one would be able to 
- select only certain fields or parts of GPB messages
- filter records based on the values of GPB fields
- etc
performing such operations at SQL layer will allow to reduce client<->server 
traffic right away, and will open path to getting the best possible
performance.

3. Encoding to use for dynamic columns
======================================

The data should be coded into the proto buffer in the following format:

<field_number><value_type><value>[<field_number><value_type><value>...]

Where field_number is a number between 0-65536 that identifes the field
<value_type> is a enum of type 'Item_result'
<value> is the value coded in proto format.

In other words, we should have no nested or complex structure.

4. How to store and access data in a protocol buffer from SQL
============================================================

User-friendly/meaningful access to GPB fields requires knowledge of
GPB field names and types, which are not available from GPB message
itself (see "GPB encoding overview" section).

To make things easy for the user, we will at first stage provide SQL
functions to manipulate a string that is actually in proto format.

The functions we should provde are:

proto_get(gpb, field_number, type)

This return the field tagged with 'field_number' from the 'gpb' buffer.

Example:  proto_get(blob, 1, varchar)     -> Returns field number 1 as varchar

proto_put(gpb, field_number, value)

This returns a new gbp buffer with the new value appended.

Example: proto_put(proto_put(blob, 1, 1), 2, "hello")

5. Extension for future
=======================

In the future we may want to access data based on name and get MariaDB to
automaticly know the correct type.  To do this we need to be able to
store a definition for the content of the proto buffer somewhere.

DecisionToMake: How to pass the server the GPB definition?
First idea: add a CREATE TABLE parameter which will specify the
definition itself.
 Low-Level Design
* We should have both server-side support and client-side support (client side
  means functions in libmysqlclient so that user can select the full BLOB and
  extract fields in the application).

The current proposal for storage is:

Header: <flag><number_of_columns>
Sorted index: <column_nr><offset><column_nr><offset> 

The flag can be used to distinguish between versions, the length of offset (2 or
3 bytes), storage format and other things.

Each column_nr will take 2 bytes and offset 2 or 3 bytes (depending on length of
total string). This is to allow to do a fast binary search to check if a column
exists.

Each column is stored as:
<character_set if string><data>

We don't have to store lengths as this can be calculated from the offsets.
Numbers will be stored without pre-zero.

API for library used for dynamic columns:

int dynamic_column_create(DYNAMIC_COLUMN *str,
                          uint column_nr,
                          DYNAMIC_COLUMN_VALUE *value);

int dynamic_column_create_many(DYNAMIC_COLUMN *str,
                               uint column_count,
                               uint *column_numbers,
                               DYNAMIC_COLUMN_VALUE *values);

int dynamic_column_update(DYNAMIC_COLUMN *org, uint column_nr,
                          DYNAMIC_COLUMN_VALUE *value);
int dynamic_column_update_many(DYNAMIC_COLUMN *str,
                               uint add_column_count,
                               uint *column_numbers,
                               DYNAMIC_COLUMN_VALUE *values);

int dynamic_column_delete(DYNAMIC_COLUMN *org, uint column_nr);

/* test existance of non-NULL column */
int dynamic_column_exists(DYNAMIC_COLUMN *org, uint column_nr);

/* List of not NULL columns */
int dynamic_column_list(DYNAMIC_COLUMN *org, DYNAMIC_ARRAY *array_of_uint);

/*
   if the column do not exists it is NULL
*/
int dynamic_column_get(DYNAMIC_COLUMN *org, uint column_nr,
                       DYNAMIC_COLUMN_VALUE *store_it_here);

#define dynamic_column_column_free(V) dynstr_free(V)

#define dynamic_column_value_init(V) (V)->type= DYN_COL_NULL

Structure definitions:

#define ER_DYNCOL_YES 1
/* NO and OK is the same used just to show semantics */
#define ER_DYNCOL_NO 0
#define ER_DYNCOL_OK 0
/* Wrong format of the string with encoded columns */
#define ER_DYNCOL_FORMAT -1
/* Some limit reached */
#define ER_DYNCOL_LIMIT -2
/* Out of resourses */
#define ER_DYNCOL_RESOURCE -3
/* Incorrect input data */
#define ER_DYNCOL_DATA -3

typedef DYNAMIC_STRING DYNAMIC_COLUMN;

enum enum_dynamic_column_type
{
  DYN_COL_NULL= 0,
  DYN_COL_INT,
  DYN_COL_UINT,
  DYN_COL_DOUBLE,
  DYN_COL_STRING,
  DYN_COL_DECIMAL,
  DYN_COL_DATETIME,
  DYN_COL_DATE,
  DYN_COL_TIME
};

typedef enum enum_dynamic_column_type DYNAMIC_COLUMN_TYPE;

struct st_dynamic_column_value
{
  DYNAMIC_COLUMN_TYPE type;
  union
  {
    long long long_value;
    unsigned long long ulong_value;
    double double_value;
    struct {
      LEX_STRING string_value;
      CHARSET_INFO *charset;
    };
    struct {
      decimal_digit_t decimal_buffer[DECIMAL_BUFF_LENGTH];
      decimal_t decimal_value;
    };
    MYSQL_TIME time_value;
  };
};

Ideas:
- Use linear search instead of string search with small set of columns
  - In this case we can store lengths instead of offsets, to speed up
modifications of the array.
- Store type as part of offset (3 bit). As this will limit us to 8K byte offset,
this means we must at the same time support 2 and 3 byte offsets.

DATA ENCODING

NULL - No field mentioned

Length of the field will be get from offsets of two fields.

Decimal and string values will has numeric unsigned prefix coded as variable
length unsigned int. For decimal it will be point position for string number of
collation.
 File Attachments
 NameTypeSizeByDate
 User Comments
(Rspadim - 2011-08-14 07:01:50
    hi, nice work! now, some ideas... i'm using today JSON inside blob columns to simulate dynamic columns... maybe we could implement something for JSON_dynamic_columns what you think? the diferences: binary format of mariadb is less space consuming (no problem, today i'm using JSON and space isn't a problem for me) JSON allow array, and objects, in PHP language it's a array with a string key $array['some_key_in_String_format']='some_value_maybeanother_array', $array['firstkey']['secondkey]='some value inside the second array'; in this case we should allow something like key "[firstkey][secondkey]" as a key, allowing to access value from first key and after value from second key well these are only examples, maybe we could use php serialize function, or anyother, just some ideas.... maybe at mariadb implemented functions we could allow a "format" parameter, to help on slecting the type of value at changing, deleting and reading dynamic column value for json maybe we should use some standard formats like, for integers-bigint, for floats-double, for strings - blob/longblob/varbinary(255)/other? for true/false-enum('y','n')? for array or other not sql compatible type - NULL thanks guys! good job!
 Time Estimates
NameHours WorkedLast Updated
Total0 
 Hrs WorkedProgressCurrentOriginal
This Task0160160
Total0160160
 
 Funding and Votes
Votes: 1: 33%
 Change vote: Useless    Nice to have    Important    Very important    

Funding: 0 offers, total 0 Euro
 Progress Reports
(Monty - Thu, 30 Jun 2011, 10:58
    
Status updated.
--- /tmp/wklog.34.old.16237	2011-06-30 10:58:17.000000000 +0000
+++ /tmp/wklog.34.new.16237	2011-06-30 10:58:17.000000000 +0000
@@ -1,2 +1,2 @@
-In-Progress
+Complete
 

(Sanja - Mon, 18 Apr 2011, 11:44
    
High Level Description modified.
--- /tmp/wklog.34.old.5715	2011-04-18 11:44:35.000000000 +0000
+++ /tmp/wklog.34.new.5715	2011-04-18 11:44:35.000000000 +0000
@@ -75,9 +75,9 @@
 ------------------
 Retrieving a dynamic column:
 
-Syntax: COLUMN_GET(column_nr FROM blob AS type);
+Syntax: COLUMN_GET(string, column_nr AS type);
 
-SELECT id, COLUMN_GET(1 /* color*/ FROM extra AS char(255)) from item;
+SELECT id, COLUMN_GET(extra, 1 /* color*/ AS char) from item;
 
 This will return NULL if the column don't exists
 

(Sanja - Wed, 13 Apr 2011, 19:08
    
Low Level Design modified.
--- /tmp/wklog.34.old.25881	2011-04-13 19:08:53.000000000 +0000
+++ /tmp/wklog.34.new.25881	2011-04-13 19:08:53.000000000 +0000
@@ -22,35 +22,60 @@
 
 API for library used for dynamic columns:
 
-typedef DYNAMIC_STRING DYNAMIC_COLUMN;
+int dynamic_column_create(DYNAMIC_COLUMN *str,
+                          uint column_nr,
+                          DYNAMIC_COLUMN_VALUE *value);
+
+int dynamic_column_create_many(DYNAMIC_COLUMN *str,
+                               uint column_count,
+                               uint *column_numbers,
+                               DYNAMIC_COLUMN_VALUE *values);
+
+int dynamic_column_update(DYNAMIC_COLUMN *org, uint column_nr,
+                          DYNAMIC_COLUMN_VALUE *value);
+int dynamic_column_update_many(DYNAMIC_COLUMN *str,
+                               uint add_column_count,
+                               uint *column_numbers,
+                               DYNAMIC_COLUMN_VALUE *values);
 
-my_bool dynamic_column_create(DYNAMIC_COLUMN *value, int column_nr,
-DYNAMIC_COLUMN_VALUE *value);
+int dynamic_column_delete(DYNAMIC_COLUMN *org, uint column_nr);
 
-my_bool dynamic_column_create_many(DYNAMIC_COLUMN *value, uint column_count, int
-*column_numbers, DYNAMIC_COLUMN_VALUE *values);
+/* test existance of non-NULL column */
+int dynamic_column_exists(DYNAMIC_COLUMN *org, uint column_nr);
 
-my_bool dynamic_column_update(DYNAMIC_COLUMN *org, int column_nr,
-DYNAMIC_COLUMN_VALUE *value);
+/* List of not NULL columns */
+int dynamic_column_list(DYNAMIC_COLUMN *org, DYNAMIC_ARRAY *array_of_uint);
 
-my_bool dynamic_column_delete(DYNAMIC_COLUMN *org, int column_nr);
+/*
+   if the column do not exists it is NULL
+*/
+int dynamic_column_get(DYNAMIC_COLUMN *org, uint column_nr,
+                       DYNAMIC_COLUMN_VALUE *store_it_here);
 
-my_bool dynamic_column_exists(DYNAMIC_COLUMN *org, int column_nr);
-- Returns 1 if value exists
+#define dynamic_column_column_free(V) dynstr_free(V)
 
-/* List of not NULL columns */
-my_bool dynamic_column_list(DYNAMIC_COLUMN *org, DYNAMIC_ARRAY *array_of_int);
+#define dynamic_column_value_init(V) (V)->type= DYN_COL_NULL
 
-my_bool dynamic_column_get(DYNAMIC_COLUMN *org, int column_nr,
-DYNAMIC_COLUMN_VALUE *store_it_here);
-Returns 1 if column number didn't exists.
+Structure definitions:
 
-void dynamic_column_free(DYNAMIC_COLUMN *value);
+#define ER_DYNCOL_YES 1
+/* NO and OK is the same used just to show semantics */
+#define ER_DYNCOL_NO 0
+#define ER_DYNCOL_OK 0
+/* Wrong format of the string with encoded columns */
+#define ER_DYNCOL_FORMAT -1
+/* Some limit reached */
+#define ER_DYNCOL_LIMIT -2
+/* Out of resourses */
+#define ER_DYNCOL_RESOURCE -3
+/* Incorrect input data */
+#define ER_DYNCOL_DATA -3
 
-Structure definitions:
+typedef DYNAMIC_STRING DYNAMIC_COLUMN;
 
-enum DYNAMIC_COLUMN_TYPE
+enum enum_dynamic_column_type
 {
+  DYN_COL_NULL= 0,
   DYN_COL_INT,
   DYN_COL_UINT,
   DYN_COL_DOUBLE,
@@ -58,24 +83,29 @@
   DYN_COL_DECIMAL,
   DYN_COL_DATETIME,
   DYN_COL_DATE,
-  DYN_COL_TIME, 
-  DYN_COL_NULL  
+  DYN_COL_TIME
 };
 
-struct DYNAMIC_COLUMN_VALUE
+typedef enum enum_dynamic_column_type DYNAMIC_COLUMN_TYPE;
+
+struct st_dynamic_column_value
 {
-  enum DYNAMIC_COLUMN_TYPE type;
+  DYNAMIC_COLUMN_TYPE type;
   union
   {
-    longlong long_value;
+    long long long_value;
+    unsigned long long ulong_value;
     double   double_value;
-    {
+    struct {
       LEX_STRING string_value;
       CHARSET_INFO *charset;
-    }
+    };
+    struct {
+      decimal_digit_t decimal_buffer[DECIMAL_BUFF_LENGTH];
     decimal_t decimal_value;
+    };
     MYSQL_TIME time_value;
-  }
+  };
 };
 
 Ideas:

(Monty - Sat, 12 Mar 2011, 10:45
    
Category updated.
--- /tmp/wklog.34.old.26161	2011-03-12 10:45:13.000000000 +0000
+++ /tmp/wklog.34.new.26161	2011-03-12 10:45:13.000000000 +0000
@@ -1,2 +1,2 @@
-Server-BackLog
+Server-Sprint
 

(Monty - Sat, 12 Mar 2011, 10:45
    
Implementor updated:  -> Sanja

(Sanja - Mon, 07 Mar 2011, 08:57
    
Low Level Design modified.
--- /tmp/wklog.34.old.4265	2011-03-07 08:57:57.000000000 +0000
+++ /tmp/wklog.34.new.4265	2011-03-07 08:57:57.000000000 +0000
@@ -38,6 +38,7 @@
 my_bool dynamic_column_exists(DYNAMIC_COLUMN *org, int column_nr);
 - Returns 1 if value exists
 
+/* List of not NULL columns */
 my_bool dynamic_column_list(DYNAMIC_COLUMN *org, DYNAMIC_ARRAY *array_of_int);
 
 my_bool dynamic_column_get(DYNAMIC_COLUMN *org, int column_nr,
@@ -57,7 +58,8 @@
   DYN_COL_DECIMAL,
   DYN_COL_DATETIME,
   DYN_COL_DATE,
-  DYN_COL_TIME   
+  DYN_COL_TIME, 
+  DYN_COL_NULL  
 };
 
 struct DYNAMIC_COLUMN_VALUE
@@ -83,27 +85,14 @@
 - Store type as part of offset (3 bit). As this will limit us to 8K byte offset,
 this means we must at the same time support 2 and 3 byte offsets.
 
-DATA ENCODING:
+DATA ENCODING
 
-  NULL:
-    a) No field mentioned
-    b) Field mentioned but offset is 0 (or other impossible value)
-
-  For encoding unsigned int we could use method from protobuff - variable length
-integers. where first bit of 8 set to 1 if next byte belong to this number,
-other 7 bit used for number itself. Could be used for storing (decreasing
-probability):
-  1) string length
-  2) unsigned int
-  3) parts of decimal numbers
-  For encoding signed numbers it could be coded like:
-  0 -> 0
-  1 -> 1
- -1 -> 2
-  2 -> 3
- -2 -> 4
-  ...
-  and then as unsigned described above (it is also taken from protobuff). Could
-be used for storing sugned integers, signed decimals.
+NULL - No field mentioned
+
+Length of the field will be get from offsets of two fields.
+
+Decimal and string values will has numeric unsigned prefix coded as variable
+length unsigned int. For decimal it will be point position for string number of
+collation.
 
 

(Sanja - Mon, 07 Mar 2011, 08:38
    
Title modified.
--- /tmp/wklog.34.old.3370	2011-03-07 08:38:51.000000000 +0000
+++ /tmp/wklog.34.new.3370	2011-03-07 08:38:51.000000000 +0000
@@ -1,2 +1,2 @@
-Add support for dynamic columns (via google protocol buffers)
+Add support for dynamic columns
 

(Sanja - Fri, 25 Feb 2011, 09:19
    
Low Level Design modified.
No change.

(Sanja - Fri, 25 Feb 2011, 09:18
    
Low Level Design modified.
--- /tmp/wklog.34.old.22411	2011-02-25 09:18:52.000000000 +0000
+++ /tmp/wklog.34.new.22411	2011-02-25 09:18:52.000000000 +0000
@@ -1,4 +1,3 @@
-
 * We should have both server-side support and client-side support (client side
   means functions in libmysqlclient so that user can select the full BLOB and
   extract fields in the application).
@@ -84,4 +83,27 @@
 - Store type as part of offset (3 bit). As this will limit us to 8K byte offset,
 this means we must at the same time support 2 and 3 byte offsets.
 
+DATA ENCODING:
+
+  NULL:
+    a) No field mentioned
+    b) Field mentioned but offset is 0 (or other impossible value)
+
+  For encoding unsigned int we could use method from protobuff - variable length
+integers. where first bit of 8 set to 1 if next byte belong to this number,
+other 7 bit used for number itself. Could be used for storing (decreasing
+probability):
+  1) string length
+  2) unsigned int
+  3) parts of decimal numbers
+  For encoding signed numbers it could be coded like:
+  0 -> 0
+  1 -> 1
+ -1 -> 2
+  2 -> 3
+ -2 -> 4
+  ...
+  and then as unsigned described above (it is also taken from protobuff). Could
+be used for storing sugned integers, signed decimals.
+
 

(Sergei - Tue, 22 Feb 2011, 17:36
    
Observers changed: Sergei
-- View All Progress Notes (50 total) --


Report Generator:
 
Saved Reports:

WorkLog v4.0.0
  © 2010  Sergei Golubchik and Monty Program AB
  © 2004  Andrew Sweger <yDNA@perlocity.org> and Addnorya
  © 2003  Matt Wagner <matt@mysql.com> and MySQL AB