Apache Kudu Schema Design and Usage Limitations

The following sections describe known issues and limitations in Kudu, as of the current release.

Schema Design Limitations

Primary Key

The columns which make up the primary key must be listed first in the schema.

  • The columns which make up the primary key must be listed first in the schema.

  • Columns with DOUBLE, FLOAT, or BOOL types are not allowed as part of a primary key definition. Additionally, all columns that are part of a primary key definition must be NOT NULL.

  • The primary key of a row may not be modified using the UPDATE functionality. To modify a row’s primary key, the row must be deleted and re-inserted with the modified key. Such a modification is non-atomic.

  • Columns that are part of the primary key cannot be renamed. The primary key may not be changed after the table is created. You must drop and recreate a table to select a new primary key or rename key columns.

Valid Identifiers

Identifiers such as table and column names must be valid UTF-8 sequences and no longer than 256 bytes.

Number of Columns

By default, Kudu will not permit the creation of tables with more than 300 columns. We recommend schema designs that use fewer columns for best performance.

Size of Cells

No individual cell may be larger than 64KB. The cells making up a composite key are limited to a total of 16KB after the internal composite-key encoding done by Kudu. Inserting rows not conforming to these limitations will result in errors being returned to the client.

Size of Rows

Kudu was primarily designed for analytic use cases. Although individual cells may be up to 64KB, and Kudu supports up to 300 columns, it is recommended that no single row be larger than a few hundred KB. You are likely to encounter issues if a single row contains multiple kilobytes of data.

Non-alterable Partitioning

Kudu does not allow you to change how a table is partitioned after creation, with the exception of adding or dropping range partitions.

Non-alterable Column Types

Kudu does not allow the type or nullability of an existing column to be altered.

Partition Splitting

Partitions cannot be split or merged after table creation.

Dropping Columns and Tables

Dropping a column does not immediately reclaim space. Compaction must run first. There is no way to run compaction manually, but dropping the table will reclaim the space immediately.

If you are using Apache Impala (incubating) to query Kudu tables, refer the section on Impala Integration Limitations as well.

Partitioning Limitations

  • Tables must be manually pre-split into tablets using simple or compound primary keys. Automatic splitting is not yet possible. Range partitions may be added or dropped after a table has been created. See Apache Kudu Schema Design for more information.

  • Data in existing tables cannot currently be automatically repartitioned. As a workaround, create a new table with the new partitioning and insert the contents of the old table.

Replication and Backup Limitations

  • Kudu does not currently include any built-in features for backup and restore. Users are encouraged to use tools such as Spark or Impala to export or import tables as necessary.

Impala Integration Limitations

  • When creating a Kudu table, the CREATE TABLE statement must include the primary key columns before other columns, in primary key order.

  • Impala cannot update values in primary key columns.

  • Impala cannot create Kudu tables with TIMESTAMP, DECIMAL, VARCHAR, or nested-typed columns.

  • Kudu tables with a name containing upper case or non-ASCII characters must be assigned an alternate name when used as an external table in Impala.

  • Kudu tables with a column name containing upper case or non-ASCII characters may not be used as an external table in Impala. Non-primary key columns may be renamed in Kudu to work around this issue.

  • Kudu tables containing UNIXTIME_MICROS-typed columns may not be used as an external table in Impala.

  • NULL, NOT NULL, !=, and LIKE predicates are not pushed to Kudu, and instead will be evaluated by the Impala scan node. This may decrease performance relative to other types of predicates.

  • Updates, inserts, and deletes using Impala are non-transactional. If a query fails part of the way through, its partial effects will not be rolled back.

  • The maximum parallelism of a single query is limited to the number of tablets in a table. For good analytic performance, aim for 10 or more tablets per host or use large tables.

Impala Keywords Not Supported for Creating Kudu Tables

  • PARTITIONED
  • LOCATION
  • ROWFORMAT

Spark Integration Limitations

  • Kudu tables with a name containing upper case or non-ASCII characters must be assigned an alternate name when registered as a temporary table.

  • Kudu tables with a column name containing upper case or non-ASCII characters may not be used with SparkSQL. Non-primary key columns may be renamed in Kudu to work around this issue.

  • NULL, NOT NULL, <>, OR, LIKE, and IN predicates are not pushed to Kudu, and instead will be evaluated by the Spark task.

  • Kudu does not support all types supported by Spark SQL, such as Date, Decimal and complex types.

Security Limitations

  • Authentication and authorization features are not implemented.

  • Data encryption is not built in. Kudu has been reported to run correctly on systems using local block device encryption (e.g. dmcrypt).

Other Known Issues

The following are known bugs and issues with the current release of Kudu. They will be addressed in later releases. Note that this list is not exhaustive, and is meant to communicate only the most important known issues.

  • If the Kudu master is configured with the -log_force_fsync_all option, tablet servers and clients will experience frequent timeouts, and the cluster may become unusable.

  • If a tablet server has a very large number of tablets, it may take several minutes to start up. It is recommended to limit the number of tablets per server to 100 or fewer. Consider this limitation when pre-splitting your tables. If you notice slow start-up times, you can monitor the number of tablets per server in the web UI.