Apache Kudu Release Notes

Introducing Apache Kudu

If you are new to Kudu, check out the list of benefits and features.

Resources

Installation Options

  • A Quickstart VM is provided to get you up and running quickly.
  • You can install parcels or packages in clusters managed by Cloudera Manager, or packages in standalone CDH clusters.
  • You can build Kudu from source.

See Installing and Upgrading Apache Kudu for full details.

Kudu 1.2.x Release Notes

New Features in Kudu 1.2.0

See also Issues resolved for Kudu 1.2.0 and Git changes between 1.1.x and 1.2.x.

New Features
  • Kudu clients and servers now redact user data such as cell values from log messages, Java exception messages, and Status strings. User metadata such as table names, column names, and partition bounds are not redacted.
  • Kudu's ability to provide consistency guarantees has been substantially improved:
    • Replicas now correctly track their "safe timestamp". This timestamp is the maximum timestamp at which reads are guaranteed to be repeatable.
    • A scan created using the READ_AT_SNAPSHOT mode will now either wait for the requested snapshot to be "safe" at the replica being scanned, or be re-routed to a replica where the requested snapshot is "safe". This ensures that all such scans are repeatable.
    • Kudu Tablet Servers now properly retain historical data when a row with a given primary key is inserted and deleted, followed by the insertion of a new row with the same key. Previous versions of Kudu would not retain history in such situations. This allows the server to return correct results for snapshot scans with a timestamp in the past, even in the presence of such "reinsertion" scenarios.
    • The Kudu clients now automatically retain the timestamp of their latest successful read or write operation. Scans using the READ_AT_SNAPSHOT mode without a client-provided timestamp automatically assign a timestamp higher than the timestamp of their most recent write. Writes also propagate the timestamp, ensuring that sequences of operations with causal dependencies between them are assigned increasing timestamps. Together, these changes allow clients to achieve read-your-writes consistency, and also ensure that snapshot scans performed by other clients return causally-consistent results.
  • User data in log files is now redacted by default.
  • Kudu servers now automatically limit the number of log files being stored. By default, 10 log files will be retained at each severity level.
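
The timestamp propagation described above can be illustrated with a toy model. This is only a sketch of the idea, not the Kudu client implementation: the class `ClientTimestampTracker` and its methods are hypothetical names, and timestamps are modeled as plain integers.

```python
class ClientTimestampTracker:
    """Toy model of a client that remembers the latest timestamp it has seen."""

    def __init__(self):
        self.last_observed = 0

    def observe(self, server_timestamp):
        # Keep the maximum timestamp returned by any successful read or write.
        self.last_observed = max(self.last_observed, server_timestamp)

    def snapshot_timestamp(self):
        # A snapshot scan without an explicit timestamp picks one higher than
        # anything this client has already seen, so it can see its own writes.
        return self.last_observed + 1


tracker = ClientTimestampTracker()
tracker.observe(100)  # a write acknowledged at timestamp 100
tracker.observe(90)   # an older response must not move the tracker backwards
scan_ts = tracker.snapshot_timestamp()
```

Because the tracked value only moves forward, a scan issued after a write is always assigned a later timestamp than that write, which is the read-your-writes property the notes describe.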
Optimizations and Improvements
  • The logging in the Java and C++ clients has been substantially quieted. Clients no longer log messages in normal operation unless there is some kind of error.
  • The C++ client now includes a KuduSession::SetErrorBufferSpace API which can limit the amount of memory used to buffer errors from asynchronous operations.
  • The Java client now fetches tablet locations from the Kudu Master in batches of 1000, increased from batches of 10 in prior versions. This can substantially improve the performance of Spark and Impala queries running against Kudu tables with large numbers of tablets.
  • Table metadata lock contention in the Kudu Master was substantially reduced. This improves the performance of tablet location lookups on large clusters with a high degree of concurrency.
  • Lock contention in the Kudu Tablet Server during high-concurrency write workloads was also reduced. This can reduce CPU consumption and improve performance when a large number of concurrent clients are writing to a smaller number of servers.
  • Lock contention when writing log messages has been substantially reduced. This source of contention could cause high tail latencies on requests, and when under high load could contribute to cluster instability such as election storms and request timeouts.
  • The BITSHUFFLE column encoding has been optimized to use the AVX2 instruction set present on processors including Intel(R) Haswell and later. Scans on BITSHUFFLE-encoded columns are now up to 30% faster.
  • The kudu tool now accepts hyphens as an alternative to underscores when specifying actions. For example, kudu local-replica copy-from-remote may be used as an alternative to kudu local_replica copy_from_remote.
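
The hyphen and underscore equivalence for action names can be pictured as a simple normalization step. The helper below is a hypothetical illustration of that idea and is not taken from the kudu tool's source; it is meant for the action words only, not for flags or their values.

```python
def normalize_action(words):
    """Map hyphenated action words to their underscore spelling."""
    return [word.replace("-", "_") for word in words]


hyphenated = normalize_action(["local-replica", "copy-from-remote"])
underscored = normalize_action(["local_replica", "copy_from_remote"])
```

Both spellings normalize to the same list, so either form selects the same action.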

Issues Fixed in Kudu 1.2.0

See Issues resolved for Kudu 1.2.0 and Git changes between 1.1.x and 1.2.x.

  • KUDU-1508 - Fixed a long-standing issue in which running Kudu on ext4 file systems could cause file system corruption.
  • KUDU-1399 - Implemented an LRU cache for open files, which prevents running out of file descriptors on long-lived Kudu clusters. By default, Kudu will limit its file descriptor usage to half of its configured ulimit.
  • Gerrit #5192 - Fixed an issue which caused data corruption and crashes in the case that a table had a non-composite (single-column) primary key, and that column was specified to use DICT_ENCODING or BITSHUFFLE encodings. If a table with an affected schema was written in previous versions of Kudu, the corruption will not be automatically repaired; users are encouraged to re-insert such tables after upgrading to Kudu 1.2 or later.
  • Gerrit #5541 - Fixed a bug in the Spark KuduRDD implementation which could cause rows in the result set to be silently skipped in some cases.
  • KUDU-1551 - Fixed an issue in which the tablet server would crash on restart in the case that it had previously crashed during the process of allocating a new WAL segment.
  • KUDU-1764 - Fixed an issue where Kudu servers would leak approximately 16-32MB of disk space for every 10GB of data written to disk. After upgrading to Kudu 1.2 or later, any disk space leaked in previous versions will be automatically recovered on startup.
  • KUDU-1750 - Fixed an issue where the API to drop a range partition would drop any partition with a matching lower _or_ upper bound, rather than any partition with matching lower _and_ upper bound.
  • KUDU-1766 - Fixed an issue in the Java client where equality predicates which compared an integer column to its maximum possible value (e.g. Integer.MAX_VALUE) would return incorrect results.
  • KUDU-1780 - Fixed the kudu-client Java artifact to properly shade classes in the com.google.thirdparty namespace. The lack of proper shading in prior releases could cause conflicts with certain versions of Google Guava.
  • Gerrit #5327 - Fixed shading issues in the kudu-flume-sink Java artifact. The sink now expects that Hadoop dependencies are provided by Flume, and properly shades the Kudu client's dependencies.
  • Fixed a few issues using the Python client library from Python 3.
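
The difference behind KUDU-1750 is easiest to see on a small example. The sketch below models range partitions as (lower, upper) tuples; the function `partitions_to_drop` and its `require_both` switch are hypothetical and only contrast the two matching rules described above.

```python
def partitions_to_drop(partitions, lower, upper, require_both=True):
    """Return the (lower, upper) partitions selected by a drop request."""
    if require_both:
        # Fixed behavior: both bounds must match.
        return [p for p in partitions if p[0] == lower and p[1] == upper]
    # Pre-fix behavior: a match on either bound was enough.
    return [p for p in partitions if p[0] == lower or p[1] == upper]


partitions = [(0, 10), (10, 20), (20, 30)]
fixed = partitions_to_drop(partitions, 10, 30)
buggy = partitions_to_drop(partitions, 10, 30, require_both=False)
```

With the request (10, 30), the fixed rule selects nothing because no partition has exactly those bounds, whereas the old rule would have selected both (10, 20) and (20, 30).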

Incompatible Changes in Kudu 1.2.0

Apache Kudu 1.2.0 introduces the following incompatible changes:

  • The replication factor of tables is now limited to a maximum of 7. In addition, tables can no longer be created with an even replication factor.
  • The GROUP_VARINT encoding is now deprecated. Kudu servers have never supported this encoding, and now the client-side constant has been deprecated to match the server's capabilities.
  • Client Library Compatibility
    • The Kudu 1.2 Java client is API- and ABI-compatible with Kudu 1.1. Applications written against Kudu 1.1 will compile and run against the Kudu 1.2 client and vice-versa.
    • The Kudu 1.2 C++ client is API- and ABI-forward-compatible with Kudu 1.1. Applications written and compiled against the Kudu 1.1 client will run without modification against the Kudu 1.2 client. Applications written and compiled against the Kudu 1.2 client will run without modification against the Kudu 1.1 client unless they use one of the following new APIs:
      • kudu::DisableSaslInitialization()
      • KuduSession::SetErrorBufferSpace(...)
    • The Kudu 1.2 Python client is API-compatible with Kudu 1.1. Applications written against Kudu 1.1 will continue to run against the Kudu 1.2 client and vice-versa.
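
The replication factor rule listed above (odd, and at most 7) can be checked before a table is created. This is a minimal sketch; the function name is hypothetical and the lower bound of 1 is an assumption, since the notes state only the maximum and the odd-number requirement.

```python
MAX_REPLICATION_FACTOR = 7  # maximum stated in the notes above


def is_valid_replication_factor(n):
    """Return True if n is odd and no larger than the stated maximum."""
    # The lower bound of 1 is an assumption made for this sketch.
    return 1 <= n <= MAX_REPLICATION_FACTOR and n % 2 == 1


valid = [n for n in range(1, 10) if is_valid_replication_factor(n)]
```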

Known Issues and Limitations in Kudu 1.2.0

Schema and Usage Limitations

  • Primary Key
    • Columns that are part of the primary key cannot be renamed. The primary key may not be changed after the table is created. You must drop and recreate a table to select a new primary key or rename key columns.
    • The primary key of a row may not be modified using the UPDATE functionality. To modify a row's primary key, the row must be deleted and re-inserted with the modified key. Such a modification is non-atomic.
    • Columns with DOUBLE, FLOAT, or BOOL types are not allowed as part of a primary key definition. Additionally, all columns that are part of a primary key definition must be NOT NULL.
  • Number of Columns - By default, Kudu will not permit the creation of tables with more than 300 columns. We recommend schema designs that use fewer columns for best performance.

  • Cell Size - No individual cell may be larger than 64KB. The cells making up a composite key are limited to a total of 16KB after the internal composite-key encoding done by Kudu. Inserting rows not conforming to these limitations will result in errors being returned to the client.

  • Valid Identifiers - Identifiers such as column and table names are now restricted to be valid UTF-8 strings. Additionally, a maximum length of 256 characters is enforced.

    See also Limitations when Using Impala with Kudu.
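
Some of the limits above can be checked on the application side before a request is sent. The following is a rough sketch under stated assumptions: the helper names are hypothetical, 64KB is interpreted as 64 * 1024 bytes, and the identifier length is counted in decoded characters. The server remains the authority on what is accepted.

```python
MAX_COLUMNS = 300            # default column limit stated above
MAX_CELL_BYTES = 64 * 1024   # assumes 64KB means 65,536 bytes
MAX_IDENTIFIER_CHARS = 256   # identifier length limit stated above


def check_identifier(name_bytes):
    """Return an error string for an invalid identifier, or None if it passes."""
    try:
        name = name_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    if len(name) > MAX_IDENTIFIER_CHARS:
        return "longer than 256 characters"
    return None


def check_cell(value_bytes):
    """Return True if a single cell value fits within the cell size limit."""
    return len(value_bytes) <= MAX_CELL_BYTES


def check_column_count(num_columns):
    """Return True if the table stays within the default column limit."""
    return num_columns <= MAX_COLUMNS
```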

Kudu 1.1.x Release Notes

Apache Kudu 1.1 includes the following new features and fixed issues.

New Features in Kudu 1.1.0

See also Issues resolved for Kudu 1.1.0 and Git changes between 1.0.x and 1.1.x.

  • The Python client has been brought up to feature parity with the Java and C++ clients and as such the package version will be brought to 1.1 with this release (from 0.3). A list of the highlights can be found below.
    • Improved Partial Row semantics

    • Range partition support

    • Scan Token API

    • Enhanced predicate support

    • Support for all Kudu data types (including a mapping of Python's datetime.datetime to UNIXTIME_MICROS)

    • Alter table support

    • Enabled Read at Snapshot for Scanners

    • Enabled Scanner Replica Selection

    • A few bug fixes for Python 3 in addition to various other improvements.

  • IN LIST predicate pushdown support was added to allow optimized execution of filters which match on a set of column values. Support for Spark, MapReduce, and Impala queries utilizing IN LIST pushdown is not yet complete.

  • The Java client now features client-side request tracing in order to help troubleshoot timeouts. Error messages are now augmented with traces that show which servers were contacted before the timeout occurred instead of just the last error. The traces also contain RPCs that were required to fulfill the client's request, such as contacting the master to discover a tablet's location. Note that the traces are not available for successful requests and are not programmatically queryable.

Performance

  • Kudu now publishes JAR files for Spark 2.0 compiled with Scala 2.11 along with the existing Spark 1.6 JAR compiled with Scala 2.10.

  • The Java client now allows configuring scanners to read from the closest replica instead of the known leader replica. The default remains the latter. Use the relevant ReplicaSelection enum with the scanner's builder to change this behavior.

Wire protocol compatibility

  • The Java client's sync API (KuduClient, KuduSession, KuduScanner) used to throw either a NonRecoverableException or a TimeoutException for a timeout, and now it's only possible for the client to throw the former.

  • The Java client's handling of errors in KuduSession was modified so that subclasses of KuduException are converted into RowErrors instead of being thrown.

Command line tools

  • The tool kudu tablet leader_step_down has been added to manually force a leader to step down.

  • The tool kudu remote_replica copy has been added to manually copy a replica from one running tablet server to another.

  • The tool kudu local_replica delete has been added to delete a replica of a tablet.

  • The kudu test loadgen tool has been added to replace the obsoleted insert-generated-rows standalone binary. The new tool is enriched with additional functionality and can be used to run load generation tests against a Kudu cluster.

Client APIs (C++/Java/Python)

  • The C++ client no longer requires the old gcc5 ABI. Which ABI is actually used depends on the compiler configuration. Some new distros (e.g. Ubuntu 16.04) will use the new ABI. Your application must use the same ABI as is used by the client library; an easy way to guarantee this is to use the same compiler to build both.

  • The C++ client's KuduSession::CountBufferedOperations() method is deprecated. Its behavior is inconsistent unless the session runs in the MANUAL_FLUSH mode. Instead, to get the number of buffered operations, count invocations of the KuduSession::Apply() method since the last KuduSession::Flush() call or, if using asynchronous flushing, since the last invocation of the callback passed into KuduSession::FlushAsync().

  • The Java client's OperationResponse.getWriteTimestamp method was renamed to getWriteTimestampRaw to emphasize that it doesn't return milliseconds, unlike what its Javadoc indicated. The renamed method was also hidden from the public APIs and should not be used.


Issues Fixed in Kudu 1.1.0

Kudu 1.0.1 Release Notes

Apache Kudu 1.0.1 is a bug fix release, with no new features or backwards incompatible changes.

Issues Fixed in Kudu 1.0.1

  • KUDU-1681: Fixed a bug in the tablet server which could cause a crash when the DNS lookup during master heartbeat failed.
  • KUDU-1660: Fixed a bug which would cause the Kudu master and tablet server to fail to start on single CPU systems.
  • KUDU-1652: Fixed a bug that would cause the C++ client, tablet server, and Java client to crash or throw an exception when attempting to scan a table with a predicate which simplifies to IS NOT NULL on a non-nullable column. For example, setting a '<= 127' predicate on an INT8 column could trigger this bug, since the predicate only filters null values.
  • KUDU-1651: Fixed a bug that would cause the tablet server to crash when evaluating a scan with predicates over a dictionary encoded column containing an entire block of null values.
  • KUDU-1623: Fixed a bug that would cause the tablet server to crash when handling UPSERT operations that only set values for the primary key columns.
  • Gerrit #4488: Fixed a bug in the Java client's KuduException class which could cause an unexpected NullPointerException to be thrown when the exception did not have an associated message.
  • KUDU-1090: Fixed a bug in the memory tracker which could cause a rare crash during tablet server startup.
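
The KUDU-1652 entry mentions a predicate that "simplifies to IS NOT NULL". The sketch below shows why a '<= 127' predicate on an INT8 column behaves that way. It is an illustration of the reasoning only: the function name and the returned strings are hypothetical and do not reflect Kudu's internal representation.

```python
INT8_MIN, INT8_MAX = -128, 127


def simplify_less_equal(bound):
    """Simplify the predicate 'column <= bound' for an INT8 column."""
    if bound >= INT8_MAX:
        # Every non-null INT8 value satisfies the predicate,
        # so only nulls are filtered out.
        return "IS NOT NULL"
    if bound < INT8_MIN:
        # No INT8 value can satisfy the predicate.
        return "NONE"
    return f"<= {bound}"
```

On a non-nullable column, the "IS NOT NULL" case filters nothing at all, which is the situation that triggered the bug.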

Kudu 1.0.0 Release Notes

After approximately a year of beta releases, Apache Kudu has reached version 1.0. This version number signifies that the development team feels that Kudu is stable enough for usage in production environments.

Kudu 1.0.0 delivers a number of new features, bug fixes, and optimizations.

To upgrade Kudu to 1.0.0, see Upgrade Parcels or Upgrade Packages.

Other Noteworthy Changes

  • This is the first non-beta release of the Apache Kudu project. (Although because Kudu is not currently integrated into CDH, it is not yet an officially supported CDH component.)

New Features in Kudu 1.0.0

See also Issues resolved for Kudu 1.0.0 and Git changes between 0.10.0 and 1.0.0.

  • Removal of multiversion concurrency control (MVCC) history is now supported. This is known as tablet history GC. This allows Kudu to reclaim disk space, where previously Kudu would keep a full history of all changes made to a given table since the beginning of time. Previously, the only way to reclaim disk space was to drop a table.

  • Kudu will still keep historical data, and the amount of history retained is controlled by setting the configuration flag --tablet_history_max_age_sec, which defaults to 15 minutes (expressed in seconds). The timestamp represented by the current time minus tablet_history_max_age_sec is known as the ancient history mark (AHM). When a compaction or flush occurs, Kudu will remove the history of changes made prior to the ancient history mark. This only affects historical data; currently-visible data will not be removed. A specialized maintenance manager background task to remove existing "cold" historical data that is not in a row affected by the normal compaction process will be added in a future release.

  • Most of Kudu’s command line tools have been consolidated under a new top-level kudu tool. This reduces the number of large binaries distributed with Kudu and also includes much-improved help output.

  • The Kudu Flume Sink now supports processing events containing Avro-encoded records, using the new AvroKuduOperationsProducer.

  • Administrative tools including kudu cluster ksck now support running against multi-master Kudu clusters.

  • The output of the ksck tool is now colorized and much easier to read.

  • The C++ client API now supports writing data in AUTO_FLUSH_BACKGROUND mode. This can provide higher throughput for ingest workloads.
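
The ancient history mark (AHM) described in the tablet history GC item above is simple arithmetic: the current time minus the configured maximum age. A minimal sketch, with hypothetical function names and times expressed in plain seconds:

```python
TABLET_HISTORY_MAX_AGE_SEC = 15 * 60  # default stated above: 15 minutes


def ancient_history_mark(now_sec, max_age_sec=TABLET_HISTORY_MAX_AGE_SEC):
    """Return the timestamp before which history may be removed."""
    return now_sec - max_age_sec


def history_is_removable(change_time_sec, now_sec):
    """Return True if a historical change is older than the AHM."""
    return change_time_sec < ancient_history_mark(now_sec)


now = 10_000
ahm = ancient_history_mark(now)
```

With the default setting, a change made 1,000 seconds ago falls before the AHM and may be removed at the next compaction or flush, while one made 500 seconds ago is retained.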

Performance

  • The performance of comparison predicates on dictionary-encoded columns has been substantially optimized. Users are encouraged to use dictionary encoding on any string or binary columns with low cardinality, especially if these columns will be filtered with predicates.

  • The Java client is now able to prune partitions from scanners based on the provided predicates. For example, an equality predicate on a hash-partitioned column will now only access those tablets that could possibly contain matching data. This is expected to improve performance for the Spark integration as well as applications using the Java client API.

  • The performance of compaction selection in the tablet server has been substantially improved. This can increase the efficiency of the background maintenance threads and improve overall throughput of heavy write workloads.

  • The policy by which the tablet server retains write-ahead log (WAL) files has been improved so that it takes into account other replicas of the tablet. This should help mitigate the spurious eviction of tablet replicas on machines that temporarily lag behind the other replicas.
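
The partition pruning item above can be sketched for the hash-partitioned case. This is a toy model: `zlib.crc32` stands in for Kudu's real hash function, which is not described in these notes, and the function names are hypothetical.

```python
import zlib


def bucket_for(key, num_buckets):
    """Assign a key to a hash bucket (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def tablets_to_scan(num_buckets, equality_key=None):
    """Return the bucket indexes a scan has to visit."""
    if equality_key is None:
        # Without an equality predicate every bucket may hold matches.
        return list(range(num_buckets))
    # An equality predicate on the hashed column pins the scan to one bucket.
    return [bucket_for(equality_key, num_buckets)]


all_tablets = tablets_to_scan(8)
pruned = tablets_to_scan(8, equality_key="host-42")
```

The pruned scan touches one bucket instead of eight, which is where the expected improvement comes from.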

Wire protocol compatibility

  • Kudu 1.0.0 maintains client-server wire-compatibility with previous releases. Applications using the Kudu client libraries may be upgraded either before, at the same time, or after the Kudu servers.

  • Kudu 1.0.0 does not maintain server-server wire compatibility with previous releases. Therefore, rolling upgrades between earlier versions of Kudu and Kudu 1.0.0 are not supported.

Incompatible Changes in Kudu 1.0.0

Command line tools

  • The kudu-pbc-dump tool has been removed. The same functionality is now implemented as kudu pbc dump.

  • The kudu-ksck tool has been removed. The same functionality is now implemented as kudu cluster ksck.

  • The cfile-dump tool has been removed. The same functionality is now implemented as kudu fs cfile dump.

  • The log-dump tool has been removed. The same functionality is now implemented as kudu wal dump and kudu local_replica dump wals.

  • The kudu-admin tool has been removed. The same functionality is now implemented within kudu table and kudu tablet.

  • The kudu-fs_dump tool has been removed. The same functionality is now implemented as kudu fs dump.

  • The kudu-ts-cli tool has been removed. The same functionality is now implemented within kudu master, kudu remote_replica, and kudu tserver.

  • The kudu-fs_list tool has been removed and some similar useful functionality has been moved under kudu local_replica.
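
When updating scripts that call the removed binaries, it may help to keep the mapping above in one place. The table below is transcribed from the list above; the helper `replacement_for` is a hypothetical convenience, not part of Kudu.

```python
# Old standalone binary -> replacement kudu subcommand(s), per the list above.
REPLACEMENTS = {
    "kudu-pbc-dump": ["kudu pbc dump"],
    "kudu-ksck": ["kudu cluster ksck"],
    "cfile-dump": ["kudu fs cfile dump"],
    "log-dump": ["kudu wal dump", "kudu local_replica dump wals"],
    "kudu-admin": ["kudu table", "kudu tablet"],
    "kudu-fs_dump": ["kudu fs dump"],
    "kudu-ts-cli": ["kudu master", "kudu remote_replica", "kudu tserver"],
    # Only some similar functionality moved here, per the notes.
    "kudu-fs_list": ["kudu local_replica"],
}


def replacement_for(old_tool):
    """Return the replacement subcommands for a removed tool, if any."""
    return REPLACEMENTS.get(old_tool, [])
```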

Configuration flags

Some configuration flags are now marked as "unsafe" and "experimental". Such flags are disallowed by default. Users may access these flags by enabling the additional flags --unlock_unsafe_flags and --unlock_experimental_flags. Usage of such flags is not recommended, as the flags may be removed or modified with no deprecation period and without notice in future Kudu releases.

Client APIs (C++/Java/Python)

The TIMESTAMP column type has been renamed to UNIXTIME_MICROS in order to reduce confusion between Kudu’s timestamp support and the timestamps supported by other systems such as Apache Hive and Apache Impala (incubating). Existing tables will automatically be updated to use the new name for the type.

Clients upgrading to the new client libraries must move to the new name for the type. Clients using old client libraries will continue to operate using the old type name, even when connected to clusters that have been upgraded. Similarly, if clients are upgraded before servers, existing timestamp columns will be available using the new type name.

KuduSession methods in the C++ library are no longer advertised as thread-safe, so that the C++ and Java Kudu client libraries share one set of semantics.

The KuduScanToken::TabletServers method in the C++ library has been removed. The same information can now be found in the KuduScanToken::tablet method.

Apache Flume Integration

The KuduEventProducer interface used to process Flume events into Kudu operations for the Kudu Flume Sink has changed, and has been renamed KuduOperationsProducer. The existing KuduEventProducers have been updated for the new interface, and have been renamed similarly.

Known Issues and Limitations of Kudu 1.0.0

Schema and Usage Limitations

  • Kudu is primarily designed for analytic use cases. You are likely to encounter issues if a single row contains multiple kilobytes of data.

  • The columns which make up the primary key must be listed first in the schema.

  • Key columns cannot be altered. You must drop and recreate a table to change its keys.

  • Key columns must not be null.

  • Columns with DOUBLE, FLOAT, or BOOL types are not allowed as part of a primary key definition.

  • Type and nullability of existing columns cannot be changed by altering the table.

  • A table's primary key cannot be changed.

  • Dropping a column does not immediately reclaim space. Compaction must run first. There is no way to run compaction manually, but dropping the table will reclaim the space immediately.

Partitioning Limitations

  • Tables must be manually pre-split into tablets using simple or compound primary keys. Automatic splitting is not yet possible. Range partitions may be added or dropped after a table has been created. See Schema Design for more information.

  • Data in existing tables cannot currently be automatically repartitioned. As a workaround, create a new table with the new partitioning and insert the contents of the old table.

Replication and Backup Limitations

  • Kudu does not currently include any built-in features for backup and restore. Users are encouraged to use tools such as Spark or Impala to export or import tables as necessary.

Impala Limitations

  • To use Kudu with Impala, you must install a special release of Impala called Impala_Kudu. Obtaining and installing a compatible Impala release is detailed in Using Apache Impala (incubating) with Kudu.

  • To use Impala_Kudu alongside an existing Impala instance, you must install using parcels.

  • Updates, inserts, and deletes via Impala are non-transactional. If a query fails part of the way through, its partial effects will not be rolled back.

  • All queries will be distributed across all Impala hosts which host a replica of the target table(s), even if a predicate on a primary key could correctly restrict the query to a single tablet. This limits the maximum concurrency of short queries made via Impala.

  • No TIMESTAMP and DECIMAL type support. (The underlying Kudu type formerly known as TIMESTAMP has been renamed to UNIXTIME_MICROS; currently, there is no Impala-compatible TIMESTAMP type.)

  • The maximum parallelism of a single query is limited to the number of tablets in a table. For good analytic performance, aim for 10 or more tablets per host or use large tables.

  • Impala is only able to push down predicates involving =, <=, >=, or BETWEEN comparisons between any column and a literal value, and < and > for integer columns only. For example, for a table with an integer key ts, and a string key name, the predicate WHERE ts >= 12345 will convert into an efficient range scan, whereas WHERE name > 'lipcon' will currently fetch all data from the table and evaluate the predicate within Impala.
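
The pushdown rule in the last item above can be written as a small decision function. This is a sketch of the rule as stated in these notes, not Impala's planner logic; the function name and its boolean parameter are hypothetical.

```python
ANY_COLUMN_OPS = {"=", "<=", ">=", "BETWEEN"}
INTEGER_ONLY_OPS = {"<", ">"}


def can_push_down(op, column_is_integer):
    """Return True if a column-versus-literal comparison can be pushed down."""
    op = op.upper()
    if op in ANY_COLUMN_OPS:
        return True
    return op in INTEGER_ONLY_OPS and column_is_integer


ts_range = can_push_down(">=", column_is_integer=True)      # WHERE ts >= 12345
name_greater = can_push_down(">", column_is_integer=False)  # WHERE name > 'lipcon'
```

The first predicate qualifies for an efficient range scan, while the second falls back to evaluation inside Impala.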

Security Limitations

  • Authentication and authorization features are not implemented.

  • Data encryption is not built in. Kudu has been reported to run correctly on systems using local block device encryption (e.g. dmcrypt).

Client and API Limitations

  • ALTER TABLE is not yet fully supported via the client APIs. More ALTER TABLE operations will become available in future releases.

Other Known Issues

The following are known bugs and issues with the current release of Kudu. They will be addressed in later releases. Note that this list is not exhaustive, and is meant to communicate only the most important known issues.

  • If the Kudu master is configured with the -log_fsync_all option, tablet servers and clients will experience frequent timeouts, and the cluster may become unusable.

  • If a tablet server has a very large number of tablets, it may take several minutes to start up. It is recommended to limit the number of tablets per server to 100 or fewer. Consider this limitation when pre-splitting your tables. If you notice slow start-up times, you can monitor the number of tablets per server in the web UI.

  • Due to a known bug in Linux kernels prior to 3.8, running Kudu on ext4 mount points may cause a subsequent fsck to fail with errors such as Logical start <N> does not match logical start <M> at next level. These errors are repairable using fsck -y, but may impact server restart time.

  • This affects RHEL/CentOS 6.8 and below. A fix is planned for RHEL/CentOS 6.9. RHEL 7.0 and higher are not affected. Ubuntu 14.04 and later are not affected. SLES 12 and later are not affected.

Issues Fixed in Kudu 1.0.0

Kudu 0.10.0 Release Notes

Kudu 0.10.0 delivers a number of new features, bug fixes, and optimizations.

See also Issues resolved for Kudu 0.10.0 and Git changes between 0.9.1 and 0.10.0.

To upgrade Kudu to 0.10.0, see Upgrade Parcels or Upgrade Packages.

Kudu 0.10.0 maintains wire-compatibility with previous releases, meaning that applications using the Kudu client libraries may be upgraded either before, at the same time, or after the Kudu servers. However, if you begin using new features of Kudu 0.10.0 such as manually range-partitioned tables, you must first upgrade all clients to this release.

After upgrading to Kudu 0.10.0, it is possible to downgrade to 0.9.x with the following exceptions:
  • Tables created in 0.10.0 will not be accessible after a downgrade to 0.9.x.

  • A multi-master setup formatted in 0.10.0 may not be downgraded to 0.9.x.

This release does not maintain full Java API or ABI compatibility with Kudu 0.9.x due to a package rename and some other small changes. See Incompatible Changes in Kudu 0.10.0 for details.

Other Noteworthy Changes

  • This is the first release of Apache Kudu as a top-level (non-incubating) project.
  • The default false positive rate for Bloom filters has been changed from 1% to 0.01%. This will increase the space consumption of Bloom filters by a factor of two (from approximately 10 bits per row to approximately 20 bits per row). This is expected to substantially improve the performance of random-write workloads at the cost of an incremental increase in disk space usage.
  • The Kudu C++ client library now has Doxygen-based API documentation available online.
  • Kudu now uses the Raft consensus algorithm even for unreplicated tables. This change simplifies code and will also allow administrators to enable replication on a previously unreplicated table. This change is internal and should not be visible to users.
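
The Bloom filter figures quoted above (roughly 10 and 20 bits per row) are consistent with the standard sizing formula for an optimally configured Bloom filter, bits per key = -ln(p) / (ln 2)^2. The sketch below applies that textbook formula; whether Kudu sizes its filters exactly this way is an assumption.

```python
import math


def bloom_bits_per_key(false_positive_rate):
    """Textbook bits-per-key for an optimally sized Bloom filter."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)


old_bits = bloom_bits_per_key(0.01)    # previous default: 1%
new_bits = bloom_bits_per_key(0.0001)  # new default: 0.01%
```

The formula gives about 9.6 bits per key at 1% and about 19.2 at 0.01%, so squaring the false positive rate doubles the space, matching the factor of two stated above.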

New Features in Kudu 0.10.0

  • Users may now manually manage the partitioning of a range-partitioned table. When a table is created, the user may specify a set of range partitions that do not cover the entire available key space. A user may add or drop range partitions to existing tables.

    This feature can be particularly helpful with time series workloads in which new partitions can be created on an hourly or daily basis. Old partitions may be efficiently dropped if the application does not need to retain historical data past a certain point.

  • Support for running Kudu clusters with multiple masters has been stabilized. Users may start a cluster with three or five masters to provide fault tolerance despite a failure of one or two masters, respectively.

    Certain tools such as ksck lack complete support for multiple masters. These deficiencies will be addressed in a following release.

  • Kudu now supports the ability to reserve a certain amount of free disk space in each of its configured data directories. If a directory's free disk space drops to less than the configured minimum, Kudu will stop writing to that directory until space becomes available. If no space is available in any configured directory, Kudu will abort.

    This feature may be configured using the --fs_data_dirs_reserved_bytes and --fs_wal_dir_reserved_bytes flags.

  • The Spark integration's KuduContext now supports four new methods for writing to Kudu tables: insertRows, upsertRows, updateRows, and deleteRows. These are now the preferred way to write to Kudu tables from Spark.
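
The reserved disk space behavior described above can be modeled as a filter over the configured directories. This is a simplified sketch: the function name is hypothetical, free space is passed in as a plain mapping rather than read from the file system, and raising an exception stands in for the server aborting.

```python
def choose_data_dirs(free_bytes_by_dir, reserved_bytes):
    """Return the directories that still have at least the reserved space free."""
    usable = [d for d, free in free_bytes_by_dir.items() if free >= reserved_bytes]
    if not usable:
        # Models the case where no configured directory has space left.
        raise RuntimeError("no data directory has enough free space")
    return usable


free = {"/data/1": 5 * 2**30, "/data/2": 200 * 2**20}
usable = choose_data_dirs(free, reserved_bytes=2**30)
```

With 1 GiB reserved, only the directory with 5 GiB free remains eligible for writes; the one with 200 MiB free is skipped until space becomes available.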

Other Improvements in Kudu 0.10.0

  • KUDU-1516: The kudu-ksck tool has been improved and now detects problems such as when a tablet does not have a majority of replicas on live tablet servers, or if those replicas aren’t in a good state. Users who currently depend on the tool to detect inconsistencies may now see failures where previously they saw none.

  • Gerrit #3477: The way operations are buffered in the Java client has been reworked. Previously, the session's buffer size was set per tablet, meaning that a buffer size of 1,000 for 10 tablets being written to allowed for 10,000 operations to be buffered at the same time. With this change, all the tablets share one buffer, so users might need to set a bigger buffer size in order to reach the same level of performance as before.

  • Gerrit #3674: Added LESS and GREATER options for column predicates.

  • KUDU-1444: Added support for passing back basic per-scan metrics, such as cache hit rate, from the server to the C++ client. See the KuduScanner::GetResourceMetrics() API for detailed usage. This feature will be supported in the Java client API in a future release.

  • KUDU-1446: Improved the order in which the tablet server evaluates predicates, so that predicates on smaller columns are evaluated first. This may improve performance on queries which apply predicates on multiple columns of different sizes.

  • KUDU-1398: Improved the storage efficiency of Kudu's internal primary key indexes. This optimization should decrease space usage and improve random access performance, particularly for workloads with lengthy primary keys.

Issues Fixed in Kudu 0.10.0

  • Gerrit #3541: Fixed a problem in the Java client whereby an RPC could be dropped when a connection to a tablet server or master was forcefully closed on the server-side while RPCs to that server were in the process of being encoded. The effect was that the RPC would not be sent, and users of the synchronous API would receive a TimeoutException. Several other Java client bugs which could cause similar spurious timeouts were also fixed in this release.
  • Gerrit #3724: Fixed a problem in the Java client whereby an RPC could be dropped when a socket timeout was fired while that RPC was being sent to a tablet server or master. This would manifest itself in the same way as Gerrit #3541.
  • KUDU-1538: Fixed a bug in which recycled block identifiers could cause the tablet server to lose data. Following this bug fix, block identifiers will no longer be reused.

Incompatible Changes in Kudu 0.10.0

  • Gerrit #3737: The Java client has been repackaged under org.apache.kudu instead of org.kududb. Import statements for Kudu classes must be modified in order to compile against 0.10.0. Wire compatibility is maintained.
  • Gerrit #3055: The Java client's synchronous API methods now throw KuduException instead of Exception. Existing code that catches Exception should still compile, but introspection of an exception's message may be impacted. This change was made to allow thrown exceptions to be queried more easily by calling KuduException.getStatus and then one of Status's methods. For example, an operation that tries to delete a table that doesn't exist would produce a Status for which isNotFound() returns true.
  • The Java client's KuduTable.getTabletsLocations set of methods is now deprecated. Additionally, they now take an exclusive end partition key instead of an inclusive key. Applications are encouraged to use the scan tokens API instead of these methods in the future.
  • The C++ API for specifying split points on range-partitioned tables has been improved to make it easier for callers to properly manage the ownership of the provided rows.
  • The TableCreator::split_rows API took a vector<const KuduPartialRow*>, which made it very difficult for the calling application to do proper error handling with cleanup when setting the fields of the KuduPartialRow. This API has now been deprecated and replaced by a new method, TableCreator::add_range_split, which allows easier use of smart pointers for safe memory management.
  • The Java client's internal buffering has been reworked. Previously, the number of buffered write operations was constrained on a per-tablet-server basis. Now, the configured maximum buffer size constrains the total number of buffered operations across all tablet servers in the cluster. This provides a more consistent bound on the memory usage of the client regardless of the size of the cluster to which it is writing. This change can negatively affect the write performance of Java clients which rely on buffered writes. Consider using the setMutationBufferSpace API to increase a session's maximum buffer size if write performance seems to be degraded after upgrading to Kudu 0.10.0.
  • The "remote bootstrap" process used to copy a tablet replica from one host to another has been renamed to "Tablet Copy". This resulted in the renaming of several RPC metrics. Any users previously explicitly fetching or monitoring metrics related to Remote Bootstrap should update their scripts to reflect the new names.
  • The SparkSQL datasource for Kudu no longer supports mode Overwrite. Users should use the new KuduContext.upsertRows method instead. Additionally, inserts using the datasource are now upserts by default. The older behavior can be restored by setting the operation parameter to insert.
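The effect of the Java client buffering change described above is mostly arithmetic. A rough sketch follows; the function names and the figures (a buffer size of 1,000 and 10 tablet servers) are illustrative assumptions and are not part of the Kudu client.

```python
def max_buffered_before(buffer_size: int, num_tablet_servers: int) -> int:
    """Old behavior: each tablet server had its own buffer of buffer_size."""
    return buffer_size * num_tablet_servers


def max_buffered_after(buffer_size: int, num_tablet_servers: int) -> int:
    """New behavior: one buffer of buffer_size is shared by all servers."""
    return buffer_size


def equivalent_buffer_size(old_buffer_size: int, num_tablet_servers: int) -> int:
    """Session buffer size that restores the old total capacity."""
    return old_buffer_size * num_tablet_servers


before = max_buffered_before(1000, 10)        # 10000 operations in flight
after = max_buffered_after(1000, 10)          # 1000 operations in flight
suggested = equivalent_buffer_size(1000, 10)  # 10000
```

The suggested value is only a starting point for setMutationBufferSpace. A larger buffer also raises the client's memory usage, which is the bound the new behavior was designed to keep predictable.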

Kudu 0.9.1 Release Notes

Kudu 0.9.1 delivers incremental bug fixes over Kudu 0.9.0. It is fully compatible with Kudu 0.9.0. See also Issues resolved for Kudu 0.9.1 and Git changes between 0.9.0 and 0.9.1.

To upgrade Kudu to 0.9.1, see Upgrade Parcels or Upgrade Packages.

Issues Fixed in Kudu 0.9.1

  • KUDU-1469 fixes a bug in Kudu's Raft consensus implementation that could cause a tablet to stop making progress after a leader election.
  • Gerrit #3456 fixes a bug in which servers under high load could store metric information in incorrect memory locations, causing crashes or data corruption.
  • Gerrit #3457 fixes a bug in which errors from the Java client would carry an incorrect error message.
  • Other small bug fixes were backported to improve stability.

Kudu 0.9.0 Release Notes

Kudu 0.9.0 delivers incremental features, improvements, and bug fixes. See also Issues resolved for Kudu 0.9 and Git changes between 0.8.0 and 0.9.0.

To upgrade Kudu to 0.9.0, see Upgrade Parcels or Upgrade Packages.

New Features in Kudu 0.9.0

  • KUDU-1306: Scan token API for creating partition-aware scan descriptors. This API simplifies executing parallel scans for clients and query engines.
  • KUDU-1002: Added support for UPSERT operations, whereby a row is inserted if it does not yet exist, but updated if it does. Support for UPSERT is included in the Java, C++, and Python APIs, but not Impala.
  • Gerrit 2848: Added a Kudu datasource for Spark. This datasource uses the Kudu client directly instead of using the MapReduce API. Predicate pushdowns for spark-sql and Spark filters are included, as well as parallel retrieval for multiple tablets and column projections. See an example of Kudu integration with Spark.
  • Gerrit 2992: Added the ability to update and insert from Spark using a Kudu datasource.
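The UPSERT semantics introduced by KUDU-1002 can be modeled with an in-memory dictionary. This is a minimal sketch of the behavior only, not the Kudu client API: the upsert function and the table layout are hypothetical, and the sketch assumes that an update touches only the columns supplied.

```python
from typing import Dict, Tuple

Row = Dict[str, object]


def upsert(table: Dict[Tuple, Row], key: Tuple, values: Row) -> str:
    """Insert the row if the key is absent; otherwise update it in place."""
    if key in table:
        table[key].update(values)
        return "updated"
    table[key] = dict(values)
    return "inserted"


table: Dict[Tuple, Row] = {}
first = upsert(table, (1,), {"name": "alice", "visits": 1})
second = upsert(table, (1,), {"visits": 2})
```

The first call inserts the row because the key (1,) is new. The second call finds the key and updates the visits column, leaving name untouched.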

Other Improvements and Changes in Kudu 0.9.0

All Kudu clients have longer default timeout values, as listed below.
Java
  • The default operation timeout and the default admin operation timeout are now set to 30 seconds instead of 10.
  • The default socket read timeout is now 10 seconds instead of 5.
C++
  • The default admin timeout is now 30 seconds instead of 10.
  • The default RPC timeout is now 10 seconds instead of 5.
  • The default scan timeout is now 30 seconds instead of 15.
Some default settings related to I/O behavior during flushes and compactions have been changed:
  • The default for flush_threshold_mb has been increased from 64 MB to 1000 MB.
  • The default for cfile_do_on_finish has been changed from close to flush. Experiments using YCSB indicate that these values provide better throughput for write-heavy applications on typical server hardware.
  • KUDU-1415: Added statistics in the Java client, such as the number of bytes written and the number of operations applied.
  • KUDU-1451: Tablet servers take less time to restart when the tablet server must clean up many previously deleted tablets. Tablets are now cleaned up after they are deleted.

Issues Fixed in Kudu 0.9.0

  • KUDU-678: Fixed a leak that occurred during DiskRowSet compactions where tiny blocks were still written to disk even if there were no REDO records. With the default block manager, this often resulted in block containers with thousands of tiny blocks.
  • KUDU-1437: Fixed a data corruption issue that occurred after compacting sequences of negative INT32 values in a column that was configured with RLE encoding.

Incompatible Changes in Kudu 0.9.0

  • The KuduTableInputFormat class has changed the way in which it handles scan predicates, including how it serializes predicates to the job configuration object. The new configuration key is kudu.mapreduce.encoded.predicate. Clients using the TableInputFormatConfigurator are not affected.
  • The kudu-spark sub-project has been renamed to follow naming conventions for Scala. The new name is kudu-spark_2.10.
  • Default table partitioning has been removed. All tables must now be created with explicit partitioning. Existing tables are unaffected. See the schema design guide for more details.

Limitations of Kudu 0.9.0

Kudu 0.9.0 has the same limitations as Kudu 0.8, listed in Limitations of Kudu 0.7.0.

Upgrade Notes for Kudu 0.9.0

Before upgrading to Kudu 0.9.0, see Incompatible Changes in Kudu 0.9.0.

Kudu 0.8.0 Release Notes

Kudu 0.8.0 delivers incremental features, improvements, and bug fixes over the previous versions. See also Issues resolved for Kudu 0.8 and Git changes between 0.7.1 and 0.8.0.

To upgrade Kudu to 0.8.0, see Upgrade Parcels or Upgrade Packages.

New Features in Kudu 0.8.0

  • KUDU-431: A simple Flume sink has been implemented.

Other Improvements in Kudu 0.8.0

  • KUDU-839: Java RowError now uses an enum error code.
  • Gerrit 2138: The handling of column predicates has been re-implemented in the server and clients.
  • KUDU-1379: Partition pruning has been implemented for C++ clients (but not yet for the Java client). This feature allows you to avoid reading a tablet if you know it does not serve the row keys you are querying.
  • Gerrit 2641: Kudu now uses earliest-deadline-first RPC scheduling and rejection. This changes the behavior of the RPC service queue to prevent unfairness when processing a backlog of RPC threads and to increase the likelihood that an RPC will be processed before it can time out.
  • Gerrit 2239: The concept of "feature flags" was introduced in order to manage compatibility between different Kudu versions. One case where this is helpful is if a newer client attempts to use a feature unsupported by the currently-running tablet server. Rather than receiving a cryptic error, the user gets an error message that is easier to interpret. This is an internal change for Kudu system developers and requires no action by users of the clients or API.
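Earliest-deadline-first scheduling and rejection (Gerrit 2641) can be sketched with a bounded priority queue. This is an illustration rather than Kudu's RPC service queue: the DeadlineQueue class is hypothetical, and the rejection policy (dropping the call whose deadline is furthest in the future when the queue is full) is an assumption made for this example.

```python
import heapq
from typing import List, Optional, Tuple


class DeadlineQueue:
    """Bounded queue that serves the call with the earliest deadline first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._heap: List[Tuple[float, str]] = []  # (deadline, call id)

    def put(self, deadline: float, call_id: str) -> Optional[str]:
        """Queue a call; return the id of a rejected call, or None.

        Assumption: when the queue is full, the call whose deadline is
        furthest in the future is the one rejected.
        """
        heapq.heappush(self._heap, (deadline, call_id))
        if len(self._heap) <= self.capacity:
            return None
        victim = max(self._heap)
        self._heap.remove(victim)
        heapq.heapify(self._heap)  # restore the heap after the removal
        return victim[1]

    def get(self) -> str:
        """Remove and return the queued call with the earliest deadline."""
        return heapq.heappop(self._heap)[1]


queue = DeadlineQueue(capacity=2)
queue.put(deadline=30.0, call_id="scan")
queue.put(deadline=10.0, call_id="write")
rejected = queue.put(deadline=20.0, call_id="read")
served = [queue.get(), queue.get()]
```

Serving calls in deadline order means the calls closest to timing out are handled first, instead of waiting behind calls that arrived earlier but have more time to spare.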

Issues Fixed in Kudu 0.8.0

  • KUDU-1337: Tablets from tables that were deleted might be unnecessarily re-bootstrapped when the leader gets the notification to delete itself after the replicas do.
  • KUDU-969: If a tablet server shuts down while compacting a rowset and receiving updates for it, it might immediately crash upon restart while bootstrapping that rowset's tablet.
  • KUDU-1354: Due to a bug in the Kudu implementation of MVCC where row locks were released before the MVCC commit happened, flushed data would include out-of-order transactions, triggering a crash on the next compaction.
  • KUDU-1322: The C++ client now retries write operations if the tablet it is trying to reach has already been deleted.
  • Gerrit 2571: Due to a bug in the Java client, users were unable to close the kudu-spark shell because of lingering non-daemon threads.

Incompatible Changes in Kudu 0.8.0

0.8.0 clients are not fully compatible with servers running Kudu 0.7.1 or lower. In particular, scans that specify column predicates will fail. To work around this issue, upgrade all Kudu servers before upgrading clients.

Limitations of Kudu 0.8.0

Kudu 0.8.0 has the same limitations as Kudu 0.7.0, listed in Limitations of Kudu 0.7.0.

Upgrade Notes for Kudu 0.8.0

Before upgrading to Kudu 0.8.0, see Incompatible Changes in Kudu 0.8.0.

Kudu 0.7.1 Release Notes

Kudu 0.7.1 is a bug-fix release for 0.7.0. Users of Kudu 0.7.0 should upgrade to this version. See also Issues resolved for Kudu 0.7.1 and Git changes between 0.7.0 and 0.7.1.

To upgrade Kudu to 0.7.1, see Upgrade Parcels or Upgrade Packages.

Issues Fixed in Kudu 0.7.1

For a list of issues fixed in Kudu 0.7.1, see this JIRA query. The following notable fixes are included:

  • KUDU-1325 fixes a tablet server crash that could occur during table deletion. In some cases, while a table was being deleted, other replicas would attempt to re-replicate tablets to servers that had already processed the deletion. This could trigger a race condition that caused a crash.
  • KUDU-1341 fixes a potential data corruption and crash that could happen shortly after tablet server restarts in workloads that repeatedly delete and re-insert rows with the same primary key. In most cases, this corruption affected only a single replica and could be repaired by re-replicating from another.
  • KUDU-1343 fixes a bug in the Java client that occurs when a scanner has to scan multiple batches from one tablet and then start scanning from another. In particular, this affected any scans using the Java client that read large numbers of rows from multi-tablet tables.
  • KUDU-1345 fixes a bug where the hybrid clock could jump backwards, resulting in a crash followed by an inability to restart the affected tablet server.
  • KUDU-1360 fixes a bug in the kudu-spark module that prevented reading rows with NULL values.

Limitations of Kudu 0.7.1

Kudu 0.7.1 has the same limitations as Kudu 0.7.0, listed in Limitations of Kudu 0.7.0.

Upgrade Notes For Kudu 0.7.1

Kudu 0.7.1 has the same upgrade notes as Kudu 0.7.0, listed in Upgrade Notes For Kudu 0.7.0.

Kudu 0.7.0 Release Notes

Kudu 0.7.0 is the first release as part of the Apache Incubator and includes a number of changes, new features, improvements, and fixes. See also Issues resolved for Kudu 0.7.0 and Git changes between 0.6.0 and 0.7.0.

To upgrade Kudu to 0.7, see Upgrade Parcels or Upgrade Packages.

New Features in Kudu 0.7.0

Initial work for Spark integration
With the goal of Spark integration, a new kuduRDD API has been added that wraps newAPIHadoopRDD and includes a default source for Spark SQL.

Other Improvements in Kudu 0.7.0

  • Support for RHEL 7, CentOS 7, and SLES 12 has been added.
  • The Python client is no longer considered experimental.
  • The file block manager performance is improved, but it is still not recommended for real-world use.
  • The master now attempts to spread tablets more evenly across the cluster during table creation. This has no impact on existing tables, but improves the speed at which under-replicated tablets are re-replicated after a tablet server failure.
  • All licensing documents have been modified to adhere to ASF guidelines.
  • The C++ client library is now explicitly built against the old GCC 5 ABI. If you use GCC 5 to build a Kudu application, your application must use the old ABI as well. This is typically achieved by defining the _GLIBCXX_USE_CXX11_ABI macro as 0 at compile time when building your application. For more information, see GCC 5 and the C++ 11 ABI.

Issues Fixed in Kudu 0.7.0

For a list of issues fixed in Kudu 0.7, see this JIRA query.

Incompatible Changes in Kudu 0.7.0

  • The C++ client includes a new API, KuduScanBatch, which performs better when a large number of small rows are returned in a batch. The old API of vector<KuduRowResult> is deprecated.
  • The default replication factor has been changed from 1 to 3. Existing tables continue to use the replication factor they were created with. Applications that create tables may not work properly if they assume a replication factor of 1 and fewer than 3 replicas are available. To use the previous default replication factor, start the master with the configuration flag --default_num_replicas=1.
  • The Python client has been rewritten, with a focus on improving code quality and testing. The read path (scanners) has been improved by adding many of the features already supported by the C++ and Java clients. The Python client is no longer considered experimental.

Limitations of Kudu 0.7.0

Operating System Limitations

  • RHEL 7 or 6.4 or newer, CentOS 7 or 6.4 or newer, and Ubuntu Trusty are the only operating systems supported for installation in the public beta. Others may work but have not been tested. You can build Kudu from source on SLES 12, but binaries are not provided.

Storage Limitations

  • Kudu has been tested with up to 4 TB of data per tablet server. More testing is needed for denser storage configurations.

Schema Limitations

  • Testing with more than 20 columns has been limited.
  • Multi-kilobyte rows have not been thoroughly tested.
  • The columns that make up the primary key must be listed first in the schema.
  • Key columns cannot be altered. You must drop and re-create a table to change its keys.
  • Key columns must not be null.
  • Columns with DOUBLE, FLOAT, or BOOL types are not allowed as part of a primary key definition.
  • Type and nullability of existing columns cannot be changed by altering the table.
  • A table’s primary key cannot be changed.
  • Dropping a column does not immediately reclaim space; compaction must run first. You cannot run compaction manually. Dropping the table reclaims space immediately.

Ingest Limitations

  • Ingest through Sqoop or Flume is not supported in the public beta. For bulk ingest, use Impala’s CREATE TABLE AS SELECT functionality or use Kudu's Java or C++ API.
  • Tables must be manually pre-split into tablets using simple or compound primary keys. Automatic splitting is not yet possible. Instead, add split rows at table creation.
  • Tablets cannot currently be merged. Instead, create a new table with the contents of the old tables to be merged.
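Pre-splitting with split rows can be pictured as a sorted lookup. The sketch below is a model, not the Kudu API: the tablet_for_key helper and the sample compound keys are hypothetical, and it assumes that a key equal to a split row falls into the tablet that begins at that split row.

```python
from bisect import bisect_right
from typing import List, Tuple

Key = Tuple[int, str]  # a compound primary key, e.g. (ts, name)


def tablet_for_key(split_rows: List[Key], key: Key) -> int:
    """Return the index of the tablet that covers key.

    N sorted split rows produce N + 1 tablets. Assumption: a key equal to
    a split row belongs to the tablet that starts at that split row.
    """
    return bisect_right(split_rows, key)


split_rows = [(1000, ""), (2000, ""), (3000, "")]
num_tablets = len(split_rows) + 1
placements = [tablet_for_key(split_rows, k)
              for k in [(5, "a"), (1000, ""), (2500, "smith"), (9999, "z")]]
```

Three split rows yield four tablets. Because the tablets cannot be split or merged later, the split rows chosen at table creation fix how evenly keys spread across them.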

Cloudera Manager Limitations

  • Some metrics, such as latency histograms, are not yet available in Cloudera Manager.
  • Some service and role chart pages are still under development. More charts and metrics will be visible in future releases.

Replication and Backup Limitations

  • Replication and failover of Kudu masters is considered experimental. Cloudera recommends running a single master and periodically performing a manual backup of its data directories.

Impala Limitations

  • To use Kudu with Impala, you must install a special release of Impala. Obtaining and installing a compatible Impala release is detailed in Using Apache Impala (incubating) with Kudu.
  • To use Impala_Kudu alongside an existing Impala instance, you must install using parcels.
  • Updates, inserts, and deletes through Impala are nontransactional. If a query fails, any partial effects are not rolled back.
  • All queries are distributed across all Impala nodes that host a replica of the target table(s), even if a predicate on a primary key could correctly restrict the query to a single tablet. This limits the maximum concurrency of short queries made through Impala.
  • The timestamp and decimal types are not supported.
  • The maximum parallelism of a single query is limited to the number of tablets in a table. To optimize analytic performance, spread your data across 10 or more tablets per host for a large table.
  • Impala can push down only predicates involving =, <=, >=, or BETWEEN comparisons between a column and a literal value. Impala pushes down predicates < and > for integer columns only. For example, for a table with an integer key ts, and a string key name, the predicate WHERE ts >= 12345 converts to an efficient range scan, whereas WHERE name > smith currently fetches all data from the table and evaluates the predicate within Impala.
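The predicate pushdown rule in the last item above can be written as a small decision function. This is a sketch of the rule as stated in these notes, not Impala's planner: can_push_down and the type names "integer" and "string" are hypothetical.

```python
ANY_TYPE_OPS = {"=", "<=", ">=", "BETWEEN"}
INTEGER_ONLY_OPS = {"<", ">"}


def can_push_down(op: str, column_type: str) -> bool:
    """Return True if the comparison would be pushed down to Kudu."""
    op = op.upper()
    if op in ANY_TYPE_OPS:
        return True
    if op in INTEGER_ONLY_OPS:
        return column_type.lower() == "integer"
    return False


ts_pushed = can_push_down(">=", "integer")   # WHERE ts >= 12345
name_pushed = can_push_down(">", "string")   # WHERE name > smith
```

A predicate for which the function returns False is evaluated inside Impala after all rows have been fetched, which is why the second query above scans the whole table.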

Security Limitations

  • Authentication and authorization are not included in the public beta.
  • Data encryption is not included in the public beta.

Client and API Limitations

  • Potentially incompatible C++ and Java API changes may be required during the public beta.
  • ALTER TABLE is not yet fully supported through the client APIs. More ALTER TABLE operations will be available in future betas.

Application Integration Limitations

  • The Spark DataFrame implementation is not yet complete.

Other Known Issues

The following are known bugs and issues with the current beta release. They will be addressed in later beta releases.

  • Building Kudu from source using gcc 4.6 or 4.7 causes runtime and test failures. Be sure you are using a different version of gcc if you build Kudu from source.
  • If the Kudu master is configured with the -log_fsync_all option, tablet servers and clients will experience frequent timeouts, and the cluster may become unusable.
  • If a tablet server has a very large number of tablets, it may take several minutes to start up. Limit the number of tablets per server to 100 or fewer, and consider this limitation when pre-splitting your tables. If you notice slow start-up times, you can monitor the number of tablets per server in the web UI.
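When pre-splitting, the tablets-per-server guideline can be checked with a quick estimate. The sketch below assumes that replicas are spread evenly across tablet servers and that every replica counts toward the limit of 100. The function names and the sample cluster (200 tablets, replication factor 3, 5 tablet servers) are hypothetical.

```python
import math


def tablets_per_server(num_tablets: int, replication_factor: int,
                       num_tablet_servers: int) -> int:
    """Estimate replicas hosted per server, assuming an even spread."""
    total_replicas = num_tablets * replication_factor
    return math.ceil(total_replicas / num_tablet_servers)


def max_tablets(replication_factor: int, num_tablet_servers: int,
                limit_per_server: int = 100) -> int:
    """Largest total tablet count that stays within the per-server limit."""
    return (limit_per_server * num_tablet_servers) // replication_factor


estimate = tablets_per_server(num_tablets=200, replication_factor=3,
                              num_tablet_servers=5)
budget = max_tablets(replication_factor=3, num_tablet_servers=5)
```

In this example the estimate of 120 replicas per server exceeds the guideline, so the tables would need fewer tablets (at most 166 in total) or the cluster would need more tablet servers.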

Upgrade Notes For Kudu 0.7.0

  • Kudu 0.7.0 maintains wire compatibility with Kudu 0.6.0. A Kudu 0.7.0 client can communicate with a Kudu 0.6.0 cluster, and vice versa. For that reason, you do not need to upgrade client JARs at the same time the cluster is upgraded.

  • The same wire compatibility guarantees apply to the Impala_Kudu fork that was released with Kudu 0.5.0.
  • Review Incompatible Changes in Kudu 0.7.0 before upgrading to Kudu 0.7.

See Upgrading Kudu for instructions.

Kudu 0.6 Release Notes

To upgrade Kudu to 0.6, see Upgrade Parcels or Upgrade Packages.

New Features in Kudu 0.6

Row Error Reporting
The Java client includes new methods countPendingErrors() and getPendingErrors() on KuduSession. These methods allow you to count and retrieve outstanding row errors when configuring sessions with AUTO_FLUSH_BACKGROUND.
New Server-Side Metrics
New server-level metrics allow you to monitor CPU usage and context switching.

Issues Fixed in Kudu 0.6

For a list of issues addressed in Kudu 0.6, see this JIRA query.

Limitations of Kudu 0.6

Operating System Limitations

  • RHEL 6.4 or newer, CentOS 6.4 or newer, and Ubuntu Trusty are the only operating systems supported for installation in the public beta. Others may work but have not been tested.

Storage Limitations

  • Kudu has been tested with up to 4 TB of data per tablet server. More testing is needed for denser storage configurations.

Schema Limitations

  • Testing with more than 20 columns has been limited.
  • Multi-kilobyte rows have not been thoroughly tested.
  • The columns which make up the primary key must be listed first in the schema.
  • Key columns cannot be altered. You must drop and recreate a table to change its keys.
  • Key columns must not be null.
  • Columns with DOUBLE, FLOAT, or BOOL types are not allowed as part of a primary key definition.
  • Type and nullability of existing columns cannot be changed by altering the table.
  • A table’s primary key cannot be changed.
  • Dropping a column does not immediately reclaim space. Compaction must run first. There is no way to run compaction manually, but dropping the table will reclaim the space immediately.

Ingest Limitations

  • Ingest using Sqoop or Flume is not supported in the public beta. The recommended approach for bulk ingest is to use Impala’s CREATE TABLE AS SELECT functionality or use Kudu's Java or C++ API.
  • Tables must be manually pre-split into tablets using simple or compound primary keys. Automatic splitting is not yet possible. Instead, add split rows at table creation.
  • Tablets cannot currently be merged. Instead, create a new table with the contents of the old tables to be merged.

Cloudera Manager Limitations

  • Some metrics, such as latency histograms, are not yet available in Cloudera Manager.
  • Some service and role chart pages are still under development. More charts and metrics will be visible in future releases.

Replication and Backup Limitations

  • Replication and failover of Kudu masters is considered experimental. It is recommended to run a single master and periodically perform a manual backup of its data directories.

Impala Limitations

  • To use Kudu with Impala, you must install a special release of Impala. Obtaining and installing a compatible Impala release is detailed in Using Apache Impala (incubating) with Kudu.
  • To use Impala_Kudu alongside an existing Impala instance, you must install using parcels.
  • Updates, inserts, and deletes using Impala are non-transactional. If a query fails part of the way through, its partial effects will not be rolled back.
  • All queries will be distributed across all Impala nodes which host a replica of the target table(s), even if a predicate on a primary key could correctly restrict the query to a single tablet. This limits the maximum concurrency of short queries made using Impala.
  • No timestamp and decimal type support.
  • The maximum parallelism of a single query is limited to the number of tablets in a table. For good analytic performance, aim for 10 or more tablets per host for large tables.
  • Impala is only able to push down predicates involving =, <=, >=, or BETWEEN comparisons between a column and a literal value. Impala pushes down predicates < and > for integer columns only. For example, for a table with an integer key ts, and a string key name, the predicate WHERE ts >= 12345 will convert into an efficient range scan, whereas WHERE name > smith will currently fetch all data from the table and evaluate the predicate within Impala.

Security Limitations

  • Authentication and authorization are not included in the public beta.
  • Data encryption is not included in the public beta.

Client and API Limitations

  • Potentially-incompatible C++ and Java API changes may be required during the public beta.
  • ALTER TABLE is not yet fully supported using the client APIs. More ALTER TABLE operations will become available in future betas.
  • The Python API is not supported.

Application Integration Limitations

  • The Spark DataFrame implementation is not yet complete.

Other Known Issues

The following are known bugs and issues with the current beta release. They will be addressed in later beta releases.

  • Building Kudu from source using gcc 4.6 or 4.7 causes runtime and test failures. Be sure you are using a different version of gcc if you build Kudu from source.
  • If the Kudu master is configured with the -log_fsync_all option, tablet servers and clients will experience frequent timeouts, and the cluster may become unusable.
  • If a tablet server has a very large number of tablets, it may take several minutes to start up. It is recommended to limit the number of tablets per server to 100 or fewer. Consider this limitation when pre-splitting your tables. If you notice slow start-up times, you can monitor the number of tablets per server in the web UI.

Upgrade Notes For Kudu 0.6

  • Kudu 0.6.0 maintains wire compatibility with Kudu 0.5.0. This means that a Kudu 0.6.0 client can communicate with a Kudu 0.5.0 cluster, and vice versa. For that reason, you do not need to upgrade client JARs at the same time the cluster is upgraded.

  • The same wire compatibility guarantees apply to the Impala_Kudu fork that was released with Kudu 0.5.0 and 0.6.0.
  • The Kudu 0.6.0 client API is not compatible with the Kudu 0.5.0 client API. See the Kudu 0.6.0 release notes for details.

See Upgrading Kudu for instructions.

Kudu 0.5 Release Notes

Limitations of Kudu 0.5

Operating System Limitations

  • RHEL 6.4 or newer, CentOS 6.4 or newer, and Ubuntu Trusty are the only operating systems supported for installation in the public beta. Others may work but have not been tested.

Storage Limitations

  • Kudu has been tested with up to 4 TB of data per tablet server. More testing is needed for denser storage configurations.

Schema Limitations

  • Testing with more than 20 columns has been limited.
  • Multi-kilobyte rows have not been thoroughly tested.
  • The columns which make up the primary key must be listed first in the schema.
  • Key columns cannot be altered. You must drop and recreate a table to change its keys.
  • Key columns must not be null.
  • Columns with DOUBLE, FLOAT, or BOOL types are not allowed as part of a primary key definition.
  • Type and nullability of existing columns cannot be changed by altering the table.
  • A table’s primary key cannot be changed.
  • Dropping a column does not immediately reclaim space. Compaction must run first. There is no way to run compaction manually, but dropping the table will reclaim the space immediately.

Ingest Limitations

  • Ingest using Sqoop or Flume is not supported in the public beta. The recommended approach for bulk ingest is to use Impala’s CREATE TABLE AS SELECT functionality or use Kudu's Java or C++ API.
  • Tables must be manually pre-split into tablets using simple or compound primary keys. Automatic splitting is not yet possible. Instead, add split rows at table creation.
  • Tablets cannot currently be merged. Instead, create a new table with the contents of the old tables to be merged.

Cloudera Manager Limitations

  • Some metrics, such as latency histograms, are not yet available in Cloudera Manager.
  • Some service and role chart pages are still under development. More charts and metrics will be visible in future releases.

Replication and Backup Limitations

  • Replication and failover of Kudu masters is considered experimental. It is recommended to run a single master and periodically perform a manual backup of its data directories.

Impala Limitations

  • To use Kudu with Impala, you must install a special release of Impala. Obtaining and installing a compatible Impala release is detailed in Using Apache Impala (incubating) with Kudu.
  • To use Impala_Kudu alongside an existing Impala instance, you must install using parcels.
  • Updates, inserts, and deletes using Impala are non-transactional. If a query fails part of the way through, its partial effects will not be rolled back.
  • All queries will be distributed across all Impala nodes which host a replica of the target table(s), even if a predicate on a primary key could correctly restrict the query to a single tablet. This limits the maximum concurrency of short queries made using Impala.
  • No timestamp and decimal type support.
  • The maximum parallelism of a single query is limited to the number of tablets in a table. For good analytic performance, aim for 10 or more tablets per host for large tables.
  • Impala is only able to push down predicates involving =, <=, >=, or BETWEEN comparisons between a column and a literal value. Impala pushes down predicates < and > for integer columns only. For example, for a table with an integer key ts, and a string key name, the predicate WHERE ts >= 12345 will convert into an efficient range scan, whereas WHERE name > smith will currently fetch all data from the table and evaluate the predicate within Impala.

Security Limitations

  • Authentication and authorization are not included in the public beta.
  • Data encryption is not included in the public beta.

Client and API Limitations

  • Potentially-incompatible C++ and Java API changes may be required during the public beta.
  • ALTER TABLE is not yet fully supported using the client APIs. More ALTER TABLE operations will become available in future betas.
  • The Python API is not supported.

Application Integration Limitations

  • The Spark DataFrame implementation is not yet complete.

Other Known Issues

The following are known bugs and issues with the current beta release. They will be addressed in later beta releases.

  • Building Kudu from source using gcc 4.6 or 4.7 causes runtime and test failures. Be sure you are using a different version of gcc if you build Kudu from source.
  • If the Kudu master is configured with the -log_fsync_all option, tablet servers and clients will experience frequent timeouts, and the cluster may become unusable.
  • If a tablet server has a very large number of tablets, it may take several minutes to start up. It is recommended to limit the number of tablets per server to 100 or fewer. Consider this limitation when pre-splitting your tables. If you notice slow start-up times, you can monitor the number of tablets per server in the web UI.

Next Steps