Monday, July 7, 2014

Custom Cassandra Data Types

In the blog post Connecting to Cassandra from Java, I mentioned that one advantage for Java developers of Cassandra being implemented in Java is the ability to create custom Cassandra data types. In this post, I outline how to do this in greater detail.

Cassandra has numerous built-in data types, but there are situations in which one may want to add a custom type. Cassandra custom data types are implemented in Java by extending the org.apache.cassandra.db.marshal.AbstractType class. The class that extends this must ultimately implement three methods with the following signatures:

public ByteBuffer fromString(final String) throws MarshalException
public TypeSerializer getSerializer()
public int compare(Object, Object)

This post's example implementation of AbstractType is shown in the next code listing.

UnitedStatesState.java - Extends AbstractType
package dustin.examples.cassandra.cqltypes;

import org.apache.cassandra.db.marshal.AbstractType;
import org.apache.cassandra.serializers.MarshalException;
import org.apache.cassandra.serializers.TypeSerializer;

import java.nio.ByteBuffer;

/**
 * Representation of a state in the United States that
 * can be persisted to Cassandra database.
 */
public class UnitedStatesState extends AbstractType
{
   public static final UnitedStatesState instance = new UnitedStatesState();

   @Override
   public ByteBuffer fromString(final String stateName) throws MarshalException
   {
      return getStateAbbreviationAsByteBuffer(stateName);
   }

   @Override
   public TypeSerializer getSerializer()
   {
      return UnitedStatesStateSerializer.instance;
   }

   @Override
   public int compare(Object o1, Object o2)
   {
      if (o1 == null && o2 == null)
      {
         return 0;
      }
      else if (o1 == null)
      {
         return 1;
      }
      else if (o2 == null)
      {
         return -1;
      }
      else
      {
         return o1.toString().compareTo(o2.toString());
      }
   }

   /**
    * Provide standard two-letter abbreviation for United States
    * state whose state name is provided.
    *
    * @param stateName Name of state whose abbreviation is desired.
    * @return State's abbreviation as a ByteBuffer; will return "UK"
    *    if provided state name is unexpected value.
    */
   private ByteBuffer getStateAbbreviationAsByteBuffer(final String stateName)
   {
      final String upperCaseStateName = stateName != null ? stateName.toUpperCase().replace(" ", "_") : "UNKNOWN";
      String abbreviation;
      try
      {
         abbreviation =  upperCaseStateName.length() == 2
                       ? State.fromAbbreviation(upperCaseStateName).getStateAbbreviation()
                       : State.valueOf(upperCaseStateName).getStateAbbreviation();
      }
      catch (Exception exception)
      {
         abbreviation = State.UNKNOWN.getStateAbbreviation();
      }
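      // Encode explicitly as UTF-8 to match the charset used by UnitedStatesStateSerializer.
      return ByteBuffer.wrap(abbreviation.getBytes(java.nio.charset.StandardCharsets.UTF_8));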
   }
}

The above class listing references the State enum, which is shown next.

State.java
package dustin.examples.cassandra.cqltypes;

/**
 * Representation of state in the United States.
 */
public enum State
{
   ALABAMA("Alabama", "AL"),
   ALASKA("Alaska", "AK"),
   ARIZONA("Arizona", "AZ"),
   ARKANSAS("Arkansas", "AR"),
   CALIFORNIA("California", "CA"),
   COLORADO("Colorado", "CO"),
   CONNECTICUT("Connecticut", "CT"),
   DELAWARE("Delaware", "DE"),
   DISTRICT_OF_COLUMBIA("District of Columbia", "DC"),
   FLORIDA("Florida", "FL"),
   GEORGIA("Georgia", "GA"),
   HAWAII("Hawaii", "HI"),
   IDAHO("Idaho", "ID"),
   ILLINOIS("Illinois", "IL"),
   INDIANA("Indiana", "IN"),
   IOWA("Iowa", "IA"),
   KANSAS("Kansas", "KS"),
   KENTUCKY("Kentucky", "KY"),
   LOUISIANA("Louisiana", "LA"),
   MAINE("Maine", "ME"),
   MARYLAND("Maryland", "MD"),
   MASSACHUSETTS("Massachusetts", "MA"),
   MICHIGAN("Michigan", "MI"),
   MINNESOTA("Minnesota", "MN"),
   MISSISSIPPI("Mississippi", "MS"),
   MISSOURI("Missouri", "MO"),
   MONTANA("Montana", "MT"),
   NEBRASKA("Nebraska", "NE"),
   NEVADA("Nevada", "NV"),
   NEW_HAMPSHIRE("New Hampshire", "NH"),
   NEW_JERSEY("New Jersey", "NJ"),
   NEW_MEXICO("New Mexico", "NM"),
   NORTH_CAROLINA("North Carolina", "NC"),
   NORTH_DAKOTA("North Dakota", "ND"),
   NEW_YORK("New York", "NY"),
   OHIO("Ohio", "OH"),
   OKLAHOMA("Oklahoma", "OK"),
   OREGON("Oregon", "OR"),
   PENNSYLVANIA("Pennsylvania", "PA"),
   RHODE_ISLAND("Rhode Island", "RI"),
   SOUTH_CAROLINA("South Carolina", "SC"),
   SOUTH_DAKOTA("South Dakota", "SD"),
   TENNESSEE("Tennessee", "TN"),
   TEXAS("Texas", "TX"),
   UTAH("Utah", "UT"),
   VERMONT("Vermont", "VT"),
   VIRGINIA("Virginia", "VA"),
   WASHINGTON("Washington", "WA"),
   WEST_VIRGINIA("West Virginia", "WV"),
   WISCONSIN("Wisconsin", "WI"),
   WYOMING("Wyoming", "WY"),
   UNKNOWN("Unknown", "UK");

   private String stateName;

   private String stateAbbreviation;

   State(final String newStateName, final String newStateAbbreviation)
   {
      this.stateName = newStateName;
      this.stateAbbreviation = newStateAbbreviation;
   }

   public String getStateName()
   {
      return this.stateName;
   }

   public String getStateAbbreviation()
   {
      return this.stateAbbreviation;
   }

   public static State fromAbbreviation(final String candidateAbbreviation)
   {
      State match = UNKNOWN;
      if (candidateAbbreviation != null && candidateAbbreviation.length() == 2)
      {
         final String upperAbbreviation = candidateAbbreviation.toUpperCase();
         for (final State state : State.values())
         {
            if (state.stateAbbreviation.equals(upperAbbreviation))
            {
               match = state;
            }
         }
      }
      return match;
   }
}
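
To see the name-to-abbreviation normalization in isolation, here is a minimal sketch that runs without Cassandra on the classpath. The class StateNormalizationSketch and its trimmed-down MiniState enum are hypothetical stand-ins that I made up for illustration (only a few states are listed); the logic is meant to mirror getStateAbbreviationAsByteBuffer(String) and State.fromAbbreviation(String) rather than replace them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical, JDK-only sketch of the normalization performed by fromString. */
public class StateNormalizationSketch
{
   /** Trimmed-down stand-in for the State enum; only a few constants are listed. */
   enum MiniState
   {
      NEW_YORK("NY"), RHODE_ISLAND("RI"), UTAH("UT"), UNKNOWN("UK");

      private final String abbreviation;

      MiniState(final String newAbbreviation)
      {
         this.abbreviation = newAbbreviation;
      }

      static MiniState fromAbbreviation(final String candidate)
      {
         for (final MiniState state : values())
         {
            if (state.abbreviation.equals(candidate))
            {
               return state;
            }
         }
         return UNKNOWN;
      }
   }

   /** Maps a full state name or a two-letter abbreviation to the stored abbreviation. */
   public static String normalize(final String stateName)
   {
      final String key = stateName != null
                       ? stateName.toUpperCase(Locale.ROOT).replace(' ', '_')
                       : "UNKNOWN";
      try
      {
         return key.length() == 2
              ? MiniState.fromAbbreviation(key).abbreviation
              : MiniState.valueOf(key).abbreviation;
      }
      catch (IllegalArgumentException unrecognizedName)
      {
         // valueOf rejects names that are not enum constants; fall back to "UK".
         return MiniState.UNKNOWN.abbreviation;
      }
   }

   /** Wraps the normalized abbreviation the way the custom type hands it to Cassandra. */
   public static ByteBuffer toByteBuffer(final String stateName)
   {
      return ByteBuffer.wrap(normalize(stateName).getBytes(StandardCharsets.UTF_8));
   }

   public static void main(final String[] arguments)
   {
      System.out.println(normalize("New York"));  // NY
      System.out.println(normalize("ri"));        // RI
      System.out.println(normalize("Atlantis"));  // UK
   }
}
```

The two-character check is what lets the same column accept either a full name or an abbreviation; anything unrecognized quietly becomes "UK" rather than raising an error.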

We can also provide an implementation of the TypeSerializer interface returned by the getSerializer() method shown above. The class implementing TypeSerializer is typically most easily written by extending one of the numerous existing implementations of TypeSerializer that Cassandra provides in the org.apache.cassandra.serializers package. In my example, my custom serializer extends AbstractTextSerializer, and the only method I need to add has the signature public void validate(final ByteBuffer bytes) throws MarshalException. Both of my custom classes need to provide a reference to an instance of themselves via static access. Here is the class that implements TypeSerializer via extension of AbstractTextSerializer:

UnitedStatesStateSerializer.java - Implements TypeSerializer
package dustin.examples.cassandra.cqltypes;

import org.apache.cassandra.serializers.AbstractTextSerializer;
import org.apache.cassandra.serializers.MarshalException;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/**
 * Serializer for UnitedStatesState.
 */
public class UnitedStatesStateSerializer extends AbstractTextSerializer
{
   public static final UnitedStatesStateSerializer instance = new UnitedStatesStateSerializer();

   private UnitedStatesStateSerializer()
   {
      super(StandardCharsets.UTF_8);
   }

   /**
    * Validates provided ByteBuffer contents to ensure they can
    * be modeled in the UnitedStatesState Cassandra/CQL data type.
    * This allows for a full state name to be specified or for its
    * two-letter abbreviation to be specified and either is considered
    * valid.
    *
    * @param bytes ByteBuffer whose contents are to be validated.
    * @throws MarshalException Thrown if provided data is invalid.
    */
   @Override
   public void validate(final ByteBuffer bytes) throws MarshalException
   {
      try
      {
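         // Decode only the buffer's remaining bytes (array() would expose the whole backing
         // array), and map spaces to underscores so full names match the State constants.
         final String stringFormat =
            StandardCharsets.UTF_8.decode(bytes.duplicate()).toString().toUpperCase().replace(" ", "_");
         final State state =  stringFormat.length() == 2
                            ? State.fromAbbreviation(stringFormat)
                            : State.valueOf(stringFormat);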
      }
      catch (Exception exception)
      {
         throw new MarshalException("Invalid model cannot be marshaled as UnitedStatesState.");
      }
   }
}
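
One thing worth keeping in mind when turning a ByteBuffer into a String, as a validate method has to do, is that ByteBuffer.array() returns the entire backing array and ignores the buffer's position, limit, and offset (it also throws for read-only or direct buffers). The following JDK-only sketch shows the difference; the class name ByteBufferDecodingSketch and the helper decodeRemaining are hypothetical names of my own, and I have not verified how often Cassandra actually passes sliced buffers to a serializer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical, JDK-only sketch contrasting array() with decoding the remaining bytes. */
public class ByteBufferDecodingSketch
{
   /** Decodes only the bytes between position and limit; the caller's buffer is untouched. */
   public static String decodeRemaining(final ByteBuffer bytes)
   {
      // duplicate() shares the content but has its own position, so decoding
      // does not advance the original buffer.
      return StandardCharsets.UTF_8.decode(bytes.duplicate()).toString();
   }

   public static void main(final String[] arguments)
   {
      final ByteBuffer whole = ByteBuffer.wrap("xxNYxx".getBytes(StandardCharsets.UTF_8));
      // Build a view over just the two middle bytes; it shares the six-byte backing array.
      whole.position(2);
      whole.limit(4);
      final ByteBuffer view = whole.slice();

      System.out.println(new String(view.array(), StandardCharsets.UTF_8)); // xxNYxx
      System.out.println(decodeRemaining(view));                            // NY
   }
}
```

Decoding through a duplicate keeps the validation free of side effects, which matters if the same buffer is read again afterward.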

With the classes for creating a custom CQL data type written, they need to be compiled into .class files and archived in a JAR file. I compiled them with javac -cp "C:\Program Files\DataStax Community\apache-cassandra\lib\*" -sourcepath src -d classes src\dustin\examples\cassandra\cqltypes\*.java and archived the generated .class files into a JAR named CustomCqlTypes.jar with jar cvf CustomCqlTypes.jar *, as shown in the following screen snapshot.

The JAR with the class definitions of the custom CQL type classes needs to be placed in the Cassandra installation's lib directory as demonstrated in the next screen snapshot.

With the JAR containing the custom CQL data type class implementations in the Cassandra installation's lib directory, Cassandra should be restarted so that it will be able to "see" these custom data type definitions.

The next code listing shows a Cassandra Query Language (CQL) statement for creating a table using the new custom type dustin.examples.cassandra.cqltypes.UnitedStatesState.

createAddress.cql
CREATE TABLE us_address
(
   id uuid,
   street1 text,
   street2 text,
   city text,
   state 'dustin.examples.cassandra.cqltypes.UnitedStatesState',
   zipcode text,
   PRIMARY KEY(id)
);

The next screen snapshot demonstrates the results of running the createAddress.cql code above by describing the created table in cqlsh.

The above screen snapshot demonstrates that the custom type dustin.examples.cassandra.cqltypes.UnitedStatesState is the type for the state column of the us_address table.

A new row can be added to the us_address table with a normal INSERT. For example, the following screen snapshot demonstrates inserting an address with the command INSERT INTO us_address (id, street1, street2, city, state, zipcode) VALUES (blobAsUuid(timeuuidAsBlob(now())), '350 Fifth Avenue', '', 'New York', 'New York', '10118');:

Note that while the INSERT statement inserted "New York" for the state, it is stored as "NY".

If I run an INSERT statement in cqlsh that uses an abbreviation to start with (INSERT INTO us_address (id, street1, street2, city, state, zipcode) VALUES (blobAsUuid(timeuuidAsBlob(now())), '350 Fifth Avenue', '', 'New York', 'NY', '10118');), it still works, as the output below shows.

In my example, an invalid state does not prevent an INSERT from occurring, but instead persists the state as "UK" (for unknown) [see the implementation of this in UnitedStatesState.getStateAbbreviationAsByteBuffer(String)].

One of the first advantages that comes to mind for implementing a custom CQL datatype in Java is the ability to employ behavior similar to that provided by check constraints in relational databases. For example, in this post, my sample ensured that any value entered in the state column for a new row was stored as one of the fifty states of the United States, the District of Columbia, or "UK" for unknown. No other values can be stored in that column.
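
My example falls back to "UK" rather than rejecting bad input. If behavior closer to a true check constraint were wanted, the lookup could throw instead. Here is a minimal JDK-only sketch of that stricter variant; the class StrictStateCheckSketch, its method, and the three-entry allowed set are hypothetical names and data of my own, and it uses IllegalArgumentException only because MarshalException is not available without Cassandra on the classpath.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical, JDK-only sketch of a check-constraint-like lookup that rejects bad values. */
public class StrictStateCheckSketch
{
   /** Illustrative subset of allowed abbreviations; a real list would cover every state. */
   private static final Set<String> ALLOWED =
      Collections.unmodifiableSet(new HashSet<>(Arrays.asList("NY", "RI", "UT")));

   /** Returns the uppercase abbreviation, or throws if it is not in the allowed set. */
   public static String requireKnownAbbreviation(final String candidate)
   {
      final String normalized = candidate == null ? "" : candidate.toUpperCase(Locale.ROOT);
      if (!ALLOWED.contains(normalized))
      {
         throw new IllegalArgumentException("Not a recognized state abbreviation: " + candidate);
      }
      return normalized;
   }

   public static void main(final String[] arguments)
   {
      System.out.println(requireKnownAbbreviation("ny"));  // NY
      try
      {
         requireKnownAbbreviation("ZZ");
      }
      catch (IllegalArgumentException rejected)
      {
         System.out.println("Rejected: " + rejected.getMessage());
      }
   }
}
```

Whether to reject or to substitute a sentinel such as "UK" is a design choice: rejecting surfaces bad data at write time, while the sentinel keeps the INSERT from failing.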

Another advantage of the custom data type is the ability to massage the data into a preferred form. In this example, I changed every state name to an uppercase two-letter abbreviation. In other cases, I might want to always store in uppercase or always store in lowercase or map finite sets of strings to numeric values. The custom CQL datatype allows for customized validation and representation of values in the Cassandra database.
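
As a rough illustration of mapping a finite set of strings to numeric values, the following JDK-only sketch converts a label to a numeric code held in a four-byte ByteBuffer and back again. Everything here is hypothetical (the PriorityCodeSketch class, the Priority labels, and the codes 1 through 3); it only shows the kind of conversion a custom type's fromString could perform and is not tied to any Cassandra API.

```java
import java.nio.ByteBuffer;
import java.util.Locale;

/** Hypothetical, JDK-only sketch of mapping a finite set of strings to numeric codes. */
public class PriorityCodeSketch
{
   /** Illustrative finite set of labels, each with a numeric code. */
   enum Priority
   {
      LOW(1), MEDIUM(2), HIGH(3);

      private final int code;

      Priority(final int newCode)
      {
         this.code = newCode;
      }
   }

   /** Normalizes the label's case and stores its numeric code in a four-byte buffer. */
   public static ByteBuffer encode(final String label)
   {
      final Priority priority = Priority.valueOf(label.toUpperCase(Locale.ROOT));
      final ByteBuffer bytes = ByteBuffer.allocate(4);
      bytes.putInt(0, priority.code);  // absolute put, so the position stays at 0
      return bytes;
   }

   /** Reads the code without moving the buffer's position and returns the matching label. */
   public static String decode(final ByteBuffer bytes)
   {
      final int code = bytes.getInt(bytes.position());
      for (final Priority priority : Priority.values())
      {
         if (priority.code == code)
         {
            return priority.name();
         }
      }
      throw new IllegalArgumentException("No label for code " + code);
   }

   public static void main(final String[] arguments)
   {
      final ByteBuffer stored = encode("high");
      System.out.println(stored.getInt(0));  // 3
      System.out.println(decode(stored));    // HIGH
   }
}
```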

Conclusion

This post has been an introductory look at implementing custom CQL datatypes in Cassandra. As I play with this concept more and try different things out, I hope to write another blog post on some more subtle observations that I make. As this post shows, it is fairly easy to write and use a custom CQL datatype, especially for Java developers.

1 comment:

HinD said...

Hi, I tried to implement this, but I am getting an error:
<< ErrorMessage code=2000 [Syntax error in CQL query] message = "Error setting type cqltypes.UnitedStatesState: Unable to find abstract-type class 'cqltypes.UnitedStatesState' " >>

The only difference in our source code is the package name: my package name is cqltypes.