
# HBase Spark Demo #19

Open · 4 of 10 tasks
Jimvin opened this issue Aug 30, 2022 · 6 comments
Jimvin commented Aug 30, 2022

Loading data into HBase is not trivial. We want the demo to show how this can be done and to provide some guidance and best practices.

## Aims

## Tasks

  • Load data into HDFS from S3
  • Parse CSV and create HFiles
  • Load incremental HFiles into HBase
  • Load a streaming data source into HBase
  • Stackable cluster configuration
  • Verify the data is there (sanity check) using HBase shell
  • Create Phoenix view over table
  • Configure Phoenix as a data source in Superset
  • Create a visualisation using Phoenix JDBC and Superset
  • Query HBase using Spark HBase connector
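
For the sanity-check and Phoenix tasks above, the Phoenix side could look roughly like this (a sketch only: the table name `cycling-tripdata` appears later in this thread, while the column family and qualifiers are hypothetical and must match whatever the loader actually wrote):

```sql
-- Read-only Phoenix view over the existing HBase table.
-- Quoted identifiers preserve the lowercase/hyphenated HBase names.
CREATE VIEW "cycling-tripdata" (
    "ROW" VARCHAR PRIMARY KEY,        -- maps to the HBase row key
    "cf"."started_at" VARCHAR,        -- hypothetical column family "cf"
    "cf"."ended_at"   VARCHAR
);

-- Sanity check: the row count should match what was bulk-loaded.
SELECT COUNT(*) FROM "cycling-tripdata";
```

The equivalent quick check in the HBase shell would be a `count` or a `scan` with a small `LIMIT` on the same table.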

## Learning Points and Challenges

  • Where do DistCp and the HBase bulk load run, given there is no YARN cluster?
  • Are these jobs scalable?
  • Can we build near-real-time dashboards in Grafana and see instant updates?
  • Stress testing
  • Test HBase region management: can we watch this in real time as part of a demo?
snocke commented Sep 2, 2022

Choose your Java version first. As of October 2022 it only compiles and tests successfully with Java 8. However, we depend on Java 11 in our images.

```shell
mvn -Dspark.version=3.3.0 -Dscala.version=2.12.14 -Dhadoop-three.version=3.3.2 -Dscala.binary.version=2.12 -Dhbase.version=2.4.12 -DrecompileMode=all clean package
```

The .jar files can be found in Nexus.

snocke commented Sep 2, 2022

This article shows how to access HBase using the Spark shell: https://kontext.tech/article/628/spark-connect-to-hbase
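
For reference, the DataSource-based read from that article, adapted to this demo's table, would look roughly like this in the Spark shell (a sketch: the column mapping is hypothetical, and the connector jar plus hbase-site.xml must be on the driver/executor classpath):

```scala
// spark-shell --jars hbase-spark-<version>.jar (plus HBase client jars),
// with hbase-site.xml on the classpath so the connector finds ZooKeeper.

val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "cycling-tripdata")
  // Map the row key and a hypothetical column family/qualifier to fields.
  .option("hbase.columns.mapping",
    "id STRING :key, started_at STRING cf:started_at")
  .option("hbase.spark.use.hbasecontext", false)
  .load()

df.show(5)
```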

snocke commented Sep 2, 2022

Hi @Jimvin,
in case you want to continue the hbase-spark-connector test during my holiday, you will find the current status on branch 87 in stackablectl.

snocke commented Sep 28, 2022

After updating the hbase-connectors repo, my Maven build fails:

```
[INFO] --- gmaven-plugin:1.5:execute (default) @ hbase-spark ---
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache HBase - Spark 1.0.1-SNAPSHOT:
[INFO]
[INFO] Apache HBase - Spark ............................... SUCCESS [  3.120 s]
[INFO] Apache HBase - Spark Protocol ...................... SUCCESS [  3.778 s]
[INFO] Apache HBase - Spark Protocol (Shaded) ............. SUCCESS [  1.922 s]
[INFO] Apache HBase - Spark Connector ..................... FAILURE [  4.405 s]
[INFO] Apache HBase - Spark Integration Tests ............. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  13.905 s
[INFO] Finished at: 2022-09-27T22:34:54+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.gmaven:gmaven-plugin:1.5:execute (default) on project hbase-spark: Execution default of goal org.codehaus.gmaven:gmaven-plugin:1.5:execute failed: An API incompatibility was encountered while executing org.codehaus.gmaven:gmaven-plugin:1.5:execute: java.lang.ExceptionInInitializerError: null
[ERROR] -----------------------------------------------------
[ERROR] realm =    plugin>org.codehaus.gmaven:gmaven-plugin:1.5
[ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy
[ERROR] urls[0] = file:/Users/Simon/.m2/repository/org/codehaus/gmaven/gmaven-plugin/1.5/gmaven-plugin-1.5.jar
[ERROR] urls[1] = file:/Users/Simon/.m2/repository/org/codehaus/gmaven/runtime/gmaven-runtime-api/1.5/gmaven-runtime-api-1.5.jar
[ERROR] urls[2] = file:/Users/Simon/.m2/repository/org/codehaus/gmaven/feature/gmaven-feature-api/1.5/gmaven-feature-api-1.5.jar
[ERROR] urls[3] = file:/Users/Simon/.m2/repository/org/codehaus/gmaven/runtime/gmaven-runtime-loader/1.5/gmaven-runtime-loader-1.5.jar
[ERROR] urls[4] = file:/Users/Simon/.m2/repository/org/codehaus/gmaven/feature/gmaven-feature-support/1.5/gmaven-feature-support-1.5.jar
[ERROR] urls[5] = file:/Users/Simon/.m2/repository/org/codehaus/gmaven/runtime/gmaven-runtime-support/1.5/gmaven-runtime-support-1.5.jar
[ERROR] urls[6] = file:/Users/Simon/.m2/repository/org/sonatype/gshell/gshell-io/2.4/gshell-io-2.4.jar
[ERROR] urls[7] = file:/Users/Simon/.m2/repository/org/codehaus/plexus/plexus-utils/3.0/plexus-utils-3.0.jar
[ERROR] urls[8] = file:/Users/Simon/.m2/repository/com/thoughtworks/qdox/qdox/1.12/qdox-1.12.jar
[ERROR] urls[9] = file:/Users/Simon/.m2/repository/org/apache/maven/shared/file-management/1.2.1/file-management-1.2.1.jar
[ERROR] urls[10] = file:/Users/Simon/.m2/repository/org/apache/maven/shared/maven-shared-io/1.1/maven-shared-io-1.1.jar
[ERROR] urls[11] = file:/Users/Simon/.m2/repository/org/apache/xbean/xbean-reflect/3.4/xbean-reflect-3.4.jar
[ERROR] urls[12] = file:/Users/Simon/.m2/repository/log4j/log4j/1.2.12/log4j-1.2.12.jar
[ERROR] urls[13] = file:/Users/Simon/.m2/repository/commons-logging/commons-logging-api/1.1/commons-logging-api-1.1.jar
[ERROR] urls[14] = file:/Users/Simon/.m2/repository/com/google/collections/google-collections/1.0/google-collections-1.0.jar
[ERROR] urls[15] = file:/Users/Simon/.m2/repository/org/apache/maven/reporting/maven-reporting-impl/2.0.4.1/maven-reporting-impl-2.0.4.1.jar
[ERROR] urls[16] = file:/Users/Simon/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.1/plexus-interpolation-1.1.jar
[ERROR] urls[17] = file:/Users/Simon/.m2/repository/commons-validator/commons-validator/1.2.0/commons-validator-1.2.0.jar
[ERROR] urls[18] = file:/Users/Simon/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar
[ERROR] urls[19] = file:/Users/Simon/.m2/repository/commons-digester/commons-digester/1.6/commons-digester-1.6.jar
[ERROR] urls[20] = file:/Users/Simon/.m2/repository/commons-logging/commons-logging/1.0.4/commons-logging-1.0.4.jar
[ERROR] urls[21] = file:/Users/Simon/.m2/repository/oro/oro/2.0.8/oro-2.0.8.jar
[ERROR] urls[22] = file:/Users/Simon/.m2/repository/xml-apis/xml-apis/1.0.b2/xml-apis-1.0.b2.jar
[ERROR] urls[23] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-core/1.0-alpha-10/doxia-core-1.0-alpha-10.jar
[ERROR] urls[24] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.0-alpha-10/doxia-sink-api-1.0-alpha-10.jar
[ERROR] urls[25] = file:/Users/Simon/.m2/repository/org/apache/maven/reporting/maven-reporting-api/2.0.4/maven-reporting-api-2.0.4.jar
[ERROR] urls[26] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-site-renderer/1.0-alpha-10/doxia-site-renderer-1.0-alpha-10.jar
[ERROR] urls[27] = file:/Users/Simon/.m2/repository/org/codehaus/plexus/plexus-i18n/1.0-beta-7/plexus-i18n-1.0-beta-7.jar
[ERROR] urls[28] = file:/Users/Simon/.m2/repository/org/codehaus/plexus/plexus-velocity/1.1.7/plexus-velocity-1.1.7.jar
[ERROR] urls[29] = file:/Users/Simon/.m2/repository/org/apache/velocity/velocity/1.5/velocity-1.5.jar
[ERROR] urls[30] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-decoration-model/1.0-alpha-10/doxia-decoration-model-1.0-alpha-10.jar
[ERROR] urls[31] = file:/Users/Simon/.m2/repository/commons-collections/commons-collections/3.2/commons-collections-3.2.jar
[ERROR] urls[32] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-module-apt/1.0-alpha-10/doxia-module-apt-1.0-alpha-10.jar
[ERROR] urls[33] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-module-fml/1.0-alpha-10/doxia-module-fml-1.0-alpha-10.jar
[ERROR] urls[34] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-module-xdoc/1.0-alpha-10/doxia-module-xdoc-1.0-alpha-10.jar
[ERROR] urls[35] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-module-xhtml/1.0-alpha-10/doxia-module-xhtml-1.0-alpha-10.jar
[ERROR] urls[36] = file:/Users/Simon/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar
[ERROR] urls[37] = file:/Users/Simon/.m2/repository/org/sonatype/gossip/gossip/1.2/gossip-1.2.jar
[ERROR] Number of foreign imports: 1
[ERROR] import: Entry[import  from realm ClassRealm[project>org.apache.hbase.connectors:spark:1.0.1-SNAPSHOT, parent: ClassRealm[maven.api, parent: null]]]
```

snocke commented Sep 28, 2022

When executing the spark-k8s application I'm currently getting an error. It looks like an architecture mismatch (ARM vs. x86):

```
++ id -u
+ myuid=1000
++ id -g
+ mygid=0
+ set +e
++ getent passwd 1000
+ uidentry=stackable:x:1000:1000::/stackable:/bin/bash
+ set -e
+ '[' -z stackable:x:1000:1000::/stackable:/bin/bash ']'
+ '[' -z /usr/lib/jvm/jre-11 ']'
+ SPARK_CLASSPATH=':/stackable/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/stackable/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /stackable/spark/bin/spark-submit --conf spark.driver.bindAddress=10.244.1.38 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class tech.stackable.demo.spark local:////Users/Simon/Repo/stackable/stackablectl/demos/hbase-hdfs-load-cycling-data/sparkHbaseAccess/target/sparkHbaseAccess-1.0-SNAPSHOT.jar --hbaseSite /arguments/hbase-site.xml --tableName cycling-tripdata
qemu-x86_64: Could not open '/lib64/ld-linux-x86-64.so.2': No such file or directory
```

snocke commented Oct 17, 2022

This ticket is on hold.
We need a strategy to get the hbase-spark-connector working with Java 11.
The current status is saved on this branch.

  • Build on top of the hbase-hdfs-cycling-demo
  • A job copying a .jar from our Nexus to S3 (MinIO)
  • Creating Secrets for access
  • A Java Spark application that simply scans an HBase table
  • Mounting the HBase config into Spark
  • TODO: Build with Java 11 https://github.com/apache/hbase-connectors/tree/master/spark
  • TODO: Publish hbase-spark-connector.jar to our Nexus and add it as a dependency to the pom of the Java project
  • TODO: Test whether hbase-spark-connector.jar needs to be distributed to the HBase region servers or whether that is not needed (see the configuration in the repo)
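
For the pom TODO above, the dependency would presumably look like this (the coordinates are taken from the reactor summary in the build log above; the exact version to publish to our Nexus is still open):

```xml
<!-- hbase-spark connector, once published to our Nexus -->
<dependency>
  <groupId>org.apache.hbase.connectors.spark</groupId>
  <artifactId>hbase-spark</artifactId>
  <version>1.0.1-SNAPSHOT</version>
</dependency>
```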

@snocke snocke changed the title HBase Demo HBase Spark Demo Oct 17, 2022
@lfrancke lfrancke transferred this issue from stackabletech/stackablectl Feb 14, 2024