Spark Redshift with Python
I'm trying to connect Spark to Amazon Redshift, but I'm getting an error.
My code is as follows:
from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)

df = sql_context.read \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("tempdir", "bucket") \
    .load()
Here is a step-by-step process for connecting Spark to Redshift.
- Download the Redshift JDBC connector jar. Try the below command:
wget "https://s3.amazonaws.com/redshift-downloads/drivers/redshiftjdbc4-1.2.1.1001.jar"
- Save the below code in a Python file (the .py file you want to run) and replace the credentials accordingly (a verification sketch follows the steps):
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, HiveContext

# Initialize Spark session
spark = SparkSession.builder.master("yarn").appName("Connect to Redshift").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlContext = HiveContext(sc)

sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSKEYID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEYSECRET>")

taxonomyDf = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:postgresql://url.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3://mybucket/") \
    .load()
- Run spark-submit as below:
spark-submit --packages com.databricks:spark-redshift_2.10:0.5.0 --jars redshiftjdbc4-1.2.1.1001.jar test.py
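Once the job runs, it can help to sanity-check the read and, if needed, write results back to Redshift through the same connector. The sketch below is an assumption-laden example, not part of the original answer: taxonomyDf comes from the code above, while output_table and the s3://mybucket/ tempdir are hypothetical placeholders you would replace with your own table and bucket.

# Sanity check: print the schema and a few rows of the DataFrame read above
taxonomyDf.printSchema()
taxonomyDf.show(5)

# Write a result back to Redshift via the same spark-redshift connector.
# "output_table" is a hypothetical placeholder, not a table from the answer above.
taxonomyDf.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:postgresql://url.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx") \
    .option("dbtable", "output_table") \
    .option("tempdir", "s3://mybucket/") \
    .mode("error") \
    .save()

Also note that the spark-redshift artifact passed to --packages should match the Scala version your Spark build was compiled against (for example spark-redshift_2.10 for Scala 2.10 builds, spark-redshift_2.11 for Scala 2.11 builds).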