Connecting Apache Solr with Tableau
Over the last few days we put up a demo that exposes a sample dataset (indexed on Apache Solr) and imports it into Tableau.
Overview
The objective is to set up a pipeline: upload data to Solr, serve it via Gramex ProxyHandler (with caching) and import it into Tableau.
Apache Solr
Using Solr after having tried lunr (dev friendly) and ElasticSearch/Kibana (more UI friendly) feels archaic.
Indexing and re-indexing
Uploading a file as a document and having it indexed in Solr worked OK after two tries: a) first, uploading the file crashed the workflow; b) after asking around, I picked the file type and copy-pasted the content instead. The real trouble is with updating an existing document; read on to know more.
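For repeat loads, posting the documents to Solr's update handler avoids the UI workflow entirely. A minimal sketch, assuming a local Solr install; the core name, host and port below are placeholders, not the ones from our setup:

```python
import requests

# Hypothetical core name and host; change to match your Solr install
SOLR_UPDATE = "http://localhost:8983/solr/displacements/update?commit=true"

# Documents shaped like the sample dataset used in this post
docs = [
    {
        "ISO3": "CHN",
        "Name": "China",
        "Year": 2014,
        "Start_Date": "2014-05-08T00:00:00Z",
        "Event_Name": "100 year storm",
        "Hazard_Category": "Weather related",
        "Hazard_Type": "Storm",
        "New_Displacements": 447000,
    }
]

# Solr's update handler accepts a JSON array of documents
resp = requests.post(SOLR_UPDATE, json=docs)
resp.raise_for_status()
print(resp.json())
```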
multiValued by default
For the uploaded document in a core, Solr marks every field as multiValued by default, which creates a JSON item as below:
{
"ISO3":["CHN"],
"Name":["China"],
"Year":2014,
"id":["9de7a2fa-75bf-4f72-8a4c-460e56b96ad4"],
"Start_Date":["2014-05-08T00:00:00Z"],
"Event_Name":["100 year storm"],
"Hazard_Category":["Weather related"],
"Hazard_Type":["Storm"],
"New_Displacements":447000,
"_version_":1672372223938134016
},
Note the values are wrapped in arrays (each attribute can hold multiple values). This is perfectly fine for lookups within Solr, where each attribute supports multiple values. Converting this object to a Pandas dataframe results in an object as below:
ISO3 | Name | Year | (more cols) |
---|---|---|---|
[“CHN”] | [“China”] | 2014 | (more vals) |
Writing this object to a Hyper extract would fail due to the data structure within each dataframe cell. It took us a few moments of debugging to realize the issue. Since the data source is Solr, it was tricky for us to figure out a fix (both of us were working with Solr for the first time).
Turns out, we had to update the document's schema.
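A hypothetical alternative would be to unwrap the single-element lists on the pandas side instead of touching the schema; a rough sketch (assuming docs is the response.docs list from Solr's select handler, shaped as above):

```python
import pandas as pd

# docs is the "response" -> "docs" list from Solr's select handler
df = pd.DataFrame(docs)

# multiValued fields arrive as single-element lists, e.g. ["CHN"];
# unwrap them so each cell holds a plain scalar before writing the extract
df = df.applymap(lambda v: v[0] if isinstance(v, list) and len(v) == 1 else v)
```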
Updating schema
I updated the uploaded document's managed-schema, schema.xml and solrconfig.xml (see below) using the most helpful StackOverflow thread.
- in the managed-schema file, find all instances of multiValued="true" and replace with multiValued="false"
- copy managed-schema to schema.xml and add <schemaFactory class="ClassicIndexSchemaFactory"/> to schema.xml and solrconfig.xml (Prashant, see below, confirms this step isn't needed as it'll create a new schema via the dataloadhandler)

With that, our target JSON item is as below:
{
"ISO3":"CHN",
"Name":"China",
"Year":2014,
"id":"9de7a2fa-75bf-4f72-8a4c-460e56b96ad4",
"Start_Date":"2014-05-08T00:00:00Z",
"Event_Name":"100 year storm",
"Hazard_Category":"Weather related",
"Hazard_Type":"Storm",
"New_Displacements":447000,
"_version_":1672372223938134016
},
This results in a Pandas dataframe object as below:
ISO3 | Name | Year | (more cols) |
---|---|---|---|
“CHN” | “China” | 2014 | (more values) |
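As an aside, Solr's Schema API can flip the multiValued flag per field over HTTP instead of hand-editing the files; a rough sketch (the core name, field list and field type below are assumptions based on the sample document, and a re-index is still needed afterwards):

```python
import requests

# Hypothetical core name; the replace-field command takes a full field definition,
# so the type here must match whatever the single-valued field should actually be
SOLR_SCHEMA = "http://localhost:8983/solr/displacements/schema"

for name in ["ISO3", "Name", "id", "Event_Name", "Hazard_Category", "Hazard_Type"]:
    payload = {"replace-field": {"name": name, "type": "string", "multiValued": False}}
    requests.post(SOLR_SCHEMA, json=payload).raise_for_status()
```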
ProxyHandler in Gramex
Gramex's ProxyHandler is straightforward: give it an API endpoint with any attributes and it fetches the data. We used Solr's document query as the input here. Since Gramex supports caching out of the box, I configured that too.
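For reference, a minimal sketch of pulling the proxied Solr response into the dataframe the Hyper code below uses; the ProxyHandler path and port are placeholders, and the handler is assumed to forward Solr's select response unchanged:

```python
import pandas as pd
import requests

# Hypothetical Gramex ProxyHandler endpoint that forwards Solr's /select handler
PROXY_URL = "http://localhost:9988/solr-data"

resp = requests.get(PROXY_URL, params={"q": "*:*", "rows": 1000, "wt": "json"})
resp.raise_for_status()

# Solr's select response nests the documents under response.docs
docs = resp.json()["response"]["docs"]
df = pd.DataFrame(docs)
```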
Connecting data to Tableau
We tried two approaches to integrate the ProxyHandler endpoint with Tableau:
- via WDC, web data connector and
- using Hyper extract
WDC approach
Web Data Connector in Tableau needs an HTML file that makes a connection to the remote endpoint. We created a custom connector using JavaScript to fetch the remote data source.
Once ready, Tableau should recognize the data table within the WDC and show the data rows. However, that didn't happen. Tableau showed one column (id) and left the rest of the columns empty. It was perplexing, with no error messages. After spending some time on it we tried the next approach. That said, we're still looking for ways to handle this, as the WDC route keeps the benefit of caching via ProxyHandler.
Hyper approach
Tableau supports importing data using its Hyper API. The default example is useful for creating a Hyper extract.
There are a few gotchas:
- it expects us to create the data schema upfront.
- it's not a live connection: any time the data gets updated, the Hyper file needs to be regenerated.
- ensure you define the Hyper schema in the same order of columns as in the dataframe.
- Hyper API needs pip >= 19.3 for installation, so create a separate conda environment to ensure dependencies don't clash.
This approach worked as expected; we were able to fetch the data into Tableau.
Import Tableau Hyper dependencies
```python
from tableauhyperapi import HyperProcess, Connection, TableDefinition, SqlType, Telemetry, Inserter, CreateMode, TableName
from tableauhyperapi import escape_string_literal

PATH_TO_HYPER = "my_hyper_file.hyper"
```
Create Hyper file
```python
# Step 1: Start a new private local Hyper instance
with HyperProcess(Telemetry.SEND_USAGE_DATA_TO_TABLEAU, 'myapp') as hyper:
    print("The HyperProcess has started.")
    # Step 2: Create the .hyper file, replace it if it already exists
    with Connection(hyper.endpoint,
                    PATH_TO_HYPER,
                    create_mode=CreateMode.CREATE_AND_REPLACE) as connection:
        print("The connection to the .hyper file is open.")
        # Step 3: Create the schema
        connection.catalog.create_schema('Extract')
        # Step 4: Create the table definition
        # ensure columns are added in the same order as the dataframe
        schema = TableDefinition(table_name=TableName('Extract', 'Extract'),
                                 columns=[
                                     TableDefinition.Column('ISO3', SqlType.text()),
                                     TableDefinition.Column('Name', SqlType.text()),
                                     TableDefinition.Column('Year', SqlType.int()),
                                     TableDefinition.Column('id', SqlType.text()),
                                     TableDefinition.Column('Start_Date', SqlType.text()),
                                     TableDefinition.Column('Event_Name', SqlType.text()),
                                     TableDefinition.Column('Hazard_Category', SqlType.text()),
                                     TableDefinition.Column('Hazard_Type', SqlType.text()),
                                     TableDefinition.Column('New_Displacements', SqlType.int()),
                                     TableDefinition.Column('_version_', SqlType.big_int())
                                 ])
        print("The table is defined.")
        # Step 5: Create the table in the connection catalog
        connection.catalog.create_table(schema)
        # Step 6: Insert the dataframe rows into the table
        with Inserter(connection, schema) as inserter:
            # df is a dataframe of rows from the ProxyHandler endpoint (JSON)
            for index, row in df.iterrows():
                inserter.add_row(list(row))
            inserter.execute()
        print("The data was added to the table.")
    print("The connection to the Hyper file is closed.")
print("The HyperProcess has shut down.")
```
Conclusion
I added this section after a note from Pratap.
To clarify: since the WDC isn't currently being used (we use Hyper extracts instead, due to the integration bug above), adding a layer of Gramex ProxyHandler is unnecessary, as a Hyper extract isn't a live connection.
Notes
I worked with Prashant PK [LinkedIn], a colleague at Gramener, on this. He took care of the WDC, Hyper configurations and Tableau. I created the Solr and Gramex ProxyHandler endpoints and assisted with debugging. Sagar [LinkedIn], our infrastructure specialist, helped install Solr on the server.