Bigquery Client For Mac

Wednesday, January 22 admin

BigQuery schema generator from JSON or CSV data

Project description

This script generates the BigQuery schema from the newline-delimited data records on the STDIN. The records can be in JSON format or CSV format. The BigQuery data importer (bq load) uses only the first 100 lines when the schema auto-detection feature is enabled. In contrast, this script uses all data records to generate the schema.

Usage:

Version: 0.5.1 (2019-06-19)

Background

Data can be imported into BigQuery using the bq command line tool. It accepts a number of data formats including CSV or newline-delimited JSON. The data can be loaded into an existing table or a new table can be created during the loading process. The structure of the table is defined by its schema. The table’s schema can be defined manually or the schema can be auto-detected.

When the auto-detect feature is used, the BigQuery data importer examines only the first 100 records of the input data. In many cases, this is sufficient because the data records were dumped from another database and the exact schema of the source table was known. However, for data extracted from a service (e.g. using a REST API) the record fields could have been organically added at later dates. In this case, the first 100 records do not contain fields which are present in later records. The bq load auto-detection fails and the data fails to load.
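The failure mode described above can be illustrated with a small, self-contained sketch. It is not part of generate-schema: the function name fields_missed_by_sampling and the sample records are made up for illustration, and only the 100-record sampling limit comes from the text.

```python
import json


def fields_missed_by_sampling(lines, sample_size=100):
    """Return top-level field names that first appear after `sample_size` records."""
    sampled = set()
    all_fields = set()
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        keys = json.loads(line).keys()
        all_fields.update(keys)
        if count < sample_size:
            sampled.update(keys)
        count += 1
    return sorted(all_fields - sampled)


# 150 hypothetical records; the "email" field is first seen in record 120.
records = [json.dumps({"id": i}) for i in range(150)]
records[120] = json.dumps({"id": 120, "email": "a@example.com"})

print(fields_missed_by_sampling(records))
```

Any field name this reports is one that a schema built from only the first 100 records would not contain.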

The bq load tool does not support the ability to process the entire dataset to determine a more accurate schema. This script fills in that gap. It processes the entire dataset given in the STDIN and outputs the BigQuery schema in JSON format on the STDOUT. This schema file can be fed back into the bq load tool to create a table that is more compatible with the data fields in the input dataset.

Installation

Install from the PyPI repository using pip3. If you want to install the package for your entire system globally, use

If you are using a virtual environment (such as venv), then you don’t need the sudo command, and you can just type:

A successful install should print out something like the following (the version number may be different):

The shell script generate-schema is installed in the same directory as pip3.

Ubuntu Linux

Under Ubuntu Linux, you should find the generate-schema script at /usr/local/bin/generate-schema.

MacOS

If you installed Python from Python Releases for Mac OSX, then /usr/local/bin/pip3 is a symlink to /Library/Frameworks/Python.framework/Versions/3.6/bin/pip3. So generate-schema is installed at /Library/Frameworks/Python.framework/Versions/3.6/bin/generate-schema.

The Python installer updates $HOME/.bash_profile to add /Library/Frameworks/Python.framework/Versions/3.6/bin to the $PATH environment variable. So you should be able to run the generate-schema command without typing in the full path.

Usage

The generate_schema.py script accepts a newline-delimited JSON or CSV data file on the STDIN. JSON input format has been tested extensively. CSV input format was added more recently (in v0.4) using the --input_format csv flag. The CSV support is not as robust as the JSON support. For example, CSV format supports only the comma separator, and does not support the pipe (|) or tab (\t) character.

Unlike bq load, the generate_schema.py script reads every record in the input data file to deduce the table’s schema. It prints the JSON formatted schema file on the STDOUT.

There are at least 3 ways to run this script:

1) Shell script

If you installed using pip3, then it should have installed a small helper script named generate-schema in the local ./bin directory of your current environment (depending on whether you are using a virtual environment).

2) Python module

You can invoke the module directly using:

This is essentially what the generate-schema command does.

3) Python script

If you retrieved this code from its GitHub repository, then you can invoke the Python script directly:

Using the Schema Output

The resulting schema file can be given to the bq load command using the --schema flag:

where mydataset.mytable is the target table in BigQuery.

For debugging purposes, here is the equivalent bq load command using schema autodetection:

If the input file is in CSV format, the first line will be the header line which enumerates the names of the columns. But this header line must be skipped when importing the file into the BigQuery table. We accomplish this using the --skip_leading_rows flag:

Here is the equivalent bq load command for CSV files using autodetection:

A useful flag for bq load, particularly for JSON files, is --ignore_unknown_values, which causes bq load to ignore fields in the input data which are not defined in the schema. When generate_schema.py detects an inconsistency in the definition of a particular field in the input data, it removes the field from the schema definition. Without the --ignore_unknown_values flag, bq load fails when the inconsistent data record is read.

Another useful flag during development and debugging is --replace, which replaces any existing BigQuery table.
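To see how the flags in this section combine, here is a minimal sketch that assembles a bq load command line as an argument list. The helper bq_load_command is hypothetical and does not ship with this project. The --schema, --skip_leading_rows, --ignore_unknown_values and --replace flags come from the text above; --source_format and --autodetect are the usual bq load flags for choosing the input format and enabling autodetection, and the sketch assumes one header row for CSV files. It only builds the command and does not run it.

```python
import shlex


def bq_load_command(table, data_file, schema_file=None, csv=False,
                    ignore_unknown_values=False, replace=False):
    """Assemble a bq load command line as a list of arguments (hypothetical helper)."""
    args = ["bq", "load"]
    if replace:
        args.append("--replace")
    if ignore_unknown_values:
        args.append("--ignore_unknown_values")
    if csv:
        # The CSV header line must not be loaded as data.
        args += ["--source_format", "CSV", "--skip_leading_rows", "1"]
    else:
        args += ["--source_format", "NEWLINE_DELIMITED_JSON"]
    if schema_file:
        args += ["--schema", schema_file]
    else:
        # No schema file given: fall back to schema autodetection.
        args.append("--autodetect")
    args += [table, data_file]
    return args


command = bq_load_command("mydataset.mytable", "file.data.json",
                          schema_file="file.schema.json",
                          ignore_unknown_values=True)
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list can be passed to subprocess.run() once the bq tool is installed and authenticated.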

After the BigQuery table is loaded, the schema can be retrieved using:

(The python -m json.tool command will pretty-print the JSON formatted schema file. An alternative is the jq command.) The resulting schema file should be identical to file.schema.json.

Flag Options

The generate_schema.py script supports a handful of command line flags as shown by the --help flag below.

Input Format (`--input_format`)

Specifies the format of the input file, either json (default) or csv.

If a csv file is specified, the --keep_nulls flag is automatically activated. This is required because CSV columns are defined positionally, so the schema file must contain all the columns specified by the CSV file, in the same order, even if the column contains an empty value for every record.

See Issue #26 for implementation details.

Keep Nulls (`--keep_nulls`)

Normally when the input data file contains a field which has a null, empty array or empty record as its value, the field is suppressed in the schema file. This flag enables this field to be included in the schema file.

In other words, using a data file containing just nulls and empty values:

With the keep_nulls flag, we get:
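The suppression rule for this flag can be sketched in a few lines. This is an illustration only, not the script's implementation or its output: the helpers is_empty and visible_fields are hypothetical, and the sketch only decides which field names would reach the schema, not which type they are given.

```python
def is_empty(value):
    """True for null, an empty array, or an empty record."""
    return value is None or value == [] or value == {}


def visible_fields(record, keep_nulls=False):
    """Return the field names of one record that would reach the schema."""
    return sorted(
        name for name, value in record.items()
        if keep_nulls or not is_empty(value)
    )


record = {"s": None, "a": [], "m": {}, "n": 1}
print(visible_fields(record))                   # ['n']
print(visible_fields(record, keep_nulls=True))  # ['a', 'm', 'n', 's']
```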

Quoted Values Are Strings (`--quoted_values_are_strings`)

By default, quoted values are inspected to determine if they can be interpreted as DATE, TIME, TIMESTAMP, BOOLEAN, INTEGER or FLOAT. This is consistent with the algorithm used by bq load. However, for the BOOLEAN, INTEGER, or FLOAT types, it is sometimes more useful to interpret those as normal strings instead. This flag disables type inference for BOOLEAN, INTEGER and FLOAT types inside quoted strings.
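As a rough sketch of what this flag changes, the function below classifies a quoted value with and without the flag. The regular expressions and the name infer_quoted_type are assumptions made for illustration and may differ from the patterns the script uses. DATE, TIME and TIMESTAMP detection, which the flag does not affect, is left out.

```python
import re

# Simplified patterns; the real script may accept a different set of spellings.
INTEGER = re.compile(r"[-+]?\d+")
FLOAT = re.compile(r"[-+]?(\d+\.\d*|\.\d+|\d+)([eE][-+]?\d+)?")
BOOLEANS = {"true", "false"}


def infer_quoted_type(text, quoted_values_are_strings=False):
    """Return the type name inferred for a value that appeared inside quotes."""
    if not quoted_values_are_strings:
        if text.lower() in BOOLEANS:
            return "BOOLEAN"
        if INTEGER.fullmatch(text):
            return "INTEGER"
        if FLOAT.fullmatch(text):
            return "FLOAT"
    return "STRING"


for value in ["True", "1", "2.1", "hello"]:
    print(value, infer_quoted_type(value),
          infer_quoted_type(value, quoted_values_are_strings=True))
```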

Infer Mode (`--infer_mode`)

Set the schema mode of a field to REQUIRED instead of the default NULLABLE if the field contains a non-null or non-empty value for every data record in the input file. This option is available only for CSV (--input_format csv) files. It is theoretically possible to implement this feature for JSON files, but too difficult to implement in practice because fields are often completely missing from a given JSON record (instead of explicitly being defined to be null).

See Issue #28 for implementation details.
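The REQUIRED rule can be sketched with the standard csv module. This is a simplified illustration under the assumption that an empty string is the only form of "empty" in a CSV cell; the function name infer_modes and the sample data are made up.

```python
import csv
import io


def infer_modes(csv_text):
    """Map each CSV column to REQUIRED if every record has a value, else NULLABLE."""
    reader = csv.DictReader(io.StringIO(csv_text))
    modes = {name: "REQUIRED" for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            if not row[name]:  # empty string, or missing cell in a short row
                modes[name] = "NULLABLE"
    return modes


data = "name,surname,age\nJohn,Smith,23\nMichael,,25\n"
print(infer_modes(data))
```

Because every CSV record has a cell for every column, a single pass is enough. With JSON input a field can simply be absent from a record, which is the difficulty the text describes.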

Debugging Interval (`--debugging_interval`)

By default, the generate_schema.py script prints a short progress message every 1000 lines of input data. This interval can be changed using the --debugging_interval flag.

Debugging Map (`--debugging_map`)

Instead of printing out the BigQuery schema, the --debugging_map flag prints out the bookkeeping metadata map which is used internally to keep track of the various fields and their types that were inferred using the data file. This flag is intended to be used for debugging.

Sanitize Names (`--sanitize_names`)

BigQuery column names are restricted to certain characters and length. With this flag, column names are sanitized so that any character outside of ASCII letters, numbers and underscore ([a-zA-Z0-9_]) is converted to an underscore. (For example “go&2#there!” is converted to “go_2_there_”.) Names longer than 128 characters are truncated to 128.
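The rule above is small enough to express directly. The following is a minimal sketch of it, not the script's own code; the function name sanitize_name is an assumption.

```python
import re


def sanitize_name(name, max_length=128):
    """Replace characters outside [a-zA-Z0-9_] with '_' and truncate to max_length."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)[:max_length]


print(sanitize_name("go&2#there!"))  # go_2_there_
```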

Schema Types

Supported Types

The bq show --schema command produces a JSON schema file that uses the older Legacy SQL data types. For compatibility, the generate-schema script will also generate a schema file using the legacy data types.

The supported types are:

BOOLEAN
INTEGER
FLOAT
STRING
TIMESTAMP
DATE
TIME
RECORD

The generate-schema script supports both NULLABLE and REPEATED modes of all of the above types.

The supported format of TIMESTAMP is as close as practical to the bq load format:

which appears to be an extension of the ISO 8601 format. The difference from bq load is that the [time zone] component can be only

Z
UTC (same as Z)
(+|-)H[H][:M[M]]

Note that BigQuery supports up to 6 decimal places after the integer ‘second’ component. generate-schema follows the same restriction for compatibility. If your input file contains more than 6 decimal places, you need to write a data cleansing filter to fix this.

The suffix UTC is not standard ISO 8601 nor documented by Google but the UTC suffix is used by bq extract and the web interface. (See Issue #19.)

Timezone names from the tz database (e.g. “America/Los_Angeles”) are not supported by generate-schema.
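The restrictions above can be approximated with a regular expression. This is a sketch, not the pattern used by generate-schema. It assumes the layout is a date, then a space or T, then a time with an optional fraction of up to 6 digits, followed by an optional time zone of Z, UTC, or a numeric offset. The names TIMESTAMP and looks_like_timestamp are made up.

```python
import re

TIMESTAMP = re.compile(
    r"\d{4}-\d{1,2}-\d{1,2}"              # date: YYYY-[M]M-[D]D
    r"[T ]"                               # separator between date and time
    r"\d{1,2}:\d{1,2}:\d{1,2}"            # time: [H]H:[M]M:[S]S
    r"(\.\d{1,6})?"                       # at most 6 fractional digits
    r" *(Z|UTC|[+-]\d{1,2}(:\d{1,2})?)?"  # optional time zone
)


def looks_like_timestamp(text):
    """True if the whole string fits the assumed TIMESTAMP layout."""
    return TIMESTAMP.fullmatch(text) is not None


print(looks_like_timestamp("2017-05-22T17:10:00-07:00"))
print(looks_like_timestamp("2017-05-22 17:10:00 America/Los_Angeles"))
```

The first call returns True and the second returns False, since tz database names are not accepted.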

The following types are not supported at all:

BYTES
DATETIME (unable to distinguish from TIMESTAMP)

Type Inference Rules

The generate-schema script attempts to emulate the various type conversion and compatibility rules implemented by bq load:

INTEGER can upgrade to FLOAT
- if a field in an early record is an INTEGER, but a subsequent record shows this field to have a FLOAT value, the type of the field will be upgraded to a FLOAT
- the reverse does not happen: once a field is a FLOAT, it will remain a FLOAT
conflicting TIME, DATE, TIMESTAMP types upgrade to STRING
- if a field is determined to have one type of “time” in one record, then subsequently a different “time” type, then the field will be assigned a STRING type
NULLABLE RECORD can upgrade to a REPEATED RECORD
- a field may be defined as RECORD (aka “Struct”) type with { .. }
- if the field is subsequently read as an array with a [{ .. }], the field is upgraded to a REPEATED RECORD
a primitive type (FLOAT, INTEGER, STRING) cannot upgrade to a REPEATED primitive type
- there’s no technical reason why this cannot be allowed, but bq load does not support it, so we follow its behavior
a DATETIME field is always inferred to be a TIMESTAMP
- the format of these two fields is identical (in the absence of timezone)
- we follow the same logic as bq load and always infer these as TIMESTAMP
BOOLEAN, INTEGER, and FLOAT can appear inside quoted strings
- In other words, 'true' (or 'True' or 'false', etc) is considered a BOOLEAN type, '1' is considered an INTEGER type, and '2.1' is considered a FLOAT type. Luigi Mori (jtschichold@) added additional logic to replicate the type conversion logic used by bq load for these strings.
- This type inference inside quoted strings can be disabled using the --quoted_values_are_strings flag
- (See Issue #22 for more details.)
INTEGER values overflowing a 64-bit signed integer upgrade to FLOAT
- integers greater than 2^63-1 (9223372036854775807)
- integers less than -2^63 (-9223372036854775808)
- (See Issue #18 for more details)
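Three of the rules above (INTEGER to FLOAT, conflicting “time” types to STRING, and 64-bit overflow) can be sketched as follows. The functions number_type and merge_types are hypothetical and only cover those three rules. Any combination the list does not mention is returned as None here, which is a choice made for this sketch and not a statement about what the script does.

```python
TIME_TYPES = {"DATE", "TIME", "TIMESTAMP"}
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1


def number_type(value):
    """Type of a numeric value (assumed int or float): out-of-range ints become FLOAT."""
    if isinstance(value, float):
        return "FLOAT"
    return "INTEGER" if INT64_MIN <= value <= INT64_MAX else "FLOAT"


def merge_types(old, new):
    """Combine the type seen so far with the type seen in a later record."""
    if old == new:
        return old
    if {old, new} == {"INTEGER", "FLOAT"}:
        return "FLOAT"   # INTEGER upgrades to FLOAT, and FLOAT stays FLOAT
    if old in TIME_TYPES and new in TIME_TYPES:
        return "STRING"  # conflicting "time" types fall back to STRING
    return None          # not covered by the rules sketched here


print(merge_types("INTEGER", "FLOAT"))   # FLOAT
print(merge_types("DATE", "TIMESTAMP"))  # STRING
print(number_type(2**63))                # FLOAT
```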

Examples

Here is an example of a single JSON data record on the STDIN (the ^D below means typing Control-D, which indicates “end of file” under Linux and MacOS):

In most cases, the data file will be stored in a file:

Here is the schema generated from a CSV input file. The first line is the header containing the names of the columns, and the schema lists the columns in the same order as the header:

Here is an example of the schema generated with the --infer_mode flag:

Using As a Library

The bigquery_schema_generator module can be used as a library by external Python client code by creating an instance of SchemaGenerator and calling the run(input, output) method:

If you need to process the generated schema programmatically, use the deduce_schema() method and process the resulting schema_map and error_log data structures like this:
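A sketch of both styles of library use follows. Several details are assumptions that should be checked against the installed version: that SchemaGenerator is importable from bigquery_schema_generator.generate_schema, that deduce_schema() returns the schema map together with the list of errors, and that a flatten_schema() method turns the map into the final list. The wrapper functions write_schema and schema_as_list are made up for illustration, and the package must be installed before they are called.

```python
import logging


def write_schema(input_path, output_path):
    """Read a newline-delimited JSON file and write its schema as JSON."""
    # Assumed import path; imported here so this file loads without the package.
    from bigquery_schema_generator.generate_schema import SchemaGenerator

    generator = SchemaGenerator()
    with open(input_path) as input_file, open(output_path, "w") as output_file:
        generator.run(input_file, output_file)


def schema_as_list(input_path):
    """Return the deduced schema as a Python list, logging any problems found."""
    from bigquery_schema_generator.generate_schema import SchemaGenerator

    generator = SchemaGenerator()
    with open(input_path) as input_file:
        # Assumption: deduce_schema() returns (schema_map, error_logs).
        schema_map, error_logs = generator.deduce_schema(input_file)
    for error in error_logs:
        logging.warning("Problem in input data: %s", error)
    # Assumption: flatten_schema() converts the internal map to the schema list.
    return generator.flatten_schema(schema_map)
```

With the package installed, write_schema("file.data.json", "file.schema.json") would be the library counterpart of the command line usage described earlier.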

Benchmarks

I wrote the bigquery_schema_generator/anonymize.py script to create an anonymized data file tests/testdata/anon1.data.json.gz:

This data file is 290MB (5.6MB compressed) with 103080 data records.

Generating the schema using

took 67s on a Dell Precision M4700 laptop with an Intel Core i7-3840QM CPU @ 2.80GHz, 32GB of RAM, Ubuntu Linux 18.04, Python 3.6.7.

System Requirements

This project was initially developed on Ubuntu 17.04 using Python 3.5.3. I have tested it on:

Ubuntu 18.04, Python 3.6.7
Ubuntu 17.10, Python 3.6.3
Ubuntu 17.04, Python 3.5.3
Ubuntu 16.04, Python 3.5.2
MacOS 10.14.2, Python 3.6.4
MacOS 10.13.2, Python 3.6.4

Authors

Created by Brian T. Park (brian@xparks.net).
Type inference inside quoted strings by Luigi Mori (jtschichold@).
Flag to disable type inference inside quoted strings by Daniel Ecer (de-code@).
Support for CSV files and detection of REQUIRED fields by Sandor Korotkevics (korotkevics@).
Better support for using bigquery_schema_generator as a library from external Python code by StefanoG_ITA (StefanoGITA@).
Sanitizing of column names to valid BigQuery characters and length by Jon Warghed (jonwarghed@).

Download files

Files for bigquery-schema-generator, version 0.5.1

Filename, size: bigquery-schema-generator-0.5.1.tar.gz (25.4 kB)
File type: Source
Python version: None

Hashes for bigquery-schema-generator-0.5.1.tar.gz

SHA256: `bf02e747e7dfa7a393d1c5e981a3258c36dbda21998bbc4fc6020219190f1fca`
MD5: `1022895b5c8a4150e0e743ff420f434e`
BLAKE2-256: `55aea9b8a6ecc0c8224536b7ac3538522f1848c017268e79f5e3206026c64775`