Wrong timestamp unit when reading from parquet #218

Open
ancher1912 opened this issue Oct 5, 2023 · 0 comments

ancher1912 commented Oct 5, 2023

When I write a Parquet file using Arrow and then read it back using DuckDB, the unit of the timestamp is incorrect. I'm not sure whether this is a duckdb-rs issue or a DuckDB issue, but I decided to submit it here first.

Code to generate the Parquet file:

use std::fs::File;
use std::sync::Arc;

use arrow::array::{BooleanArray, Float64Array, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

// Schema with an explicitly nanosecond-precision timestamp column.
let schema = Schema::new(vec![
    Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), true),
    Field::new("value", DataType::Float64, true),
    Field::new("valid", DataType::Boolean, true),
]);

let n = 1_000_000;
let timestamps: Vec<Option<i64>> = (0..n).map(Some).collect();
let values: Vec<Option<f64>> = (0..n).map(|x| Some((x as f64).sin())).collect();
let validities: Vec<Option<bool>> = vec![Some(true); n as usize];

let batch = RecordBatch::try_new(
    Arc::new(schema),
    vec![
        Arc::new(TimestampNanosecondArray::from(timestamps)),
        Arc::new(Float64Array::from(values)),
        Arc::new(BooleanArray::from(validities)),
    ],
)
.expect("Failed to make batch for writing to Parquet test");

let file = File::create("tvv.parquet").unwrap();
let props = WriterProperties::builder()
    .set_compression(Compression::UNCOMPRESSED)
    .build();
let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props)).unwrap();
writer.write(&batch).expect("Writing batch");
writer.close().unwrap();
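
To confirm what the writer actually stored, the Arrow schema can be printed straight from the file's own metadata. A minimal sketch, assuming the tvv.parquet file written above and the parquet crate's ParquetRecordBatchReaderBuilder:

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// Open the file written above and print the Arrow schema derived
// from the stored Parquet metadata; `time` should report
// Timestamp(Nanosecond, None) here.
let file = File::open("tvv.parquet").unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
println!("{:?}", builder.schema());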

Code to read the file:

use duckdb::Connection;

// Query the Parquet file through DuckDB and inspect the Arrow schema
// of the result.
let conn = Connection::open_in_memory().unwrap();
let mut stmt = conn.prepare("SELECT * FROM tvv.parquet").unwrap();
let query = stmt.query_arrow([]).unwrap();
let schema = query.get_schema();
println!("{:?}", schema);
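
To narrow down whether the unit change happens in DuckDB itself or in the duckdb-rs Arrow bridge, one could also ask DuckDB directly which type it infers from the file. A sketch using DESCRIBE and the regular row API (the first two columns of DESCRIBE output are the column name and its DuckDB type):

let conn = Connection::open_in_memory().unwrap();
let mut stmt = conn.prepare("DESCRIBE SELECT * FROM tvv.parquet").unwrap();
let mut rows = stmt.query([]).unwrap();
// If `time` is already reported as a microsecond TIMESTAMP here, the
// conversion happens in DuckDB's Parquet reader rather than in the
// duckdb-rs Arrow conversion.
while let Some(row) = rows.next().unwrap() {
    let name: String = row.get(0).unwrap();
    let ty: String = row.get(1).unwrap();
    println!("{}: {}", name, ty);
}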

When I read the Parquet file using a tool like Parquet Viewer, the schema is as I expect, with the Nanosecond unit. But the output of my code is:

Schema { fields: [Field { name: "time", data_type: Timestamp(Microsecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "valid", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }

I.e., the Microsecond unit.

Is this a bug? Or am I doing something wrong?
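
In the meantime, a possible workaround sketch is to cast the time column of the returned batches back to nanoseconds with arrow's cast kernel. Note this only restores the unit; any sub-microsecond precision would already be gone if DuckDB truncated while scanning:

use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};

// `query` is the iterator returned by query_arrow() above; each item
// is a RecordBatch whose first column is the microsecond timestamp.
for batch in query {
    let time_ns = cast(
        batch.column(0),
        &DataType::Timestamp(TimeUnit::Nanosecond, None),
    )
    .unwrap();
    println!("{:?}", time_ns.data_type());
}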
