Wrong timestamp unit when reading from parquet #218

Open
ancher1912 opened this issue Oct 5, 2023 · 0 comments

ancher1912 commented Oct 5, 2023

When I write a Parquet file using Arrow and then read it back using DuckDB, the unit of the timestamp is incorrect. I'm not sure whether this is a duckdb-rs issue or a DuckDB issue, but I decided to submit it here first.

Code to generate the Parquet file:

use std::fs::File;
use std::sync::Arc;

use arrow::array::{BooleanArray, Float64Array, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

// Schema with an explicitly nanosecond-precision timestamp column.
let schema = Schema::new(vec![
    Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), true),
    Field::new("value", DataType::Float64, true),
    Field::new("valid", DataType::Boolean, true),
]);

let n = 1_000_000;
let timestamps: Vec<Option<i64>> = (0..n).map(Some).collect();
let values: Vec<Option<f64>> = (0..n).map(|x| Some((x as f64).sin())).collect();
let validities: Vec<Option<bool>> = vec![Some(true); n as usize];

let batch = RecordBatch::try_new(
    Arc::new(schema),
    vec![
        Arc::new(TimestampNanosecondArray::from(timestamps)),
        Arc::new(Float64Array::from(values)),
        Arc::new(BooleanArray::from(validities)),
    ],
)
.expect("Failed to make batch for writing to Parquet test");

let file = File::create("tvv.parquet").unwrap();
let props = WriterProperties::builder()
    .set_compression(Compression::UNCOMPRESSED)
    .build();
let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props)).unwrap();
writer.write(&batch).expect("Writing batch");
writer.close().unwrap();
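
To confirm what the writer actually stored, the Arrow schema can be printed straight from the file's own metadata. A minimal sketch, assuming the tvv.parquet file written above and the parquet crate's ParquetRecordBatchReaderBuilder:

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// Open the file written above and print the Arrow schema derived
// from the stored Parquet metadata; `time` should report
// Timestamp(Nanosecond, None) here.
let file = File::open("tvv.parquet").unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
println!("{:?}", builder.schema());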

Code to read the file:

use duckdb::Connection;

// Query the Parquet file through DuckDB and inspect the Arrow schema
// of the result.
let conn = Connection::open_in_memory().unwrap();
let mut stmt = conn.prepare("SELECT * FROM tvv.parquet").unwrap();
let query = stmt.query_arrow([]).unwrap();
let schema = query.get_schema();
println!("{:?}", schema);
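
To narrow down whether the unit change happens in DuckDB itself or in the duckdb-rs Arrow bridge, one could also ask DuckDB directly which type it infers from the file. A sketch using DESCRIBE and the regular row API (the first two columns of DESCRIBE output are the column name and its DuckDB type):

let conn = Connection::open_in_memory().unwrap();
let mut stmt = conn.prepare("DESCRIBE SELECT * FROM tvv.parquet").unwrap();
let mut rows = stmt.query([]).unwrap();
// If `time` is already reported as a microsecond TIMESTAMP here, the
// conversion happens in DuckDB's Parquet reader rather than in the
// duckdb-rs Arrow conversion.
while let Some(row) = rows.next().unwrap() {
    let name: String = row.get(0).unwrap();
    let ty: String = row.get(1).unwrap();
    println!("{}: {}", name, ty);
}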

When I read the Parquet file using a tool like Parquet Viewer, the schema is as I expect, with the Nanosecond unit. But the output of my code is:

Schema { fields: [Field { name: "time", data_type: Timestamp(Microsecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "valid", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }

I.e., the Microsecond unit.

Is this a bug? Or am I doing something wrong?
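
In the meantime, a possible workaround sketch is to cast the time column of the returned batches back to nanoseconds with arrow's cast kernel. Note this only restores the unit; any sub-microsecond precision would already be gone if DuckDB truncated while scanning:

use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};

// `query` is the iterator returned by query_arrow() above; each item
// is a RecordBatch whose first column is the microsecond timestamp.
for batch in query {
    let time_ns = cast(
        batch.column(0),
        &DataType::Timestamp(TimeUnit::Nanosecond, None),
    )
    .unwrap();
    println!("{:?}", time_ns.data_type());
}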
