I will try to summarize the issue as best as I can, but unfortunately I am not able to copy some data to illustrate it. I have a set of Parquet files partitioned by date, containing about 50 million records per day with about 800 columns. I have absolutely no issue querying this external table (through Hive) with Presto.

I am now trying to create a new table with Presto that takes the original Parquet files and keeps the partitioning, but adds buckets and a bloom filter. This is an internal Hive table, and here is the end of the CREATE TABLE in Hive:

[screenshot: end of the CREATE TABLE statement]

I then inserted data from my original table (Parquet files) with Presto, and this does not result in any error:

```sql
INSERT INTO xxxx_prod_log_orc SELECT * FROM xxxx_prod_log_parquet WHERE logdate=('')
```

The ORC files are created; however, if I now try to do a `SELECT *` with a predicate on that table, I get this:

```
Seek past end of stream
	at io.OrcPageSource$OrcBlockLoader.load(OrcPageSource.java:235)
	at io.OrcPageSource$OrcBlockLoader.load(OrcPageSource.java:209)
	at io.assureLoaded(LazyBlock.java:277)
	at io.getLoadedBlock(LazyBlock.java:268)
	at io.$recordMaterializedBytes$0(PageUtils.java:42)
	at io.$tupDictionaryBlockProjection(DictionaryAwarePageProjection.java:211)
	at io.$DictionaryAwarePageProjectionWork.lambda$getResult$0(DictionaryAwarePageProjection.java:197)
	at io.$ProjectSelectedPositions.processBatch(PageProcessor.java:341)
	at io.$ProjectSelectedPositions.process(PageProcessor.java:204)
	at io.$ProcessWorkProcessor.process(WorkProcessorUtils.java:373)
	at io.$flatten$6(WorkProcessorUtils.java:278)
	at io.$3.process(WorkProcessorUtils.java:320)
	at io.$3.process(WorkProcessorUtils.java:307)
	at io.(WorkProcessorUtils.java:221)
	at io.$processStateMonitor$2(WorkProcessorUtils.java:200)
	at io.$finishWhen$3(WorkProcessorUtils.java:215)
	at io.(WorkProcessorSourceOperatorAdapter.java:148)
	at io.(Driver.java:379)
	at io.$processFor$8(Driver.java:283)
	at io.(Driver.java:675)
	at io.(Driver.java:276)
	at io.$DriverSplitRunner.processFor(SqlTaskExecution.java:1075)
	at io.process(PrioritizedSplitRunner.java:163)
	at io.$n(TaskExecutor.java:484)
	at io.prestosql.$gen.Presto_318_20190829_120913_1.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Caused by: io.: Malformed ORC file. Seek past end of stream
	at io.seekToCheckpoint(CompressedOrcChunkLoader.java:95)
	at io.seekToCheckpoint(OrcInputStream.java:175)
	at io.seekToCheckpoint(ByteArrayInputStream.java:50)
	at io.seekToCheckpoint(ByteArrayInputStream.java:22)
	at io.openStream(CheckpointInputStreamSource.java:57)
	at io.openRowGroup(SliceDirectStreamReader.java:243)
	at io.readBlock(SliceDirectStreamReader.java:104)
	at io.readBlock(SliceStreamReader.java:77)
	at io.(OrcRecordReader.java:410)
	at io.OrcPageSource$OrcBlockLoader.load(OrcPageSource.java:231)
```

If I do a `SELECT column1, column2` with the same predicate, the query succeeds. I was thinking that maybe some of the data in the original Parquet file were creating the issue, so I created the exact same table but inserted the data with Hive/Tez (that process also completes without any error). When the data are inserted with Hive/Tez, I can query them from Presto without any error, and of course I get the same number of rows with both methods.

In the previous article I already wrote about splits generation (see Tez Internals #2 – Number of Map Tasks for Large ORC Files), and here I would like to share some more details. I have 143 GB of daily data for clicks, located in 33 files in ORC format:

```
$ aws s3 ls s3://cloudsqale/hive/clicks.db/event_dt=/ --summarize
```

Let's take a single file and examine its structure:

```
$ hive --orcfiledump s3://cloudsqale/hive/clicks.db/event_dt=/part-m-00029
```

You can see that the file has 66 stripes, ranging from 70 MB (50 MB for the last, tailing stripe) to 150 MB.
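For intuition about what "Seek past end of stream" signals: the stripe's row-group index records checkpoint offsets into each column stream, and the reader fails when a checkpoint points beyond the bytes actually present. Below is a minimal stdlib-only sketch of that bounds check; the function name and logic are illustrative, not Presto's actual implementation.

```python
import io

def seek_to_checkpoint(stream: io.BytesIO, offset: int) -> None:
    """Seek to a recorded row-group checkpoint, mimicking the bounds
    check that yields "Seek past end of stream" when the index and the
    actual stream contents disagree (illustrative only)."""
    end = stream.seek(0, io.SEEK_END)  # total bytes available in the stream
    if offset > end:
        raise ValueError("Malformed ORC file. Seek past end of stream")
    stream.seek(offset)

# A 100-byte "stream" with one valid and one corrupt checkpoint.
data = io.BytesIO(b"\x00" * 100)
seek_to_checkpoint(data, 64)      # fine: checkpoint is within the stream
try:
    seek_to_checkpoint(data, 250)  # corrupt: checkpoint is past the end
except ValueError as e:
    print(e)                       # prints: Malformed ORC file. Seek past end of stream
```

The point of the sketch is that the error is about inconsistency between the index and the stream data, which is why writing the same rows through a different writer (Hive/Tez) can make the problem disappear.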
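Since the new table adds a bloom filter, here is a hedged sketch of why that matters for the reader: with one bloom filter per row group, an equality predicate can skip any row group whose filter definitely does not contain the value. The class, sizes, and hash scheme below are illustrative only, not ORC's actual bloom filter encoding.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter, conceptually similar to the per-row-group
    filters ORC can store for selected columns (illustrative sizes)."""
    def __init__(self, bits: int = 1024, hashes: int = 3):
        self.bits, self.hashes = bits, hashes
        self.filter = 0  # bit set kept in a single integer

    def _positions(self, value: str):
        # Derive `hashes` bit positions from the value (illustrative scheme).
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, value: str) -> None:
        for p in self._positions(value):
            self.filter |= 1 << p

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.filter >> p & 1 for p in self._positions(value))

# One filter per row group: an equality predicate consults the filters
# and reads only the row groups that might contain the value.
row_groups = [["alice", "bob"], ["carol", "dave"]]
filters = []
for rg in row_groups:
    bf = BloomFilter()
    for v in rg:
        bf.add(v)
    filters.append(bf)

to_read = [i for i, bf in enumerate(filters) if bf.might_contain("carol")]
```

A bloom filter can return false positives but never false negatives, so skipping on a negative answer is always safe; that is what makes it useful for pruning row groups on high-cardinality columns.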